Prosecution Insights
Last updated: April 19, 2026
Application No. 18/208,190

GENERATING SUMMARY PROMPTS WITH VISUAL AND AUDIO INSIGHTS AND USING SUMMARY PROMPTS TO OBTAIN MULTIMEDIA CONTENT SUMMARIES

Status: Final Rejection (§103)
Filed: Jun 09, 2023
Examiner: ADESANYA, OLUJIMI A
Art Unit: 2658
Tech Center: 2600 — Communications
Assignee: Microsoft Technology Licensing, LLC
OA Round: 2 (Final)

Grant Probability: 66% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 6m
Grant Probability with Interview: 91%

Examiner Intelligence

Career Allow Rate: 66% (430 granted / 655 resolved; +3.6% vs TC avg — above average)
Interview Lift: +25.5% (strong), measured across resolved cases with an interview
Typical Timeline: 3y 6m average prosecution; 35 applications currently pending
Career History: 690 total applications across all art units
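How the headline figures above appear to fit together, as a minimal Python sketch. The formulas are assumptions inferred from the displayed numbers (an additive interview lift in percentage points, and the TC average backed out from the +3.6% delta); the page does not state how it computes them.

granted, resolved = 430, 655

allow_rate = granted / resolved                         # 0.656... -> displayed as 66%
interview_lift_pp = 25.5                                # reported lift, in percentage points
with_interview = allow_rate * 100 + interview_lift_pp   # 65.6 + 25.5 = 91.1 -> 91%
implied_tc_avg = allow_rate * 100 - 3.6                 # implied Tech Center average ~62.0%

print(f"career allow rate  {allow_rate:.1%}")           # 65.6%
print(f"with interview     {with_interview:.0f}%")      # 91%
print(f"implied TC average {implied_tc_avg:.1f}%")      # 62.0%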

Statute-Specific Performance

§101: 19.3% (-20.7% vs TC avg)
§102: 17.7% (-22.3% vs TC avg)
§103: 40.6% (+0.6% vs TC avg)
§112: 12.9% (-27.1% vs TC avg)
Deltas are measured against Tech Center average estimates; based on career data from 655 resolved cases.
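A quick consistency check on those deltas, assuming each delta is simply the examiner's rate minus the Tech Center average (the page does not define the deltas, so this is an inference). Notably, every statute reconstructs to the same baseline:

# Reconstruct the implied Tech Center averages from the reported deltas.
# Assumption: delta = examiner rate - TC average, in percentage points.
examiner_rate = {"101": 19.3, "102": 17.7, "103": 40.6, "112": 12.9}
delta_vs_tc   = {"101": -20.7, "102": -22.3, "103": 0.6, "112": -27.1}

for statute in examiner_rate:
    tc_avg = examiner_rate[statute] - delta_vs_tc[statute]
    print(f"§{statute}: TC avg ≈ {tc_avg:.1f}%")
# Every statute implies a TC average of 40.0%, suggesting the page uses a
# single TC-wide baseline estimate rather than per-statute averages.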

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 12/1/25 have been fully considered but they are not persuasive.

Regarding claim 9, Applicant argues that Schalkwyk fails to disclose "visual insights that include object labels for objects visualized in the visual content, and identity labels for people represented in the visual content," i.e., the limitations "applying one or more image processing models to the visual content to thereby generate visual insights from the visual content, the visual insights including (ii) object labels for objects visualized in the visual content, the object labels being generated by one or more object identification models of the one or more image processing models, and (iii) identity labels for people represented in the visual content, the people being identified by one or more facial recognition models of the one or more image processing models" (Arguments, pg. 31, second para. – pg. 33, second para.). Examiner respectfully disagrees. Schalkwyk discloses its object recognition (OCR) module 240 as being configured to recognize any text or symbols present in image frames of the image data using a trained OCR machine learning model (para. [0037]; para. [0042]), where objects include texts/words (para. [0031]), corresponding to the limitation "applying one or more image processing models to the visual content to thereby generate visual insights from the visual content, the visual insights including (ii) object labels for objects visualized in the visual content, the object labels being generated by one or more object identification models of the one or more image processing models". Schalkwyk also discloses its diarization module 220 generating diarization results 224 that include a corresponding speaker label 226 assigned to each audio segment 222 based on audio data 122 of audio-visual content 120, and using a face tracking routine to identify which participant is speaking during a segment 222 (para. [0038]-[0039]), corresponding to the limitation "the visual insights including (iii) identity labels for people represented in the visual content, the people being identified by one or more facial recognition models of the one or more image processing models".

Regarding claim 15, Applicant argues that Lin discloses removing duplicate photos/images in the context of image curation and fails to disclose the limitation "generating an aggregated timeline of the audio insights and the visual insights by temporally aligning the audio insights and the visual insights, and by removing duplicate visual insights when generating the visual insights from the visual content or by removing duplicate audio insights when generating the audio insights from the audio content;" (Arguments, pg. 33, third para. – pg. 35, second para.). Examiner respectfully disagrees. In response to Applicant's argument that Lin's invention is used in the context of image curation, a recitation of the intended use of the claimed invention must result in a structural difference between the claimed invention and the prior art in order to patentably distinguish the claimed invention from the prior art. If the prior art structure is capable of performing the intended use, then it meets the claim. In this case, Lin provides for video/audio-visual summarization via image curation.
Schalkwyk, which is also involved in video/audio-visual summarization, is already applied to teach the limitation "generating an aggregated timeline of the audio insights and the visual insights by temporally aligning the audio insights and the visual insights" (para. [0038]-[0039]). What Schalkwyk does not explicitly disclose is "generating an aggregated timeline of the audio insights and the visual insights by removing duplicate visual insights when generating the visual insights from the visual content or by removing duplicate audio insights when generating the audio insights from the audio content;". Lin discloses this limitation. Lin discloses analyzing labeled images using its curation application to identify persons/identity labels/visual insights in the images (para. [0038]; para. [0045]), and removing duplicate images having the identified persons based on a completeness of the important digital images involving the identified persons prior to generation of a final curation (para. [0038]; para. [0077]), corresponding to the limitation "generating an aggregated timeline of the audio insights and the visual insights by removing duplicate visual insights when generating the visual insights from the visual content or by removing duplicate audio insights when generating the audio insights from the audio content;".

Applicant's arguments with respect to claim 1 and reference Schalkwyk not disclosing the limitation "the audio insights comprising (ii) nonspeech sound labels corresponding to nonspeech sounds, the nonspeech sound labels being generated using one or more nonspeech sound models of the one or more audio processing models" (Arguments, pg. 28, fifth para. – pg. 31, first para.), and arguments with respect to claim 9 and reference Schalkwyk not disclosing the limitation "the visual insights including (i) text visualized in the visual content, the text being identified by one or more OCR (Optical Character Recognition) models of the one or more image processing models," (Arguments, pg. 31, second para. – pg. 33, second para.), have been considered but are moot in light of the new grounds of rejection with reference Mohanty as presented below.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

1. Claims 1-6, 8-12 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Schalkwyk et al., US 2023/0281248 A1 ("Schalkwyk"), in view of Mohanty et al., US 2023/0403174 A1 ("Mohanty"), and Liu et al., US 2024/0177084 ("Liu").

Per claim 1, Schalkwyk discloses a method for generating summary prompts from multimedia content, the method comprising: accessing the multimedia content, the multimedia content comprising audio content and visual content (a system 100 includes a user 2 viewing a content feed 120 played back on a computing/user device 10 through a media player application 150. … In the example shown, the content feed 120 includes a recorded instructional cooking video played back on the computing device 10 for the user 2 to view and interact with. While examples herein depict the content feed 120 as an audio-visual (AV) feed (e.g., a video) …, para. [0028]); applying one or more audio processing models to the audio content to thereby generate audio insights from the audio content, the audio insights comprising (i) a coherent transcript that comprises textual representations of spoken utterances contained in the audio content and speaker identifications for the spoken utterances, the coherent transcript being generated using one or more speech-to-text models of the one or more audio processing models (the ASR module 230 and/or the diarization module 220 (or some other component of the application 150) may index a transcription 310 of the audio data 122 using the time-stamped speaker labels 226 predicted for each segment 222 obtained from the diarization results 224. As shown in FIG. 2, the transcription 310 for the content feed 120 may be indexed by speaker to associate portions of the transcript 202 with the respective speaker …, para. [0039]); applying one or more image processing models to the visual content to thereby generate visual insights from the visual content, the visual insights including at least one of (i) text visualized in the visual content, the text being identified by one or more OCR (Optical Character Recognition) models of the one or more image processing models, (ii) object labels for objects visualized in the visual content, the object labels being generated by one or more object identification models of the one or more image processing models, and (iii) identity labels for people represented in the visual content, the people being identified by one or more facial recognition models of the one or more image processing models (generate diarization results 224 that include a corresponding speaker label 226 assigned to each segment 222 using a probability model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally the image data 124). … the diarization module 220 may simultaneously execute a face tracking routine to identify which participant is speaking during which segment 222 …, para. [0038]-[0039]); generating coherent segments of an aggregated timeline of the audio insights and the visual insights, the aggregated timeline comprising a temporal alignment of the audio insights and the visual insights, each of the coherent segments including a unique combination of audio insights and visual insights (segment the audio data 122 into a plurality of segments 222, 222a-n (e.g., fixed-length segments or variable-length segments), and generate diarization results 224 that include a corresponding speaker label 226 assigned to each segment 222 using a probability model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally the image data 124) …, para. [0038]; the ASR module 230 and/or the diarization module 220 (or some other component of the application 150) may index a transcription 310 of the audio data 122 using the time-stamped speaker labels 226 predicted for each segment 222 obtained from the diarization results 224. As shown in FIG. 2, the transcription 310 for the content feed 120 may be indexed by speaker to associate portions of the transcript 202 with the respective speaker, para. [0039]); grouping the coherent segments into a set of chunks (fig. 2, elements 204, 224; para. [0036]; segment the audio data 122 into a plurality of segments 222, 222a-n (e.g., fixed-length segments or variable-length segments) …, para. [0038]-[0039]; para. [0044]; The structured document 300 may further provide a summary 410 of relevant chapters/sections/scenes of the audio-visual feed 120 for presentation in the video interface 400 …, para. [0054]); and generating a summary prompt for each chunk in the set of chunks for submission to a model to thereby generate a summary for each chunk in the set of chunks, the model trained to generate summaries from summary prompts, the summary prompt being based on (i) the audio insights and visual insights of the coherent segments of each chunk (fig. 2, elements 180, 300; para. [0036]; the large language model 180 is configured to receive the semantically-rich, structured document 300 and the query 112 issued by the user 2 as input … the large language model 182 may perform other generative tasks such as generating natural language text that summarizes one or more portions of the structured document 300, para. [0047]; summary input into large language model 180 as summary prompt).

Schalkwyk does not explicitly disclose the audio insights comprising "(ii) nonspeech sound labels corresponding to nonspeech sounds, the nonspeech sound labels being generated using one or more nonspeech sound models of the one or more audio processing models". However, this feature is taught by Mohanty (para. [0041]; audio processing module 210 may timestamp the speech segments and the non-speech segments of the audio stream to identify when the individual speech segments occurred …, para. [0042]).

Schalkwyk in view of Mohanty does not explicitly disclose grouping the coherent segments into a set of chunks based on a predetermined prompt size, identifying a selected summary style that is selected from a plurality of different summary styles, or generating a summary prompt for each chunk in the set of chunks based on (ii) the selected summary style. However, these features are taught by Liu: grouping the coherent segments into a set of chunks based on a predetermined prompt size (para. [0074]); identifying a selected summary style that is selected from a plurality of different summary styles (method 1200 may also request the LLM to create 1214 a longer summary of the combined summarized transcripts. The summaries of different lengths give users different options for the amount of detail they want in a summary, para. [0086]); and generating a summary prompt for each chunk in the set of chunks based on (ii) the selected summary style (An example prompt for this type of transcript summarization is provided below, para. [0080]; Only return a one-paragraph summary—do not include any titles or headings, para. [0081]; If the method determines that the combined summary information is useful, the combined summarized transcripts are provided to the LLM to create 1212 a short summary of the combined summarized transcripts …, para. [0084]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Mohanty with the method of Schalkwyk in arriving at the missing features of Schalkwyk, as well as to combine the teachings of Liu with the method of Schalkwyk in view of Mohanty in arriving at the missing features of Schalkwyk in view of Mohanty, because such a combination would have allowed users an efficient and comprehensive review of the salient information of the meeting in a concise manner (Mohanty, para. [0023]; para. [0041]) and would have identified important information of a meeting according to the amount of detail preferred by a user (Liu, para. [0062]; para. [0086]).

Per claim 2, Schalkwyk in view of Mohanty and Liu discloses the method of claim 1. Liu further discloses wherein the method further comprises providing the summary prompt for each chunk to a model trained to generate summaries from summary prompts (para. [0072]; An example prompt for this type of transcript summarization is provided below, para. [0080]; Only return a one-paragraph summary—do not include any titles or headings, para. [0081]; If the method determines that the combined summary information is useful, the combined summarized transcripts are provided to the LLM to create 1212 a short summary of the combined summarized transcripts …, para. [0084]).

Per claim 3, Schalkwyk in view of Mohanty and Liu discloses the method of claim 2. Liu further discloses wherein the model comprises a large language model (LLM) (para. [0084]).

Per claim 4, Schalkwyk in view of Mohanty and Liu discloses the method of claim 2. Liu further discloses wherein the method further comprises obtaining a plurality of summaries from the model comprising a summary for each chunk and combining the plurality of summaries into a new summary prompt (The method continues by summarizing 1204 each of the meeting transcripts using an LLM …, para. [0080]; method 1200 may also request the LLM to create 1214 a longer summary of the combined summarized transcripts …, para. [0086]-[0087]).

Per claim 5, Schalkwyk in view of Mohanty and Liu discloses the method of claim 4. Liu further discloses wherein the method further comprises providing the new summary prompt to the model and obtaining a new summary from the model in response to providing the new summary prompt to the model (The method continues by summarizing 1204 each of the meeting transcripts using an LLM …, para. [0080]; method 1200 may also request the LLM to create 1214 a longer summary of the combined summarized transcripts …, para. [0086]-[0087]).

Per claim 6, Schalkwyk in view of Mohanty and Liu discloses the method of claim 1. Schalkwyk further discloses wherein the method further comprises generating the audio insights by at least performing speech-to-text and diarization processing on the audio content (the ASR module 230 and/or the diarization module 220 (or some other component of the application 150) may index a transcription 310 of the audio data 122 using the time-stamped speaker labels 226 predicted for each segment 222 obtained from the diarization results 224. As shown in FIG. 2, the transcription 310 for the content feed 120 may be indexed by speaker to associate portions of the transcript 202 with the respective speaker …, para. [0039]).
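Editor's note: to make the pipeline recited in claims 1-6 easier to follow, here is a minimal, hypothetical Python sketch of the claimed flow: timestamped audio and visual insights are aligned into one timeline, coherent segments are grouped into chunks bounded by a predetermined prompt size, and one summary prompt is built per chunk. Every name, and the character-count stand-in for prompt size, is an illustrative assumption, not the applicant's or any cited reference's actual implementation.

from dataclasses import dataclass

@dataclass
class Insight:
    start: float   # seconds into the media
    end: float
    kind: str      # "transcript", "nonspeech", "ocr", "object", "identity"
    text: str

def aggregate_timeline(audio_insights, visual_insights):
    # Temporal alignment: merge both insight streams into a single
    # timeline ordered by start time.
    return sorted(audio_insights + visual_insights, key=lambda i: (i.start, i.end))

def chunk_segments(coherent_segments, max_prompt_chars=4000):
    # Group coherent segments (each a list of Insights) into chunks bounded
    # by a predetermined prompt size; characters stand in for tokens here.
    chunks, current, size = [], [], 0
    for segment in coherent_segments:
        seg_len = sum(len(i.text) for i in segment)
        if current and size + seg_len > max_prompt_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(segment)
        size += seg_len
    if current:
        chunks.append(current)
    return chunks

def build_summary_prompt(chunk, style="one-paragraph"):
    # One summary prompt per chunk, based on the chunk's insights and a
    # selected summary style.
    lines = [f"[{i.start:.0f}-{i.end:.0f}s] {i.kind}: {i.text}"
             for segment in chunk for i in segment]
    return f"Write a {style} summary of this content:\n" + "\n".join(lines)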
Per claim 8, Schalkwyk in view of Mohanty and Liu discloses the method of claim 1. Schalkwyk further discloses wherein the method further comprises linking two temporally adjacent chunks in the set of chunks with a linking segment from the coherent segments by including the linking segment into both of the two temporally adjacent chunks (the diarization results 224 provide time-stamped speaker labels 226, 226a-n for the received audio data 122 that not only identify who is speaking during a given segment 222, but also identify when speaker changes occur between adjacent segments 222 …, fig. 2, elements 204, 224; para. [0038]-[0039]; speaker change as included between adjacent segments).

Per claim 9, Schalkwyk discloses a method for generating a summary of multimedia content, the method comprising: accessing the multimedia content, the multimedia content comprising audio content and visual content (a system 100 includes a user 2 viewing a content feed 120 played back on a computing/user device 10 through a media player application 150. … In the example shown, the content feed 120 includes a recorded instructional cooking video played back on the computing device 10 for the user 2 to view and interact with. While examples herein depict the content feed 120 as an audio-visual (AV) feed (e.g., a video) …, para. [0028]); applying one or more audio processing models to the audio content to thereby generate audio insights from the audio content, the audio insights comprising at least one of (i) a coherent transcript that comprises textual representations of spoken utterances contained in the audio content and speaker identifications for the spoken utterances, the coherent transcript being generated using one or more speech-to-text models of the one or more audio processing models, or (ii) nonspeech sound labels corresponding to nonspeech sounds, the nonspeech sound labels being generated using one or more nonspeech sound models of the one or more audio processing models (the ASR module 230 and/or the diarization module 220 (or some other component of the application 150) may index a transcription 310 of the audio data 122 using the time-stamped speaker labels 226 predicted for each segment 222 obtained from the diarization results 224. As shown in FIG. 2, the transcription 310 for the content feed 120 may be indexed by speaker to associate portions of the transcript 202 with the respective speaker …, para. [0039]); applying one or more image processing models to the visual content to thereby generate visual insights from the visual content, the visual insights including (ii) object labels for objects visualized in the visual content, the object labels being generated by one or more object identification models of the one or more image processing models, and (iii) identity labels for people represented in the visual content, the people being identified by one or more facial recognition models of the one or more image processing models (para. [0031]; the document structurer 200 also processes the image data 124 to determine whether any creator-provided text 320 is recognized in the image data 124. In these implementations, the structured document 300 generated by the document structurer 200 will also include any creator-provided text 320 recognized in one or more image frames 125a-n (FIG. 2) of the image data 124 …, para. [0035]; an object character recognition (OCR) module 240 …, para. [0037]; generate diarization results 224 that include a corresponding speaker label 226 assigned to each segment 222 using a probability model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally the image data 124). … the diarization module 220 may simultaneously execute a face tracking routine to identify which participant is speaking during which segment 222 …, para. [0038]-[0039]; The OCR module 240 is configured to recognize any creator-provided text 320 that may be present in one or more image frames 125a-n of the image data 124. The OCR module 240 may include an OCR machine learning model (e.g., recognizer) 244 trained to recognize any creator-provided text 320 in each image frame 125 …, para. [0042]; module 240 as performing object recognition); generating an aggregated timeline of the audio insights and the visual insights by temporally aligning the audio insights and the visual insights (segment the audio data 122 into a plurality of segments 222, 222a-n (e.g., fixed-length segments or variable-length segments), and generate diarization results 224 that include a corresponding speaker label 226 assigned to each segment 222 using a probability model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally the image data 124) …, para. [0038]-[0039]); segmenting the aggregated timeline into coherent segments, each of the coherent segments including a unique combination of audio insights and visual insights (segment the audio data 122 into a plurality of segments 222, 222a-n (e.g., fixed-length segments or variable-length segments), and generate diarization results 224 that include a corresponding speaker label 226 assigned to each segment 222 using a probability model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally the image data 124) …, para. [0038]; the ASR module 230 and/or the diarization module 220 (or some other component of the application 150) may index a transcription 310 of the audio data 122 using the time-stamped speaker labels 226 predicted for each segment 222 obtained from the diarization results 224. As shown in FIG. 2, the transcription 310 for the content feed 120 may be indexed by speaker to associate portions of the transcript 202 with the respective speaker, para. [0039]); grouping the coherent segments into a set of chunks (fig. 2, elements 204, 224; para. [0036]; segment the audio data 122 into a plurality of segments 222, 222a-n (e.g., fixed-length segments or variable-length segments) …, para. [0038]-[0039]; para. [0044]; The structured document 300 may further provide a summary 410 of relevant chapters/sections/scenes of the audio-visual feed 120 for presentation in the video interface 400 …, para. [0054]); and generating a summary prompt for each chunk in the set of chunks based on (i) the audio insights and visual insights of the coherent segments of each chunk (fig. 2, elements 180, 300; para. [0036]; the large language model 180 is configured to receive the semantically-rich, structured document 300 and the query 112 issued by the user 2 as input …, para. [0047]; summary input into large language model 180 as summary prompt).

Schalkwyk does not explicitly disclose the visual insights including "(i) text visualized in the visual content, the text being identified by one or more OCR (Optical Character Recognition) models of the one or more image processing models". However, this feature is taught by Mohanty (para. [0054]-[0055]).

Schalkwyk in view of Mohanty does not explicitly disclose grouping the coherent segments into a set of chunks based on a predetermined prompt size, identifying a selected summary style that is selected from a plurality of different summary styles, generating a summary prompt for each chunk in the set of chunks based on (ii) the selected summary style, providing the summary prompt for each chunk to a model trained to generate summaries from summary prompts, obtaining a plurality of summaries from the model comprising a separate summary for each summary prompt received in response to providing each summary prompt to the model, or combining the plurality of summaries into a single summary. However, these features are taught by Liu: grouping the coherent segments into a set of chunks based on a predetermined prompt size (para. [0074]); identifying a selected summary style that is selected from a plurality of different summary styles (method 1200 may also request the LLM to create 1214 a longer summary of the combined summarized transcripts. The summaries of different lengths give users different options for the amount of detail they want in a summary, para. [0086]); generating a summary prompt for each chunk in the set of chunks based on (ii) the selected summary style (An example prompt for this type of transcript summarization is provided below, para. [0080]; Only return a one-paragraph summary—do not include any titles or headings, para. [0081]; If the method determines that the combined summary information is useful, the combined summarized transcripts are provided to the LLM to create 1212 a short summary of the combined summarized transcripts …, para. [0084]); providing the summary prompt for each chunk to a model trained to generate summaries from summary prompts (para. [0072]; An example prompt for this type of transcript summarization is provided below, para. [0080]; Only return a one-paragraph summary—do not include any titles or headings, para. [0081]; If the method determines that the combined summary information is useful, the combined summarized transcripts are provided to the LLM to create 1212 a short summary of the combined summarized transcripts …, para. [0084]); obtaining a plurality of summaries from the model comprising a separate summary for each summary prompt received in response to providing each summary prompt to the model (para. [0087]); and combining the plurality of summaries into a single summary (para. [0087]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Mohanty with the method of Schalkwyk in arriving at the missing features of Schalkwyk, as well as to combine the teachings of Liu with the method of Schalkwyk in view of Mohanty in arriving at the missing features of Schalkwyk in view of Mohanty, because such a combination would have allowed users an efficient and comprehensive review of the salient information of the meeting in a concise manner (Mohanty, para. [0023]; para. [0041]) and would have identified important information of a meeting according to the amount of detail preferred by a user (Liu, para. [0062]; para. [0086]).
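Editor's note: claims 4-5, 8, 10, and 14 add two refinements: temporally adjacent chunks share a linking segment, and the per-chunk summaries are combined into a new prompt that yields a single final summary. A hypothetical continuation of the sketch above (summarize() is a stub standing in for any model call, not a real API):

def link_adjacent_chunks(chunks):
    # Include the last coherent segment of each chunk at the head of the
    # next chunk, so temporally adjacent chunks share a linking segment.
    if not chunks:
        return []
    linked = [chunks[0]]
    for prev, nxt in zip(chunks, chunks[1:]):
        linked.append([prev[-1]] + nxt)
    return linked

def summarize(prompt):
    # Placeholder for a model trained to generate summaries from summary
    # prompts (e.g., an LLM); returns a stub string here.
    return f"<summary of {len(prompt)} chars>"

def single_summary(chunks, style="one-paragraph"):
    # Map: one summary per chunk. Reduce: combine those summaries into a
    # new summary prompt and obtain one final summary from the model.
    partials = [summarize(build_summary_prompt(c, style)) for c in chunks]
    new_prompt = f"Combine into one {style} summary:\n" + "\n".join(partials)
    return summarize(new_prompt)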
Per claim 10, Schalkwyk in view of Mohanty and Liu discloses the method of claim 9. Liu further discloses wherein the combining the plurality of summaries into a single summary comprises (i) combining the plurality of summaries into a new summary prompt, (ii) providing the new summary prompt to the model, and (iii) obtaining a new summary comprising the single summary from the model in response to providing the new summary prompt to the model (The method continues by summarizing 1204 each of the meeting transcripts using an LLM …, para. [0080]; method 1200 may also request the LLM to create 1214 a longer summary of the combined summarized transcripts …, para. [0086]-[0087]).

Per claim 11, Schalkwyk in view of Mohanty and Liu discloses the method of claim 10. Liu further discloses wherein the model comprises a large language model (LLM) (para. [0084]).

Per claim 12, Schalkwyk in view of Mohanty and Liu discloses the method of claim 9. Schalkwyk further discloses wherein the method further comprises generating the audio insights by at least performing speech-to-text and diarization processing on the audio content (the ASR module 230 and/or the diarization module 220 (or some other component of the application 150) may index a transcription 310 of the audio data 122 using the time-stamped speaker labels 226 predicted for each segment 222 obtained from the diarization results 224. As shown in FIG. 2, the transcription 310 for the content feed 120 may be indexed by speaker to associate portions of the transcript 202 with the respective speaker …, para. [0039]).

Per claim 14, Schalkwyk in view of Mohanty and Liu discloses the method of claim 9. Schalkwyk further discloses wherein the method further comprises linking two temporally adjacent chunks in the set of chunks with a linking segment from the coherent segments by including the linking segment into both of the two temporally adjacent chunks (the diarization results 224 provide time-stamped speaker labels 226, 226a-n for the received audio data 122 that not only identify who is speaking during a given segment 222, but also identify when speaker changes occur between adjacent segments 222 …, fig. 2, elements 204, 224; para. [0038]-[0039]; speaker change as included between adjacent segments).

2. Claims 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Schalkwyk in view of Liu and Lin, US 2017/0357877 A1 ("Lin").

Per claim 15, Schalkwyk discloses a method for generating a summary of multimedia content, the method comprising: accessing the multimedia content, the multimedia content comprising audio content and visual content (a system 100 includes a user 2 viewing a content feed 120 played back on a computing/user device 10 through a media player application 150. … In the example shown, the content feed 120 includes a recorded instructional cooking video played back on the computing device 10 for the user 2 to view and interact with. While examples herein depict the content feed 120 as an audio-visual (AV) feed (e.g., a video) …, para. [0028]); applying one or more audio processing models to the audio content to thereby generate audio insights from the audio content, the audio insights comprising at least one of (i) a coherent transcript that comprises textual representations of spoken utterances contained in the audio content and speaker identifications for the spoken utterances, the coherent transcript being generated using one or more speech-to-text models of the one or more audio processing models, or (ii) nonspeech sound labels corresponding to nonspeech sounds, the nonspeech sound labels being generated using one or more nonspeech sound models of the one or more audio processing models (the ASR module 230 and/or the diarization module 220 (or some other component of the application 150) may index a transcription 310 of the audio data 122 using the time-stamped speaker labels 226 predicted for each segment 222 obtained from the diarization results 224. As shown in FIG. 2, the transcription 310 for the content feed 120 may be indexed by speaker to associate portions of the transcript 202 with the respective speaker …, para. [0039]); applying one or more image processing models to the visual content to thereby generate visual insights from the visual content, the visual insights including at least one of (i) text visualized in the visual content, the text being identified by one or more OCR (Optical Character Recognition) models of the one or more image processing models, (ii) object labels for objects visualized in the visual content, the object labels being generated by one or more object identification models of the one or more image processing models, and (iii) identity labels for people represented in the visual content, the people being identified by one or more facial recognition models of the one or more image processing models (para. [0035]; para. [0037]; generate diarization results 224 that include a corresponding speaker label 226 assigned to each segment 222 using a probability model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally the image data 124). … the diarization module 220 may simultaneously execute a face tracking routine to identify which participant is speaking during which segment 222 …, para. [0038]-[0039]); generating an aggregated timeline of the audio insights and the visual insights by temporally aligning the audio insights and the visual insights (segment the audio data 122 into a plurality of segments 222, 222a-n (e.g., fixed-length segments or variable-length segments), and generate diarization results 224 that include a corresponding speaker label 226 assigned to each segment 222 using a probability model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally the image data 124) …, para. [0038]-[0039]); generating a plurality of extractive summary sentences from the aggregated timeline (fig. 4; para. [0054]); and generating a summary prompt by combining the plurality of extractive summary sentences into the summary prompt (fig. 2, elements 180, 300; fig. 3; para. [0036]; the large language model 180 is configured to receive the semantically-rich, structured document 300 and the query 112 issued by the user 2 as input …, para. [0047]; summary input into large language model 180 as summary prompt).

Schalkwyk does not explicitly disclose providing the summary prompt to a model trained to generate summaries from summary prompts or obtaining a summary from the model based on the plurality of extractive summary sentences that is received in response to providing the summary prompt to the model. However, these features are taught by Liu: providing the summary prompt to a model trained to generate summaries from summary prompts (para. [0072]; An example prompt for this type of transcript summarization is provided below, para. [0080]; Only return a one-paragraph summary—do not include any titles or headings, para. [0081]; If the method determines that the combined summary information is useful, the combined summarized transcripts are provided to the LLM to create 1212 a short summary of the combined summarized transcripts …, para. [0084]); and obtaining a summary from the model based on the plurality of extractive summary sentences that is received in response to providing the summary prompt to the model (para. [0087]).

Schalkwyk in view of Liu does not explicitly disclose generating an aggregated timeline of the audio insights and the visual insights by removing duplicate visual insights when generating the visual insights from the visual content or by removing duplicate audio insights when generating audio insights from the audio content. However, this feature is taught by Lin (para. [0026]; The "diversity" as used herein pertains to a completeness of the important digital images … The curation application 102 can remove duplicate ones of the representative digital images 112 based on the determined diversity …, para. [0038]; the curation application 102 can implement a face detection algorithm 204 that detects one or more faces in each of the digital images 106 that include at least one face … The importance rating 110 is a rating that indicates the importance of a digital image 106 in the context of the event type, such as a digital image that includes faces of one or more persons who are important to an event (e.g., the bride and groom …, para. [0041]; para. [0077]).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Liu with the method of Schalkwyk in arriving at the missing features of Schalkwyk, as well as to combine the teachings of Lin with the method of Schalkwyk in view of Liu in arriving at the missing features of Schalkwyk in view of Liu, because such a combination would have identified important information of a meeting according to the amount of detail preferred by a user (Liu, para. [0062]; para. [0086]) and would have ensured that a collection of visual insights is not oversized (Lin, para. [0001]; para. [0026]; para. [0038]; para. [0041]).

Per claim 16, Schalkwyk in view of Liu and Lin discloses the method of claim 15. Liu further discloses wherein generating the summary prompt further comprises identifying a selected summary style that is selected from a plurality of different summary styles and including an identification of the selected summary style in the summary prompt (para. [0080]-[0084]; method 1200 may also request the LLM to create 1214 a longer summary of the combined summarized transcripts. The summaries of different lengths give users different options for the amount of detail they want in a summary, para. [0086]).
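Editor's note: claims 15 and 19 additionally require removing duplicate insights, which the rejection maps to Lin's duplicate-image removal. A minimal sketch of one plausible reading, continuing the Insight type above; treating two insights as duplicates when they share the same kind and text is an assumed matching criterion, not one stated by the claims or by Lin:

def remove_duplicate_insights(insights):
    # Keep the earliest occurrence of each (kind, text) pair and drop the
    # rest; an illustrative stand-in for Lin-style duplicate removal.
    seen, unique = set(), []
    for ins in sorted(insights, key=lambda i: i.start):
        key = (ins.kind, ins.text)
        if key not in seen:
            seen.add(key)
            unique.append(ins)
    return unique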
Per claim 17, Schalkwyk in view of Liu and Lin discloses the method of claim 15. Liu further discloses wherein the model comprises a large language model (LLM) (para. [0084]).

Per claim 18, Schalkwyk in view of Liu and Lin discloses the method of claim 15. Schalkwyk further discloses wherein the method further comprises generating the audio insights by at least performing speech-to-text and diarization processing on the audio content (the ASR module 230 and/or the diarization module 220 (or some other component of the application 150) may index a transcription 310 of the audio data 122 using the time-stamped speaker labels 226 predicted for each segment 222 obtained from the diarization results 224. As shown in FIG. 2, the transcription 310 for the content feed 120 may be indexed by speaker to associate portions of the transcript 202 with the respective speaker …, para. [0039]).

Per claim 19, Schalkwyk in view of Liu and Lin discloses the method of claim 15. Schalkwyk further discloses wherein the method further comprises generating the visual insights by (i) performing facial recognition and object recognition on the visual content (para. [0031]; para. [0037]; generate diarization results 224 that include a corresponding speaker label 226 assigned to each segment 222 using a probability model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally the image data 124). … the diarization module 220 may simultaneously execute a face tracking routine to identify which participant is speaking during which segment 222 …, para. [0038]-[0039]; para. [0042]), and Lin discloses (ii) removing duplicate visual insights identified when performing facial recognition and object recognition on the visual content (para. [0026]; The curation application 102 can remove duplicate ones of the representative digital images 112 based on the determined diversity …, para. [0038]; para. [0041]).

Per claim 20, Schalkwyk in view of Liu and Lin discloses the method of claim 16. Schalkwyk further discloses wherein the multimedia content comprises streaming content and wherein generating the aggregated timeline comprises generating the aggregated timeline for a portion of the multimedia content of a predetermined duration of time (para. [0031]; para. [0038]; para. [0059]).

3. Claims 7 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Schalkwyk in view of Mohanty and Liu as applied to claims 1 and 9 above, and further in view of Lin, US 2017/0357877 A1 ("Lin").

Per claim 7, Schalkwyk in view of Mohanty and Liu discloses the method of claim 1. Schalkwyk further discloses wherein the method further comprises generating the visual insights by (i) performing facial recognition and object recognition on the visual content (para. [0031]; para. [0037]; generate diarization results 224 that include a corresponding speaker label 226 assigned to each segment 222 using a probability model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally the image data 124). … the diarization module 220 may simultaneously execute a face tracking routine to identify which participant is speaking during which segment 222 …, para. [0038]-[0039]; para. [0042]). Schalkwyk in view of Mohanty and Liu does not explicitly disclose (ii) removing duplicate visual insights identified when performing facial recognition and object recognition on the visual content. However, this feature is taught by Lin (para. [0026]; The curation application 102 can remove duplicate ones of the representative digital images 112 based on the determined diversity …, para. [0038]; para. [0041]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with the method of Schalkwyk in view of Mohanty and Liu in arriving at the missing features of Schalkwyk in view of Mohanty and Liu, because such a combination would have ensured that a collection of visual insights is not oversized (Lin, para. [0001]; para. [0026]; para. [0038]; para. [0041]).

Per claim 13, Schalkwyk in view of Mohanty and Liu discloses the method of claim 9. Schalkwyk further discloses wherein the method further comprises generating the visual insights by (i) performing facial recognition and object recognition on the visual content (para. [0031]; para. [0037]; generate diarization results 224 that include a corresponding speaker label 226 assigned to each segment 222 using a probability model (e.g., a probabilistic generative model) based on the audio data 122 (and optionally the image data 124). … the diarization module 220 may simultaneously execute a face tracking routine to identify which participant is speaking during which segment 222 …, para. [0038]-[0039]; para. [0042]). Schalkwyk in view of Mohanty and Liu does not explicitly disclose (ii) removing duplicate visual insights identified when performing facial recognition and object recognition on the visual content. However, this feature is taught by Lin (para. [0026]; The curation application 102 can remove duplicate ones of the representative digital images 112 based on the determined diversity …, para. [0038]; para. [0041]). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Lin with the method of Schalkwyk in view of Mohanty and Liu in arriving at the missing features of Schalkwyk in view of Mohanty and Liu, because such a combination would have ensured that a collection of visual insights is not oversized (Lin, para. [0001]; para. [0026]; para. [0038]; para. [0041]).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See the PTO-892 form.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUJIMI A ADESANYA, whose telephone number is (571) 270-3307. The examiner can normally be reached Monday-Friday, 8:30 am-5:00 pm.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/OLUJIMI A ADESANYA/
Primary Examiner, Art Unit 2658

Prosecution Timeline

Jun 09, 2023 — Application Filed
Aug 02, 2025 — Non-Final Rejection (§103)
Nov 12, 2025 — Examiner Interview Summary
Nov 12, 2025 — Applicant Interview (Telephonic)
Dec 01, 2025 — Response Filed
Feb 18, 2026 — Final Rejection (§103) (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591739 — METHOD AND SYSTEM FOR DIACRITIZING ARABIC TEXT — granted Mar 31, 2026 (2y 5m to grant)
Patent 12585686 — EVENT DETECTION AND CLASSIFICATION METHOD, APPARATUS, AND DEVICE — granted Mar 24, 2026 (2y 5m to grant)
Patent 12585481 — METHOD AND ELECTRONIC DEVICE FOR PERFORMING TRANSLATION — granted Mar 24, 2026 (2y 5m to grant)
Patent 12578779 — Multiple Stage Network Microphone Device with Reduced Power Consumption and Processing Load — granted Mar 17, 2026 (2y 5m to grant)
Patent 12579181 — Synchronization of Sensor Network with Organization Ontology Hierarchy — granted Mar 17, 2026 (2y 5m to grant)
Study what changed in these cases to get past this examiner. Based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 66%
Grant Probability with Interview: 91% (+25.5%)
Median Time to Grant: 3y 6m
PTA Risk: Moderate
Based on 655 resolved cases by this examiner; grant probability is derived from the career allow rate.
