Last updated: May 29, 2026
Application No. 18/359,158
VIDEO SUMMARY GENERATION FOR VIRTUAL CONFERENCES

Non-Final OA §103
Filed: Jul 26, 2023
Examiner: PATEL, HEMANT SHANTILAL
Art Unit: 2694
Tech Center: 2600 (Communications)
Assignee: Zoom Video Communications, Inc.
OA Round: 3 (Non-Final)
Interview Optional

Interview lift (+13.5%) is below the 15.0% threshold. A written response is recommended.
Based on 946 resolved cases, 2023–2026
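The recommendation above follows a simple threshold comparison. A minimal sketch of that rule is given below; the function name and the behavior at or above the threshold are assumptions for illustration only, since the page states only the below-threshold outcome.

```python
def recommend_response(interview_lift_pct: float, threshold_pct: float = 15.0) -> str:
    """Hypothetical helper: pick a response strategy from the interview lift.

    The 15.0 default mirrors the threshold quoted on the page; returning
    "interview" at or above the threshold is an assumption.
    """
    if interview_lift_pct < threshold_pct:
        return "written response"
    return "interview"


# The lift reported for this examiner is 13.5 percentage points.
print(recommend_response(13.5))
```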
Examiner Intelligence

PATEL, HEMANT SHANTILAL
Career Allowance Rate: 81% (768 granted / 946 resolved), above average, +19.2% vs TC avg
Interview Lift: +13.5% (moderate), based on resolved cases with interview
Avg Prosecution (typical timeline): 2y 9m
Currently pending: 20
Total Applications (career history): 964 across all art units
Statute-Specific Performance

§101: 0.9% (-39.1% vs TC avg)
§103: 78.9% (+38.9% vs TC avg)
§102: 4.9% (-35.1% vs TC avg)
§112: 9.3% (-30.7% vs TC avg)
Based on career data from 946 resolved cases
Office Action

§103
DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on December 23, 2025 has been entered.

Response to Amendment
Applicant’s arguments with respect to claims 1-20 have been considered but are moot in view of the new ground of rejection necessitated by the claim amendments.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-2, 4-5, 7-8, 10-12, 14-15, 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Huang (US Patent Application Publication No. 2022/0353469), and further in view of Chalana (US Patent Application Publication No. 2022/0215052).
Regarding claim 1, Huang teaches a method comprising:
generating a text summary for a meeting video (Paragraphs 0092-0095);
generating an audio summary based on the text summary (Paragraphs 0108, 0110, 0114);
determining a first set of correspondences between a first set of video portions from the meeting video and portions of the text summary based on a transcript of the meeting video (Paragraphs 0102, 0127, 0140);
identifying a plurality of key moments at least from the first set of video portions (Paragraphs 0094, 0100 identifying and scoring relevant key moments of video portions corresponding to text transcript portions using machine learning model);
selecting a plurality of video frames from the first set of correspondences based on the plurality of key moments (Paragraphs 0102-0103, 0127-0128, 0140-0141 selecting relevant video clips; Note: a video as a sequence of video frames was well known in the art.); and
generating a video summary of the meeting video based on the audio summary and the plurality of video frames (Paragraphs 0104, 0129, 0142 generating video summary; Paragraphs 0108, 0110, 0114 audio summary is also based on the same timestamps that are used for the video summary. It would have been obvious that the video summary is based on the text, audio, and video portions selected using the same timestamps.).
Huang does not teach determining a second set of correspondences between a second set of video portions from the meeting video and the portions of the text summary based on image data of the meeting video; identifying a plurality of key moments at least from the second set of video portions; and selecting a plurality of video frames from the second set of correspondences based on the plurality of key moments.
However, in the similar field of communication, Chalana teaches determining a second set of correspondences between a second set of video portions from the meeting video and the portions of the text summary based on image data of the meeting video (based on visual cues, gestures, graphical components); identifying a plurality of key moments at least from the second set of video portions (Paragraphs 0081, 0092 identifying key moments of upvote, downvote, clap etc. in video media using neural network, 0083 visual cues in video); and selecting a plurality of video frames from the second set of correspondences based on the plurality of key moments (video portions linked to sentences in summary) (Paragraphs 0078-0088).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang to include determining a second set of correspondences between a second set of video portions from the meeting video and the portions of the text summary based on image data of the meeting video; identifying a plurality of key moments at least from the second set of video portions; and selecting a plurality of video frames from the second set of correspondences based on the plurality of key moments as taught by Chalana in order to enable video summary “influenced by audio and visual cues, such as louder voices, gestures, changes in a slide deck, repetition of or return to graphical components of a slide deck or presentation, or the like” (emphasis added) (Chalana, Paragraph 0083).
Regarding claim 2, Huang teaches the text summary is generated using an artificial intelligence (AI)-based summarization model (Paragraphs 0094, 0100). Chalana teaches the text summary is generated using an artificial intelligence (AI)-based summarization model (Paragraphs 0081-0083).
Regarding claim 4, Huang teaches determining portions in the transcript corresponding to the portions of the text summary; identifying time ranges for the portions in the transcript; and selecting the first set of video portions of the meeting video based on the time ranges (Paragraphs 0102-0104, 0127-0129, 0140-0142, 0108, 0114 based on start time and end time).
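For orientation, the steps recited in claim 4 (matched transcript portions, their time ranges, and the resulting video portions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the application or of Huang: the `TranscriptSegment` type, the `select_video_portions` function, and the merging of touching ranges are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TranscriptSegment:
    start: float  # start time in seconds (assumed unit)
    end: float    # end time in seconds
    text: str


def select_video_portions(segments, matched_indices):
    """Return merged (start, end) time ranges for the matched transcript segments.

    `matched_indices` stands in for the transcript portions already found to
    correspond to the text summary; how they are found is out of scope here.
    """
    ranges = sorted((segments[i].start, segments[i].end) for i in matched_indices)
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching ranges become one video portion.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


segments = [
    TranscriptSegment(0, 10, "welcome"),
    TranscriptSegment(10, 20, "budget update"),
    TranscriptSegment(30, 40, "action items"),
]
print(select_video_portions(segments, [0, 1, 2]))
```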
Regarding claim 5, Huang teaches determining portions in the transcript corresponding to the portions of the text summary comprises: identifying candidate portions in the transcript corresponding to the portions of the text summary using a similarity model; ranking the candidate portions to generate a ranking list of candidate portions based on corresponding similarity scores for the candidate portions; and selecting the portions in the transcript with similarity scores greater than a threshold value from the ranking list of candidate portions (Paragraphs 0146, 0150-0154). Chalana teaches determining portions in the transcript corresponding to the portions of the text summary comprises: identifying candidate portions in the transcript corresponding to the portions of the text summary using a similarity model; ranking the candidate portions to generate a ranking list of candidate portions based on corresponding similarity scores for the candidate portions; and selecting the portions in the transcript with similarity scores greater than a threshold value from the ranking list of candidate portions (Paragraphs 0063, 0086, 0142, 0165, 0188, 0211).
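The score, rank, and threshold sequence recited in claim 5 can be illustrated with a short sketch. Word-overlap (Jaccard) similarity is used here purely as a stand-in for the claimed "similarity model"; neither the application nor the cited references is known to use it, and the function names and the 0.3 threshold are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in for a learned similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_and_select(summary_portion, candidates, threshold=0.3):
    """Score candidates, rank them by score, and keep those above the threshold."""
    scored = [(jaccard(summary_portion, c), c) for c in candidates]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [(score, text) for score, text in ranked if score > threshold]


summary = "budget approved for next quarter"
transcript_candidates = [
    "the budget was approved for next quarter",
    "lunch options were discussed",
    "next quarter budget",
]
for score, text in rank_and_select(summary, transcript_candidates):
    print(f"{score:.2f}  {text}")
```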
Regarding claim 7, Huang teaches classifying meeting audio (text) using classification model, but Huang does not teach prior to selecting the plurality of video frames, classifying the meeting video using a classification model to generate a meeting classifier; and identifying the plurality of key moments at least from the first set of video portions and the second set of video portions based on the meeting classifier.
However, in the similar field of communication, Chalana teaches prior to selecting the plurality of video frames, classifying the meeting video using a classification model to generate a meeting classifier; and identifying the plurality of key moments at least from the first set of video portions and the second set of video portions based on the meeting classifier (Paragraphs 0081-0083, 0092 classifying as activity, category of organization, industry, and using specific vocabulary and syntax model to identify visual cues using neural network).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang to include prior to selecting the plurality of video frames, classifying the meeting video using a classification model to generate a meeting classifier; and identifying the plurality of key moments at least from the first set of video portions and the second set of video portions based on the meeting classifier as taught by Chalana in order to enable “vocabulary specific to an activity, category of organization, industry, or the like” and “vocabulary and syntax used by different communities” (Chalana, Paragraph 0092).
Regarding claim 8, Chalana teaches wherein the plurality of key moments is related to presentation graphs, emojis, sentiments, engagements in the meeting video (Paragraphs 0083 slide deck, graphical components, gestures etc., 0092 emojis, various sentiments and engagements with responses). (Note: Huang and Chalana do not specifically teach group photos; however, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang and Chalana with an additional element such as individual photos or group photos as an implementation choice.)
Regarding claim 10, Huang teaches ranking the plurality of key moments based on user preferences and relationship to the audio summary to create a ranking list of key moments; and selecting the plurality of video frames based on the ranking list of key moments (Paragraphs 0091-0142 plurality of embodiments of ranking key moments highlighted as user preferences using machine learning and corresponding video clips selected for video summary). Chalana teaches ranking the plurality of key moments based on user preferences and relationship to the audio summary to create a ranking list of key moments; and selecting the plurality of video frames based on the ranking list of key moments (Paragraphs 0059-0088, 0142, 0165, 0188, 0211).
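As a rough illustration of the ranking recited in claim 10, the sketch below scores each key moment from a user-preference weight and a relevance value relative to the audio summary, then keeps the top-ranked moments. The `KeyMoment` type, the additive score, and the `top_k` cutoff are assumptions made for this example and are not drawn from the application, Huang, or Chalana.

```python
from dataclasses import dataclass


@dataclass
class KeyMoment:
    timestamp: float          # position in the meeting video, in seconds
    kind: str                 # e.g. "slide_change", "emoji", "gesture"
    summary_relevance: float  # assumed 0..1 relation to the audio summary


def rank_key_moments(moments, preference_weights, top_k=2):
    """Rank moments by (user preference weight + summary relevance), best first."""
    def score(moment):
        # Moment kinds the user has not weighted contribute no preference bonus.
        return preference_weights.get(moment.kind, 0.0) + moment.summary_relevance

    return sorted(moments, key=score, reverse=True)[:top_k]


moments = [
    KeyMoment(10.0, "slide_change", 0.9),
    KeyMoment(20.0, "emoji", 0.4),
    KeyMoment(30.0, "gesture", 0.8),
]
preferences = {"slide_change": 1.0, "emoji": 0.5}
print([m.timestamp for m in rank_key_moments(moments, preferences)])
```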
Regarding claim 11, Huang teaches generating a video summary of the meeting video based on the audio summary and the plurality of video frames comprises aligning the plurality of video frames to the portions of the audio summary (Paragraphs 0102-0104, 0127-0129, 0140-0142, 0108, 0114 text summary, corresponding audio and video excerpts based on same timestamps).
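One way to picture the timestamp-based alignment discussed for claim 11 is sketched below: each audio-summary portion carries the source time range it summarizes, and frames whose source timestamps fall in that range are spread evenly across the portion's playback time. The `AudioPortion` type, the even spacing, and the function name are hypothetical choices for illustration, not a description of Huang or of the application.

```python
from dataclasses import dataclass


@dataclass
class AudioPortion:
    source_start: float  # start of the summarized range in the meeting video (s)
    source_end: float    # end of that range (s)
    duration: float      # playback length of this audio-summary portion (s)


def align_frames(portions, frame_times):
    """Return (summary_time, source_frame_time) pairs in playback order."""
    timeline = []
    offset = 0.0  # running playback position within the summary
    for portion in portions:
        frames = [t for t in frame_times
                  if portion.source_start <= t < portion.source_end]
        if frames:
            step = portion.duration / len(frames)
            for i, t in enumerate(frames):
                timeline.append((offset + i * step, t))
        offset += portion.duration
    return timeline


portions = [AudioPortion(0, 60, 4.0), AudioPortion(120, 180, 2.0)]
print(align_frames(portions, [10, 50, 130, 300]))
```

Frames that fall outside every summarized range (300 in the usage above) are simply left out of the timeline.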
Regarding claim 12, Huang teaches a system (Figs. 1-2) comprising:
a communications interface (Fig. 1 item 114, Fig. 2 item 214); a non-transitory computer-readable medium (Fig. 2 item 204); and one or more processors (Fig. 2 item 202) communicatively coupled to the communications interface and the non-transitory computer-readable medium, the one or more processors configured to execute processor-executable instructions (Fig. 2 items 216, 218, 220) stored in the non-transitory computer-readable medium (Paragraphs 0046-0074) to:
generate a text summary for a meeting video (Paragraphs 0092-0095);
generate an audio summary based on the text summary (Paragraphs 0108, 0110, 0114);
determine a first set of correspondences between a first set of video portions from the meeting video and portions of the text summary based on a transcript of the meeting video (Paragraphs 0102, 0127, 0140);
identify a plurality of key moments at least from the first set of video portions (Paragraphs 0094, 0100 identifying and scoring relevant key moments of video portions corresponding to text transcript portions using machine learning model);
select a plurality of video frames from the first set of correspondences based on the plurality of key moments (Paragraphs 0102-0103, 0127-0128, 0140-0141 selecting relevant video clips; Note: a video as a sequence of video frames was well known in the art.); and
generate a video summary of the meeting video based on the audio summary and the plurality of video frames (Paragraphs 0104, 0129, 0142 generating video summary; Paragraphs 0108, 0110, 0114 audio summary is also based on the same timestamps that are used for the video summary. It would have been obvious that the video summary is based on the text, audio, and video portions selected using the same timestamps.).
Huang does not teach to determine a second set of correspondences between a second set of video portions from the meeting video and the portions of the text summary based on image data of the meeting video; identify a plurality of key moments at least from the second set of video portions; and select a plurality of video frames from the second set of correspondences based on the plurality of key moments.
However, in the similar field of communication, Chalana teaches to determine a second set of correspondences between a second set of video portions from the meeting video and the portions of the text summary based on image data of the meeting video (based on visual cues, gestures, graphical components); identify a plurality of key moments at least from the second set of video portions (Paragraphs 0081, 0092 identifying key moments of upvote, downvote, clap etc. in video media using neural network, 0083 visual cues in video); and select a plurality of video frames from the second set of correspondences based on the plurality of key moments (video portions linked to sentences in summary) (Paragraphs 0078-0088).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang to determine a second set of correspondences between a second set of video portions from the meeting video and the portions of the text summary based on image data of the meeting video; identify a plurality of key moments at least from the second set of video portions; and select a plurality of video frames from the second set of correspondences based on the plurality of key moments as taught by Chalana in order to enable video summary “influenced by audio and visual cues, such as louder voices, gestures, changes in a slide deck, repetition of or return to graphical components of a slide deck or presentation, or the like” (emphasis added) (Chalana, Paragraph 0083).
Regarding claim 14, Huang teaches to determine portions in the transcript corresponding to the portions of the text summary; identify time ranges for the portions in the transcript; and select the first set of video portions of the meeting video based on the time ranges (Paragraphs 0102-0104, 0127-0129, 0140-0142, 0108, 0114 based on start time and end time).
Regarding claim 15, Huang teaches determining portions in the transcript corresponding to the portions of the text summary comprises: identifying candidate portions in the transcript corresponding to the portions of the text summary using a similarity model; ranking the candidate portions in the transcript to generate a ranking list of candidate portions based on corresponding similarity scores for the candidate portions in the transcript; and selecting the portions in the transcript with similarity scores greater than a threshold value from the ranking list of candidate portions (Paragraphs 0146, 0150-0154).
Regarding claim 17, Huang teaches to classify the meeting audio (text) using classification model, but Huang does not teach to classify the meeting video using a classification model to generate a meeting classifier; identify the plurality of key moments at least from the first set of video portions and the second set of video portions based on the meeting classifier, wherein the plurality of key moments is related to presentation graphs, group photos, emojis, sentiments, engagements in the meeting video.
However, in the similar field of communication, Chalana teaches to classify the meeting video using a classification model to generate a meeting classifier; and identifying the plurality of key moments at least from the first set of video portions and the second set of video portions based on the meeting classifier (Paragraphs 0081-0083, 0092 classifying as activity, category of organization, industry, and using specific vocabulary and syntax model to identify visual cues using neural network), wherein the plurality of key moments is related to presentation graphs, emojis, sentiments, engagements in the meeting video (Paragraphs 0083 slide deck, graphical components, gestures etc., 0092 emojis, various sentiments and engagements with responses).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang to classify the meeting video using a classification model to generate a meeting classifier; and identifying the plurality of key moments at least from the first set of video portions and the second set of video portions based on the meeting classifier, wherein the plurality of key moments is related to presentation graphs, emojis, sentiments, engagements in the meeting video as taught by Chalana in order to enable “vocabulary specific to an activity, category of organization, industry, or the like” and “vocabulary and syntax used by different communities” (Chalana, Paragraph 0092). (Note: Huang and Chalana do not specifically teach group photos; however, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang and Chalana with an additional element such as individual photos or group photos as an implementation choice.)
Regarding claim 18, Huang teaches a non-transitory computer-readable medium (Fig. 2 item 204) comprising processor-executable instructions (Fig. 2 items 216, 218, 220) configured to cause one or more processors (Fig. 2 item 202) (Paragraphs 0066-0074) to:
generate a text summary for a meeting video (Paragraphs 0092-0095);
generate an audio summary based on the text summary (Paragraphs 0108, 0110, 0114);
determine a first set of correspondences between a first set of video portions from the meeting video and portions of the text summary based on a transcript of the meeting video (Paragraphs 0102, 0127, 0140);
identify a plurality of key moments at least from the first set of video portions (Paragraphs 0094, 0100 identifying and scoring relevant key moments of video portions corresponding to text transcript portions using machine learning model);
select a plurality of video frames from the first set of correspondences based on the plurality of key moments (Paragraphs 0102-0103, 0127-0128, 0140-0141 selecting relevant video clips; Note: a video as a sequence of video frames was well known in the art.); and
generate a video summary of the meeting video based on the audio summary and the plurality of video frames (Paragraphs 0104, 0129, 0142 generating video summary; Paragraphs 0108, 0110, 0114 audio summary is also based on the same timestamps that are used for the video summary. It would have been obvious that the video summary is based on the text, audio, and video portions selected using the same timestamps.).
Huang does not teach to determine a second set of correspondences between a second set of video portions from the meeting video and the portions of the text summary based on image data of the meeting video; identify a plurality of key moments at least from the second set of video portions; and select a plurality of video frames from the second set of correspondences based on the plurality of key moments.
However, in the similar field of communication, Chalana teaches to determine a second set of correspondences between a second set of video portions from the meeting video and the portions of the text summary based on image data of the meeting video (based on visual cues, gestures, graphical components); identify a plurality of key moments at least from the second set of video portions (Paragraphs 0081, 0092 identifying key moments of upvote, downvote, clap etc. in video media using neural network, 0083 visual cues in video); and select a plurality of video frames from the second set of correspondences based on the plurality of key moments (video portions linked to sentences in summary) (Paragraphs 0078-0088).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang to determine a second set of correspondences between a second set of video portions from the meeting video and the portions of the text summary based on image data of the meeting video; identify a plurality of key moments at least from the second set of video portions; and select a plurality of video frames from the second set of correspondences based on the plurality of key moments as taught by Chalana in order to enable video summary “influenced by audio and visual cues, such as louder voices, gestures, changes in a slide deck, repetition of or return to graphical components of a slide deck or presentation, or the like” (emphasis added) (Chalana, Paragraph 0083).
Regarding claim 19, Huang teaches classifying meeting audio (text) using classification model; rank the plurality of key moments based on user preferences and relationship to the audio summary to create a ranking list of key moments; and select the plurality of video frames based on the ranking list of key moments (Paragraphs 0091-0142 variety of embodiments of ranking key moments highlighted as user preferences using machine learning and corresponding video clips selected for video summary), but Huang does not teach to classify the meeting video using a classification model to generate a meeting classifier; and identifying the plurality of key moments at least from the first set of video portions and the second set of video portions based on the meeting classifier, wherein the plurality of key moments is related to presentation graphs, group photos, emojis, sentiments, engagements in the meeting video.
However, in the similar field of communication, Chalana teaches to classify the meeting video using a classification model to generate a meeting classifier; and identify the plurality of key moments at least from the first set of video portions and the second set of video portions based on the meeting classifier (Paragraphs 0081-0083, 0092 classifying as activity, category of organization, industry, and using specific vocabulary model to identify visual cues), wherein the plurality of key moments is related to presentation graphs, emojis, sentiments, engagements in the meeting video (Paragraphs 0083 slide deck, graphical components, gestures etc., 0092 emojis, various sentiments and engagements with responses); and rank the plurality of key moments based on user preferences and relationship to the audio summary to create a ranking list of key moments; and select the plurality of video frames based on the ranking list of key moments (Paragraphs 0059-0088, 0142, 0165, 0188, 0211).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang to classify the meeting video using a classification model to generate a meeting classifier; and identifying the plurality of key moments at least from the first set of video portions and the second set of video portions based on the meeting classifier, wherein the plurality of key moments is related to presentation graphs, emojis, sentiments, engagements in the meeting video as taught by Chalana in order to enable “vocabulary specific to an activity, category of organization, industry, or the like” and “vocabulary and syntax used by different communities” (Chalana, Paragraph 0092). (Note: Huang and Chalana do not specifically teach group photos; however, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang and Chalana with an additional element such as individual photos or group photos as an implementation choice.)
Regarding claim 20, Huang teaches generating a video summary of the meeting video based on the audio summary and the plurality of video frames comprises aligning the plurality of video frames to the portions of the audio summary (Paragraphs 0102-0104, 0127-0129, 0140-0142, 0108, 0114 text summary, corresponding audio and video excerpts based on same timestamps).
Claims 3, 13 are rejected under 35 U.S.C. 103 as being unpatentable over Huang and Chalana as applied to claims 1, 12 above, and further in view of Barbieri (US Patent Application Publication No. 2007/0171303).
Regarding claim 3, Huang and Chalana do not teach the text summary is converted to the audio summary using a text-to-speech (TTS) model.
However, in the similar field of communication, Barbieri teaches the text summary is converted to the audio summary using a text-to-speech (TTS) model (Paragraphs 0006-0007, 0027-0029).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang and Chalana to include the text summary converted to the audio summary using a text-to-speech (TTS) model as taught by Barbieri in order to enable “synthesizing the text summary into speech, generating a video summary of the audio-visual program content, and mixing the synthesized speech with the video summary” (Barbieri, Paragraph 0007).
Regarding claim 13, Huang teaches the text summary is generated using an artificial intelligence (AI)-based summarization model (Paragraphs 0094, 0100), but Huang and Chalana do not teach the text summary is converted to the audio summary using a text-to-speech (TTS) model.
However, in the similar field of communication, Barbieri teaches the text summary is converted to the audio summary using a text-to-speech (TTS) model (Paragraphs 0006-0007, 0027-0029).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang and Chalana to include the text summary converted to the audio summary using a text-to-speech (TTS) model as taught by Barbieri in order to enable “synthesizing the text summary into speech, generating a video summary of the audio-visual program content, and mixing the synthesized speech with the video summary” (Barbieri, Paragraph 0007).

Claims 6, 16 are rejected under 35 U.S.C. 103 as being unpatentable over Huang and Chalana as applied to claims 1, 12 above, and further in view of Dubey (US Patent No. 12,231,745).
Regarding claim 6, Huang and Chalana do not specifically teach identifying candidate video portions corresponding to the portions of the text summary by comparing the image data of the meeting video with the text summary using a similarity model; ranking the candidate video portions to generate a ranking list of candidate video portions based on corresponding similarity scores for the candidate video portions; and selecting the second set of video portions with similarity scores greater than a threshold value from the ranking list of candidate video portions.
However, in the similar field of communication, Dubey teaches identifying candidate video portions corresponding to the portions of the text summary by comparing the image data of the meeting video with the text summary using a similarity model; ranking the candidate video portions to generate a ranking list of candidate video portions based on corresponding similarity scores for the candidate video portions; and selecting the second set of video portions with similarity scores greater than a threshold value from the ranking list of candidate video portions (col. 4 ll. 41-59, col. 12 ll. 11-col. 15 ll.61, col. 19 ll.47-55 grabbing video frames corresponding to text, ranking elastic search with different levels of matches, selecting based on threshold of popularity from top-ranked i.e. list by ranking).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang and Chalana to include identifying candidate video portions corresponding to the portions of the text summary by comparing the image data of the meeting video with the text summary using a similarity model; ranking the candidate video portions to generate a ranking list of candidate video portions based on corresponding similarity scores for the candidate video portions; and selecting the second set of video portions with similarity scores greater than a threshold value from the ranking list of candidate video portions as taught by Dubey in order to “determine an ordered arrangement of the number of highest ranked segments” (Dubey, col. 13 ll. 4-5) and include “video content data 660 associated with candidate video content that may include certain quotes, popular quote data” (Dubey, col. 14 ll. 53-55).
Regarding claim 16, Huang and Chalana do not specifically teach identifying candidate video portions corresponding to the portions of the text summary by comparing the image data of the meeting video with the text summary using a similarity model; ranking the candidate video portions to generate a ranking list of candidate video portions based on corresponding similarity scores for the candidate video portions; and selecting the second set of video portions with similarity scores greater than a threshold value from the ranking list of candidate video portions.
However, in the similar field of communication, Dubey teaches identifying candidate video portions corresponding to the portions of the text summary by comparing the image data of the meeting video with the text summary using a similarity model; ranking the candidate video portions to generate a ranking list of candidate video portions based on corresponding similarity scores for the candidate video portions; and selecting the second set of video portions with similarity scores greater than a threshold value from the ranking list of candidate video portions (col. 4 ll. 41-59, col. 12 ll. 11-col. 15 ll.61, col. 19 ll.47-55 grabbing video frames corresponding to text, ranking elastic search with different levels of matches, selecting based on threshold of popularity from top-ranked i.e. list by ranking).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang and Chalana to include identifying candidate video portions corresponding to the portions of the text summary by comparing the image data of the meeting video with the text summary using a similarity model; ranking the candidate video portions to generate a ranking list of candidate video portions based on corresponding similarity scores for the candidate video portions; and selecting the second set of video portions with similarity scores greater than a threshold value from the ranking list of candidate video portions as taught by Dubey in order to “determine an ordered arrangement of the number of highest ranked segments” (Dubey, col. 13 ll. 4-5) and include “video content data 660 associated with candidate video content that may include certain quotes, popular quote data” (Dubey, col. 14 ll. 53-55).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Huang and Chalana as applied to claim 1 above, and further in view of Hu (US Patent Application Publication No. 2017/0169853).
Regarding claim 9, Huang and Chalana do not teach identifying the plurality of key moments further comprising extracting terminologies and named entities from the transcript.
However, in the similar field of communication, Hu teaches identifying the plurality of key moments further comprising extracting terminologies and named entities from the transcript (Paragraphs 0031, 0074-0081).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Huang and Chalana to include identifying the plurality of key moments further comprising extracting terminologies and named entities from the transcript as taught by Hu in order to “generate the set of text data based on the classified elements” by applying “any suitable named-entity recognition process or system to identify one or more named entities associated with the media program” (Hu, Paragraph 0031).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HEMANT PATEL whose telephone number is (571)272-8620. The examiner can normally be reached M-F 8:00 AM - 4:30 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Fan Tsang, can be reached at 571-272-7547. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

HEMANT PATEL
Primary Examiner
Art Unit 2694



/HEMANT S PATEL/           Primary Examiner, Art Unit 2694
Prosecution Timeline

Jul 26, 2023: Application Filed
May 06, 2025: Non-Final Rejection mailed (§103)
Jul 30, 2025: Response Filed
Sep 15, 2025: Final Rejection mailed (§103)
Dec 16, 2025: Response after Non-Final Action
Dec 23, 2025: Request for Continued Examination
Jan 25, 2026: Response after Non-Final Action
Feb 13, 2026: Non-Final Rejection mailed (§103, current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/020,427, Patent 12611995: Holder for a Mobile Device (2y 8m to grant, granted Apr 28, 2026)
18/264,633, Patent 12615349: MEETING VIDEO SUBSTITUTES (2y 8m to grant, granted Apr 28, 2026)
18/423,990, Patent 12609126: Multi-Channel Signal Encoding and Decoding Method and Apparatus (2y 2m to grant, granted Apr 21, 2026)
18/545,203, Patent 12610010: Method and a system for silent and non-exposing alerting of previously abandoned calls (2y 4m to grant, granted Apr 21, 2026)
18/129,192, Patent 12598254: SYSTEMS AND METHODS RELATING TO GENERATING SIMULATED INTERACTIONS FOR TRAINING CONTACT CENTER AGENTS (3y 0m to grant, granted Apr 07, 2026)
Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 81%
With Interview (+13.5%): 95%
Median Time to Grant: 2y 9m (~0m remaining)
PTA Risk: High
Based on 946 resolved cases by this examiner. Grant probability derived from career allowance rate.