Last updated: May 29, 2026
Application No. 18/129,977
AUTOMATED SEGMENTATION OF DIGITAL PRESENTATION DATA

Non-Final OA §103
Filed: Apr 03, 2023
Priority: Feb 18, 2020 (provisional 62/978,127 +1 more)
Examiner: TRAN, TUYETLIEN T
Art Unit: 2179
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Micah Development LLC
OA Round: 4 (Non-Final)
Interview Optional

+33.5% interview lift. Examiner has a relatively high allowance rate (68%). A written response may suffice.
Based on 642 resolved cases, 2023–2026
Examiner Intelligence

TRAN, TUYETLIEN T
Career Allowance Rate: 68% (above average)
434 granted / 642 resolved
+12.6% vs TC avg
Interview Lift: +33.5% (strong), comparing resolved cases with and without an interview
Typical timeline: 3y 10m avg prosecution
15 currently pending
Career history: 664 total applications across all art units
Statute-Specific Performance

§101: 1.7% (-38.3% vs TC avg)
§103: 89.8% (+49.8% vs TC avg)
§102: 5.3% (-34.7% vs TC avg)
§112: 1.1% (-38.9% vs TC avg)
Based on career data from 642 resolved cases
Office Action

§103
DETAILED ACTION
This Office Action is in response to the Amendment filed on 09/24/2025.
Claims 1-20 are pending claims; Claims 1, 19 and 20 are independent claims. This action is made final. 

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 103 

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-15, 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yun et al. (US 2015/0082330 A1; hereinafter Yun) in view of Kim et al. (US 2019/0250934 A1; hereinafter Kim).

As to claims 1, 19, and 20, Yun teaches:

(Claim 1) A method (see ¶ 0006) comprising:
(Claim 19) A computing apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor (see Figs. 17-18 and ¶ 0109-0113), configure the apparatus to: 
(Claim 20) A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that when executed by at least one computer (see ¶ 0122), cause the at least one computer to:
extracting speaker audio data from audio data of presentation digital data (see Fig. 10 and ¶ 0037, 0070-0072; voice analysis, speech analysis, audio fingerprint analysis. ¶ 0071; the voice analysis unit 1010 is configured to receive and analyze audio contents of the channel programs); 
analyzing the speaker audio data to identify an audio characteristic of the speaker audio data, the audio characteristic comprising at least one of tone, frequency, cadence, or volume of the speaker audio data (see ¶ 0059, 0070-0072; The voice analysis unit 1010 is configured to receive and analyze audio contents of the channel programs and identify a speaker such as an actor, an actress, a singer, etc. based on voice characteristics of the speaker. Voice processing techniques such as Gaussian mixture models, hidden Markov models, decision trees, neural networks, frequency estimation, etc. ¶ 0087; intensity of the noise); 
analyzing video data of the presentation digital data to identify a video characteristic of the video data, the video characteristic comprising a non-verbal cue (see ¶ 0037, 0049, 0051, 0070; analyzing video contents using any suitable video analysis methods such as video fingerprint analysis, a scene analysis, a face recognition analysis, and the like.  ¶ 0063, 0080; a face recognition may be performed on the video contents of the channel programs to identify Actor A and Actor B for Channel 1. The face recognition extracts facial features of Actors A and B in the video contents and compares the extracted facial features and reference facial features of faces stored in the reference content database 214);
identifying portions of the presentation digital data based on changes in the audio characteristic of the speaker audio data and the video characteristic of the video data (see ¶ 0051; a scene analysis may be performed on video and audio contents of a plurality of channel programs for a plurality of channels. The channel analysis unit 212 performs such an analysis on the video and audio contents of the channel programs to generate video and audio content tags of the channel programs. In one embodiment, the channel analysis unit 212 may determine that a soccer game has ended in a draw and is about to have a penalty shootout on Channel 60. For example, the channel analysis unit 212 may perform a scene analysis on the video contents of Channel 60 to recognize the soccer game by analyzing faces of players, a game field configuration, a goal post, a game ball, a game score, etc. Further, the audio contents of Channel 60 may be analyzed to recognize speech of commentators, background sounds such as crowd noise, music, etc. Based on the scene analysis, the channel analysis unit 212 may generate a content tag for Channel 60 including the recognized context and/or information on the contents of the channel program such as names of the players, a penalty shootout context, teams, scores, etc); 
automatically generating tags for the identified portions of the presentation digital data based on the audio characteristic of the speaker audio data and the video characteristic of the video data (see ¶ 0051, 0090; Based on the scene analysis, the channel analysis unit 212 may generate a content tag for Channel 60 including the recognized context and/or information on the contents of the channel program such as names of the players, a penalty shootout context, teams, scores, etc); and 
storing the identified portions of the presentation digital data and associated tags for retrieval (See ¶ 0090; the generated content tags 1110 and 1120 for the channels are then transmitted to the channel program recommendation unit 220 for generating a channel program recommendation).
While Yun discloses employing any voice processing techniques to analyze the audio content (see ¶ 0037, 0049, 0051, 0070; The voice analysis unit 1010 is configured to receive and analyze audio contents of the channel programs and identify a speaker such as an actor, an actress, a singer, etc. based on voice characteristics of the speaker. voice processing techniques such as Gaussian mixture models, hidden Markov models, decision trees, neural networks, frequency estimation, etc), Yun does not expressly teach the analyzing of the speaker audio data comprising using signal processing to identify changes in at least one of the tone, the frequency or the volume of the speaker audio data. 
However, Kim is relied upon for teaching the limitations. Specifically, Kim discloses a method, an apparatus, and a medium for analyzing of speaker audio data comprising using signal processing to identify changes in at least one of the tone, the frequency or the volume of the speaker audio data (see Fig. 2 and ¶ 0040; speech analysis module 215 receives as an input audio data from microphone 210, and detects when a user is speaking; speech detection module 215 outputs audio data, which has been processed to exclude background noise and quiet patches, and has been time-stamped so that a user's words received from microphone 210 at a specific point in time can be associated with, and processed in conjunction with, image data received from camera 205 at the same point in time. ¶ 0041; speech analysis module 220 further processes the audio data to identify cues in the audio data associated with an emotion or context of the audio data. For example, speech analysis module 220 may identify changes in pitch or volume in the audio data as cues associated with the user's emotion or a context of a user's statement. In certain embodiments, speech analysis module 220 may also output data as to a determined probability of a trait of the speaker. For example, the output of speech analysis module 220 may indicate that there is a 75% probability that the recorded speaker is male) and automatically generating tags for the identified portions of the presentation digital data based on the audio characteristic of the speaker audio data (¶ 0043; speech recognition module 230 may also tag words in the candidate word stream comprising cues which can be applied to a categorization of the speaker’s emotional state or context of the speaker’s words).
Both references disclose systems that analyze audio data and generate tags/metadata associated with identified portions of media content. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the teaching of Yun to include the features of analyzing the speaker audio data to identify a characteristic of the speaker audio data as disclosed by Kim so that portions of video/media content can be identified and tagged for later retrieval as claimed. One would be motivated to make such a combination because of the overlapping subject matter and the advantages provided by Kim, namely providing interfaces by which processor-powered electronic devices can accurately and responsively recognize and convert multi-modal user inputs into events, and improving the functionality of computers and other processor-powered apparatus (Kim: see ¶ 0004).
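
For orientation only, the kind of processing at issue in the claim 1 limitations (identifying portions from changes in a volume characteristic, then tagging and time-delimiting them) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the application, Yun, or Kim: the function names, the frame-based RMS measure, the `ratio` threshold, and the "loud"/"quiet" tags are all hypothetical.

```python
import math

def rms_per_frame(samples, frame_size):
    """Root-mean-square level of each consecutive, non-overlapping frame."""
    starts = range(0, len(samples) - frame_size + 1, frame_size)
    frames = [samples[i:i + frame_size] for i in starts]
    return [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]

def segment_on_volume_change(samples, sample_rate, frame_size, ratio=2.0):
    """Split audio where the frame level jumps or drops by at least `ratio`.

    Returns (start_sec, end_sec, tag) tuples. The ratio test and the
    "loud"/"quiet" tags are illustrative assumptions.
    """
    levels = rms_per_frame(samples, frame_size)
    if not levels:
        return []
    # Frame indices at which a new segment starts.
    boundaries = [0]
    for i in range(1, len(levels)):
        lo, hi = sorted((levels[i - 1], levels[i]))
        if hi > 0 and (lo == 0 or hi / lo >= ratio):
            boundaries.append(i)
    boundaries.append(len(levels))
    overall = sum(levels) / len(levels)
    segments = []
    for b0, b1 in zip(boundaries, boundaries[1:]):
        mean = sum(levels[b0:b1]) / (b1 - b0)
        tag = "loud" if mean >= overall else "quiet"
        segments.append((b0 * frame_size / sample_rate,
                         b1 * frame_size / sample_rate, tag))
    return segments
```

A pitch- or cadence-based variant would substitute a different per-frame measure for `rms_per_frame` while keeping the same boundary logic.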

As to claim 2, the rejection of claim 1 is incorporated.  Yun and Kim further teach wherein the speaker audio data is extracted using an audio analyzer (Yun: see ¶ 0037, 0058, 0070-0072; The channel analysis unit 212 is configured to generate the video and audio content tags by analyzing the video and audio contents using any suitable video and/or audio analysis methods such as a voice analysis, a speech analysis, an audio fingerprint analysis, a video fingerprint analysis, a scene analysis, a face recognition analysis, and the like. The reference content database 214 may include a video fingerprint database, an audio fingerprint database, a facial feature database, a voice model database, an object database, a background sound database, an acoustic model database, a facial model database, an object model database, and the like).  

As to claim 3, the rejection of claim 1 is incorporated.  Yun and Kim further teach 
wherein the speaker audio data is analyzed using a machine-learning engine (Yun: see ¶ 0071-0072; The voice analysis unit 1010 also accesses the voice model database in the reference content database 214 and identifies one or more speakers based on the extracted voice features and the voice models of the known speakers. The voice models in the reference content database 214 may be generated using voice processing techniques such as Gaussian mixture models, hidden Markov models, decision trees, neural networks, frequency estimation, etc).  

As to claim 4, the rejection of claim 3 is incorporated.  Yun and Kim further teach 
wherein the machine-learning engine includes a trained machine-learning program that has been trained based on a body of previous content generated by the speaker (Yun: see ¶ 0071-0072; The voice/speech analysis unit 1010/1020 also accesses the voice model database in the reference content database 214 and identifies one or more speakers based on the extracted voice features and the voice models of the known speakers. The voice models in the reference content database 214 may be generated using voice processing techniques such as Gaussian mixture models, hidden Markov models, decision trees, neural networks, frequency estimation, etc).  

As to claim 5, the rejection of claim 1 is incorporated.  Yun and Kim further teach 
wherein the identified audio characteristics of the speaker audio data is analyzed to identify key portions of the presentation digital data. (Yun: see ¶ 0051; the audio contents of Channel 60 may be analyzed to recognize speech of commentators, background sounds such as crowd noise, music, etc. Based on the scene analysis, the channel analysis unit 212 may generate a content tag for Channel 60 including the recognized context and/or information on the contents of the channel program such as names of the players, a penalty shootout context, teams, scores, etc.  Kim:  ¶ 0040-0043; speech analysis module 220 further processes the audio data to identify cues in the audio data associated with an emotion or context of the audio data. For example, speech analysis module 220 may identify changes in pitch or volume in the audio data as cues associated with the user's emotion or a context of a user's statement. In certain embodiments, speech analysis module 220 may also output data as to a determined probability of a trait of the speaker. For example, the output of speech analysis module 220 may indicate that there is a 75% probability that the recorded speaker is male). Thus, combining Yun and Kim would meet the claimed limitations for the same reasons as set forth in claim 1.

As to claim 6, the rejection of claim 1 is incorporated.  Yun and Kim further teach 
 wherein the generated tags include keywords associated with the identified portions of the presentation digital data (Yun: see ¶ 0072; the channel analysis unit 212 may include the speech analysis unit 1020 to recognize speech in the audio contents. Upon receiving audio contents of the channel programs, the speech analysis unit 1020 performs speech recognition on the audio contents to recognize key words or phrases. In one embodiment, the spoken words in the audio contents may be recognized based on speech features extracted from the audio contents and acoustic models stored in the reference content database 214. For each of the channel programs, the speech analysis unit 1020 extracts speech features in the audio contents of the channel programs. The speech analysis unit 1020 accesses the acoustic model database in the reference content database 214 and identifies key words or phrases based on the extracted speech features and the acoustic models. The acoustic models in the reference content database 214 may be generated using speech processing techniques such as Gaussian mixture models, hidden Markov models, decision trees, neural networks, frequency estimation, etc).  

As to claim 7, the rejection of claim 1 is incorporated.  Yun and Kim further teach 
wherein the identified portions of the presentation digital data are delimited using timestamps (Yun: see ¶ 0082; The scene recognition unit 1060 is configured to receive video and audio contents and recognize an event in the video and audio contents by performing a scene analysis. The scene analysis may include a speech analysis, a sound analysis, a face recognition, and an object recognition, or a combination thereof. In one embodiment, the scene recognition unit 1060 may perform the scene analysis by using or coordinating with the voice analysis unit 1010, the speech analysis unit 1020, the audio fingerprint analysis unit 1030, the video fingerprint analysis unit 1040, and the face recognition unit 1050, or any combination thereof. Alternatively, one or more of these units 1010 to 1050 may be provided directly in the scene recognition unit 1060 for use in the scene analysis. As used herein, the term "event" refers to any occurrence having a specified context such as a location, time, etc., and may include events such as a sporting event, a live-cast event (e.g., a concert, a musical, an on-the-scene news report, etc.), or the like. ¶ 0036; A content tag for a channel program indicates the context for the channel program and may be associated with identification information of the channel program such as a channel number, a name of a channel program, a name of a broadcast station, and the like. In some embodiments, a content tag may include any information characterizing video and/or audio contents of a channel program such as a name of an actor or an actress, a name of a singer, a topic of speech, a name of a soundtrack, an exciting event in the channel program, and so forth).  

As to claim 8, the rejection of claim 1 is incorporated.  Yun and Kim further teach 
generating a transcript of the speaker audio data to allow for analysis of speech content (Yun: see ¶ 0072-0073; the speech analysis unit 1020 to recognize speech in the audio content; A reference speech database in the reference content database 214 includes a plurality of reference key words and speech content information (e.g., a title, a related topic, a quote, etc.) associated with the reference key words. The text or data of the recognized words may be compared to reference key words in the reference speech database. For example, one or more reference key words (e.g., "husband," "wife," "family," etc.) may indicate speech content information such as topic (e.g., "relationship"). Based on the recognized text or data of the spoken words, the speech analysis unit 1020 may access the reference speech database and identify key words corresponding to the recognized text or data. The speech content information associated with one or more identified key words may then be provided in a content tag).
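
The keyword-to-topic lookup described in the passage cited for claim 8 can be pictured with a short sketch. This is an assumption-laden illustration, not Yun's implementation: the dictionary `REFERENCE_KEYWORDS` and the function `topics_from_transcript` are hypothetical names, and only the "husband"/"wife"/"family" to "relationship" mapping comes from the cited text (the soccer entries are invented for the example).

```python
import re
from collections import Counter

# Hypothetical reference speech database: key word -> topic.
# Only the "relationship" entries mirror the example quoted from Yun.
REFERENCE_KEYWORDS = {
    "husband": "relationship",
    "wife": "relationship",
    "family": "relationship",
    "penalty": "soccer",   # invented entry for illustration
    "goal": "soccer",      # invented entry for illustration
}

def topics_from_transcript(transcript, reference=REFERENCE_KEYWORDS):
    """Return topics for a transcript, most frequently matched first."""
    # Lower-case and tokenize the recognized text.
    words = re.findall(r"[a-z']+", transcript.lower())
    # Count how often each topic is hit by a recognized key word.
    counts = Counter(reference[w] for w in words if w in reference)
    return [topic for topic, _ in counts.most_common()]
```

The returned topics could then be attached to a content tag; how that tag is stored or associated with a portion of the media is outside this sketch.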

As to claim 9, the rejection of claim 1 is incorporated.  Yun and Kim further teach  wherein the identified portions of the presentation digital data and the associated tags are used to generate a summary of the presentation digital data (Yun: see Fig. 11 and ¶ 0087-0088; recognize that the video and audio contents include an exciting event, and provide an indication of the exciting event to be included in a content tag. ¶ 0077; The reference video fingerprint database includes a plurality of reference video fingerprints for a plurality of video contents and video content information associated with the reference video fingerprints (e.g., a title, a playing time, names actors and actress recognized in a particular scene, a summary, etc.). Examiner’s note: the limitation “are used to generate a summary of the presentation digital data” is intended use limitation; the limitation does not impart a patentable distinction because it simply expresses the intended use of the identified portions of the presentation digital data and associated tags).

As to claim 10, the rejection of claim 1 is incorporated.  Yun and Kim further teach 
wherein the identified portions of the presentation digital data and the associated tags are used to generate recommendations for related presentation digital data (Yun: see ¶ 0038, 0040, 0100; the content tags may be transmitted to the channel program recommendation unit 220 as they are generated. Examiner’s note: the limitation “are used to generate recommendations for related presentation digital data” is intended use limitation; the limitation does not impart a patentable distinction because it simply expresses the intended use of the identified portions of the presentation digital data and associated tags).  

As to claim 11, the rejection of claim 1 is incorporated.  Yun and Kim further teach 
causing presentation on a user interface to enable searching of the presentation digital data using the tags (Yun: see Fig. 11 and ¶ 0090; the generated content tags 1110 and 1120 for the channels are then transmitted to the channel program recommendation unit 220 for generating a channel program recommendation for the user to search for the portions of interest).  

As to claim 12, the rejection of claim 1 is incorporated.  Yun and Kim further teach 
extracting audience audio data from the audio data of the presentation digital data; analyzing the audience audio data to identify a characteristic of the audience audio data; and identifying the portions of the presentation digital data based on changes in the characteristic of the audience audio data (Yun: see ¶ 0051, 0085, 0087; the audio contents of Channel 60 may be analyzed to recognize speech of commentators, background sounds such as crowd noise, music, etc. Based on the scene analysis, the channel analysis unit 212 may generate a content tag for Channel 60 including the recognized context and/or information on the contents of the channel program such as names of the players, a penalty shootout context, teams, scores, etc).  

As to claim 13, the rejection of claim 12 is incorporated.  Yun and Kim further teach  wherein the characteristic of the audience audio data comprises at least one of a favorable audience reaction and or an unfavorable audience reaction (Yun: see ¶ 0051, 0085; the audio contents of Channel 60 may be analyzed to recognize speech of commentators, background sounds such as crowd noise, music, etc. Based on the scene analysis, the channel analysis unit 212 may generate a content tag for Channel 60 including the recognized context and/or information on the contents of the channel program such as names of the players, a penalty shootout context, teams, scores, etc. ¶ 0087;  crowd noise may indicate an exciting event according to the intensity of the noise. In this case, if the noise is louder, the scene recognition unit 1060 may determine that the associated event is an exciting event).  

As to claim 14, the rejection of claim 1 is incorporated.  Yun and Kim further teach 
extracting presenter video data from the video data of the presentation digital data (Yun: see a content tag may include any information characterizing video and/or audio contents of a channel program such as a name of an actor or an actress, a name of a singer, a topic of speech, a name of a soundtrack, an exciting event in the channel program, and so forth. ¶ 0063; a face recognition may be performed on the video contents of the channel programs to identify Actor A and Actor B for Channel 1. The face recognition extracts facial features of Actors A and B in the video contents and compares the extracted facial features and reference facial features of faces stored in the reference content database 214. For example, if the channel analysis unit 212 determines that the reference facial features for Actor A are most similar to the extracted facial features, it generates a content tag for Channel 1 including the name of Actor A. In this embodiment, the channel analysis unit 212 may generate a content tag for Channel 1 including the names of Actors A and B based on the face recognition analysis); 
analyzing the presenter video data to identify the video characteristic of the presenter video data (Yun: ¶ 0063; a face recognition may be performed on the video contents of the channel programs to identify Actor A and Actor B for Channel 1. The face recognition extracts facial features of Actors A and B in the video contents and compares the extracted facial features and reference facial features of faces stored in the reference content database 214. For example, if the channel analysis unit 212 determines that the reference facial features for Actor A are most similar to the extracted facial features, it generates a content tag for Channel 1 including the name of Actor A. In this embodiment, the channel analysis unit 212 may generate a content tag for Channel 1 including the names of Actors A and B based on the face recognition analysis); and 
identifying the portions of the presentation digital data based on a change in the video characteristic of the presenter video data (Yun: ¶ 0063, 0080; The face recognition unit 1050 is configured to receive and analyze the video contents of the channel programs, and performs facial recognition by accessing a facial feature database in the reference content database 214. In one embodiment, the face recognition unit 1050 may recognize a person (e.g., an actor, an actress, a singer, etc.) by detecting a face in the video contents and extracting facial features of the detected face. The facial feature database includes reference facial features of a plurality of people and identities (e.g., names) of the persons associated with the facial features).  
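
The comparison of extracted facial features against stored reference features, as quoted from Yun for claim 14, amounts to a nearest-match lookup. The sketch below is a hedged illustration only: the cosine-similarity measure, the `min_similarity` threshold, and the names `cosine_similarity` and `identify_face` are assumptions and are not taken from Yun, which does not specify the similarity measure in the cited passages.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify_face(features, reference_db, min_similarity=0.8):
    """Return the name whose reference features are most similar.

    Returns None when no reference reaches `min_similarity`
    (the threshold is an assumption for illustration).
    """
    best_name, best_score = None, min_similarity
    for name, reference in reference_db.items():
        score = cosine_similarity(features, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

Note that a lookup of this kind identifies who appears in a frame; detecting a change in a presenter's video characteristic would additionally require comparing results across frames.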

As to claim 15, the rejection of claim 14 is incorporated.  Yun and Kim further teach 
wherein the video characteristic of the presenter video data comprises at least one of a motion characteristic or an expression characteristic related to the presenter as depicted within the presenter video data (Yun: see ¶ 0078; motion changes. ¶ 0087; exciting event can be detected using video fingerprint analysis, face recognition technique. Kim:  ¶ 0040-0043; speech analysis module 220 further processes the audio data to identify cues in the audio data associated with an emotion or context of the audio data. For example, speech analysis module 220 may identify changes in pitch or volume in the audio data as cues associated with the user's emotion or a context of a user's statement. In certain embodiments, speech analysis module 220 may also output data as to a determined probability of a trait of the speaker. For example, the output of speech analysis module 220 may indicate that there is a 75% probability that the recorded speaker is male). Thus, combining Yun and Kim would meet the claimed limitations for the same reasons as set forth in claim 1.  

Claim 4 is alternatively rejected under 35 U.S.C. 103 as being unpatentable over Yun and Kim in view of Aharoni et al. (US 2021/0090570 A1; hereinafter as Aharoni).

As to claim 4, the rejection of claim 3 is incorporated.  Aharoni is relied upon for alternatively teaching the limitations: machine-learning engine includes a trained machine-learning program that has been trained based on a body of previous content generated by the speaker (see ¶ 0008; to generate an intent {~tag/metadata}, the automated calling system may provide the model with data related to the context of the telephone call, any previous intents from the telephone call, and the audio of the human's speech to which the bot should respond. The automated calling system may train this model using machine learning and previous telephone conversations related to completing tasks that the bot may be performing. ¶ 0010-0011, 0052; the action of generating the synthesized speech of the reply by the bot to the utterance includes accessing historical data for previous telephone conversation, wherein the historical data includes, for each previous telephone conversation, (i) a previous context of the previous telephone conversation, (ii) previous first speaker intents of portions of the previous telephone conversation spoken by a first speaker, (iii) previous second speaker intents of portions of the previous telephone conversation spoken by the second speaker, (iv) previous audio data of a most recent utterance of the first speaker or the second speaker during the previous telephone conversation, and (v) a previous intent of a previous reply to the most recent utterance; and training, using machine learning and the historical data, a model that is configured to receive (i) audio data of a most recent given utterance of a given telephone conversation, (ii) a given user intent of a first portion of the given telephone conversation spoken by a given user, (iii) a given bot intent of a second portion of the given telephone conversation outputted by the speech synthesizer of the bot, and (iv) a given context of the given telephone conversation and output a given intent for a given reply to the most recent given utterance).
Each of the references discloses a system for identifying speech data using machine learning techniques.  Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the machine learning technique disclosed in Yun/Kim to include the features of training machine learning models using previous audio data as disclosed by Aharoni so that speech audio data can be analyzed as claimed.  One would be motivated to make such a combination because of the overlapping subject matter and the advantages provided by Aharoni, namely using the training data to process and analyze natural language data more accurately (Aharoni: see ¶ 0002).

Claims 12-13 are alternatively rejected under 35 U.S.C. 103 as being unpatentable over Yun and Kim in view of Kondo et al. (US 2004/0117815 A1; hereinafter as Kondo).

As to claim 12, the rejection of claim 1 is incorporated.  Kondo is relied upon for teaching the limitations.  Specifically, Kondo discloses a method and a system relating to audience state estimation, comprising: extracting audience audio data from the audio data of the presentation digital data (Kondo: see Col. 2, lines 30-40; the sound-obtaining device obtains sound from the audience and generates the audio signal according to the sound thus obtained); analyzing the audience audio data to identify a characteristic of the audience audio data (Kondo: see Col. 2, lines 30-40; estimation device estimates an audience state such as a state of laughing based on the volume); and identifying the portions of the presentation digital data based on changes in the characteristic of the audience audio data (Kondo: see Figs. 24A-25B and Col. 11, line 62 through Col. 12, line 30; when the characteristic amount 302 is larger than reference level Lv2, the state 22A1 of “clapping” is estimated).  
Each of the references discloses a system for analyzing audio data.  Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the audio data analysis disclosed in Yun/Kim to include the features of analyzing audience audio data as disclosed by Kondo so that audience audio data can be extracted and analyzed as claimed.  One would be motivated to make such a combination because of the overlapping subject matter and the advantages provided by Kondo, namely that analyzing audience audio data to learn about the state of the audience helps effectively provide contents to the audience (Kondo: see Col. 1, lines 25-36).
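
The threshold comparison attributed to Kondo above (a characteristic amount exceeding a reference level yields an estimated "clapping" state) can be sketched as a per-frame classification followed by change detection. This is an illustrative sketch under assumptions, not Kondo's implementation: the function names, the "other" fallback state, and the use of a single reference level are hypothetical.

```python
def estimate_states(amounts, reference_level):
    """Label each frame "clapping" when its characteristic amount
    exceeds the reference level, otherwise "other" (assumed fallback)."""
    return ["clapping" if a > reference_level else "other" for a in amounts]

def state_changes(states):
    """Return (frame_index, state) at the start and at every state change."""
    changes = [(0, states[0])] if states else []
    for i in range(1, len(states)):
        if states[i] != states[i - 1]:
            changes.append((i, states[i]))
    return changes
```

The change points returned by `state_changes` are one way to delimit portions of a recording by audience reaction; mapping frame indices to timestamps would depend on the frame rate, which is not specified here.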

As to claim 13, the rejection of claim 12 is incorporated.  Yun/Kim/Kondo further teach wherein the characteristic of the audience audio data comprises at least one of a favorable audience reaction and an unfavorable audience reaction (Kondo: see Col. 2, lines 30-59; the estimation device estimates an audience state such as a state of laughing based on the volume, state of clapping based on the sound periodicity {~favorable audience reaction}).  
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention to have modified the audio data analysis disclosed in Yun/Kim to include the features of analyzing audience audio data as disclosed by Kondo so that audience audio data can be extracted and analyzed as claimed.  One would be motivated to make such a combination because of the overlapping subject matter and the advantages provided by Kondo that analyzing audience audio data to learn about the state of the audience helps effectively provide contents to the audience (Kondo: see Col. 1, lines 25-36).

Claims 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Yun and Kim in view of Castaneda et al. (US 2019/0155949 A1; hereinafter as Castaneda).

As to claim 16, the rejection of claim 1 is incorporated.  Yun and Kim further teach:
wherein the identified portions of the presentation digital data comprise secondary content related to primary content (Yun: see Fig. 4 and ¶ 0050; recommendation window 410);
causing presentation of a graphical user interface (GUI) on a display screen, the GUI depicting the primary content (Yun: see Fig. 4 and ¶ 0050; recommendation window 410); 
causing presentation within the GUI of an indicator corresponding to a portion of the primary content, the indicator indicating availability of related secondary content of the secondary content, related to the portion of the primary content (Yun: see Fig. 4-6 and ¶ 0050-0057; recommendation window 410);
detecting user selection of the indicator (Yun: see Fig. 4-6 and ¶ 0050-0057; recommendation window 410);
responsive to the detection of the user selection of the indicator, causing presentation within the GUI of a plurality of secondary content identifiers that are user selectable to access the related secondary content, related to the portion of the primary content (Yun: see Fig. 4-6 and ¶ 0050-0057; recommendation window 410).
Alternatively, Castaneda is relied upon for teaching the following limitations:
wherein the identified portions of the presentation digital data comprise secondary content related to primary content (Castaneda: see Fig. 2 and ¶ 0043; Text 240 for the ebook is shown with some additional enhancements. For example, a phrase “The Beast” is shown in bold. A box 250 highlights a phrase “be our guest” in the text 240 that has been selected by a reader. There is also shown a selectable option 260 for related content. The phrase “The Beast” is shown in bold to signify its importance and also to depict a source or type of associated related content that has been identified by the media guidance application. Highlighted phrases or terms in the ebook can be depicted in a number of ways so that the user can see that different types of media may be associated with the phrase. For example, different colors or text styles may be used to signify that there may be a related movie, television show, video clip, news article, blog commentary, social network channels, user generated video commentary, etc. The highlighted phrases can be significant for different reasons. In some examples, the highlighted phrase may indicate that it has some related content and that if the phrase is selected, some additional content about the phrase can be accessed. The words in the ebook file that are shown with a highlight can be selected for display in a suitable manner by the media guidance application using information about the text from the ebook manifest file, and using metadata for the ebook. For example, a media guidance application may retrieve details about supplemental content for the ebook from a media database and identify terms to highlight in the ebook by matching supplemental content items with ebook text information from the manifest file for the ebook. 
¶ 0060; interactive media guidance applications may take various forms depending on the content for which they provide guidance; the terms “media assets” and “content” should be understood to mean an electronically consumable user asset such as television programming…video clips, audio), and the method comprises: 
causing presentation of a graphical user interface (GUI) on a display screen, the GUI depicting the primary content (Castaneda: see Fig. 2 and ¶ 0043; Text 240 for the ebook is shown with some additional enhancements); 
causing presentation within the GUI of an indicator corresponding to a portion of the primary content, the indicator indicating availability of related secondary content of the secondary content, related to the portion of the primary content (Castaneda: see Fig. 2 and ¶ 0043; Text 240 for the ebook is shown with some additional enhancements. For example, a phrase “The Beast” is shown in bold. A box 250 highlights a phrase “be our guest” in the text 240 that has been selected by a reader. There is also shown a selectable option 260 for related content. The phrase “The Beast” is shown in bold to signify its importance and also to depict a source or type of associated related content that has been identified by the media guidance application. Highlighted phrases or terms in the ebook can be depicted in a number of ways so that the user can see that different types of media may be associated with the phrase. For example, different colors or text styles may be used to signify that there may be a related movie, television show, video clip, news article, blog commentary, social network channels, user generated video commentary, etc. The highlighted phrases can be significant for different reasons. In some examples, the highlighted phrase may indicate that it has some related content and that if the phrase is selected, some additional content about the phrase can be accessed. The words in the ebook file that are shown with a highlight can be selected for display in a suitable manner by the media guidance application using information about the text from the ebook manifest file, and using metadata for the ebook. For example, a media guidance application may retrieve details about supplemental content for the ebook from a media database and identify terms to highlight in the ebook by matching supplemental content items with ebook text information from the manifest file for the ebook); 
detecting user selection of the indicator (Castaneda: see Figs. 2, 4, 5 and ¶ 0047-0051; Supplemental content related to the ebook may also be obtained by selecting an option 460 and by selecting certain text in the ebook display shown in a distinctive manner, e.g., by selecting “The Beast” which is highlighted to show that it has some related content. “The Beast” may be highlighted in a distinctive manner to indicate that it has supplemental content from a certain source or of a particular type, such as a movie or video clip, or a social network channel, or user generated commentary, etc.  [0049] When a query for supplemental content is created using the media guidance application by, for example, receiving a selection of an option for related content using buttons 260 (FIG. 2) or 460 (FIG. 4), or by selection of portions of text 250 or 450 (FIGS. 2 and 4, respectively), or by selection of text portions designated as having related content, a remote database of media content may be searched for supplemental content, or predefined links to supplemental content. The remote databases may be ebook or media databases that are accessible from the media guidance application via a home network (e.g. LAN) or the Internet. The searches may be performed by the media guidance application in one or more media databases in order to obtain search results. The search query may be created using selected text as well as information about an ebook, in particular location within the ebook.  Fig. 5 and ¶ 0051; presenting of supplemental content item/index as shown in Fig. 5; such as items 510, 520, 530, 540, 550); 
responsive to the detection of the user selection of the indicator, causing presentation within the GUI of a plurality of secondary content identifiers that are user selectable to access the related secondary content, related to the portion of the primary content (Castaneda: see Figs. 2, 4, 5 and ¶ 0047-0051; Supplemental content related to the ebook may also be obtained by selecting an option 460 and by selecting certain text in the ebook display shown in a distinctive manner, e.g., by selecting “The Beast” which is highlighted to show that it has some related content. “The Beast” may be highlighted in a distinctive manner to indicate that it has supplemental content from a certain source or of a particular type, such as a movie or video clip, or a social network channel, or user generated commentary, etc.  [0049] When a query for supplemental content is created using the media guidance application by, for example, receiving a selection of an option for related content using buttons 260 (FIG. 2) or 460 (FIG. 4), or by selection of portions of text 250 or 450 (FIGS. 2 and 4, respectively), or by selection of text portions designated as having related content, a remote database of media content may be searched for supplemental content, or predefined links to supplemental content. The remote databases may be ebook or media databases that are accessible from the media guidance application via a home network (e.g. LAN) or the Internet. The searches may be performed by the media guidance application in one or more media databases in order to obtain search results. The search query may be created using selected text as well as information about an ebook, in particular location within the ebook.  Fig. 5 and ¶ 0051; presenting of supplemental content item/index as shown in Fig. 5; such as items 510, 520, 530, 540, 550).  
Each of the references discloses a user interface for presenting media content to the user. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the user interface disclosed in Yun/Kim to include the features of presenting the indicators that link to related/supplemental information as disclosed by Castaneda so that the user can access related information/content as claimed.  One would be motivated to make such a combination because Castaneda suggests that the displayed content can be in any format including video and/or audio (see ¶ 0060) and because of the advantages provided by Castaneda, which provides a mechanism for the user to access additional content related to the highlighted text to enhance the reader’s reading experience (Castaneda: see ¶ 0001).  

As to claim 17, the rejection of claim 16 is incorporated.  Yun/Kim/Castaneda further teach wherein metadata is presented within the GUI in association with the plurality of secondary content identifiers to enable a user to filter the plurality of secondary content identifiers based on the metadata (Castaneda: Fig. 5 and ¶ 0051; presenting of supplemental content item/index as shown in Fig. 5; such as items 510, 520, 530, 540, 550.  ¶¶ 0019-0020; a user may select one of the search results in the display, and the media guidance application will generate a display of supplement content item that corresponds to the selected search result. For example, a user may select a movie or television episode from the list of search results and the media guidance application will generate a display of the selected media item by requesting the selected media item from a media source).  Thus, combining Yun/Kim/Castaneda would meet the claimed limitations for the same reasons as set forth in claim 16.

As to claim 18, the rejection of claim 17 is incorporated.  Yun/Kim and Castaneda further teach wherein the metadata comprises the associated tags (Yun: ¶ 0037-0038; the content tags may be transmitted to the channel program recommendation unit 220 as they are generated).  

Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are not persuasive.
Applicants argued that the cited art fails to disclose or suggest analysis of video data to identify a video characteristic that comprises a “non-verbal cue”. Specifically, Applicants cite paragraphs [0054] and [0055] and argue that the term “non-verbal cue” as used in the claims refers to behavioral cues such as gestures, posture changes, or facial expressions that convey emphasis or meaning, not mere identification of a person (see Remarks, pages 7-8).
In response, the examiner respectfully disagrees.  The claims do not specify what a non-verbal cue is; the specification provides some examples of what the non-verbal cue can be (see published specification paragraphs [0244], [0054]-[0055]); however, the specification does not limit the non-verbal cues to “gestures, posture changes, or facial expressions that convey emphasis or meaning”. For clarity, paragraph [0244] is reproduced as follows:
“In Example 24, the subject matter of Examples 1-23 includes, wherein the analyzer operates to identify the characteristic of the video data as one or more of facial expressions, body language, and other non-verbal cues”. 

As recited, the non-verbal cues are not limited to “facial expressions” or “body language” because paragraph [0244] recites the term “other non-verbal cues”.  
In addition, paragraphs [0054]-[0055] only recite non-limiting examples of what video data the video analyzer analyzes.  Paragraph [0054] is reproduced as follows:
[0054] The analyzer/connector engine 218 also includes a video analyzer 304 that includes algorithms to perform several analytic operations on video data accessed at a third-party systems 112. For example, the video analyzer 304 analyzes video data to identify a speaker within a YouTube video, and then analyzes movement on a stage of that speaker to identify key or important portions of the presentation that the speaker may have intended to emphasize. Here, the machine-learning engine 1500 again assists the video analyzer 304 by constructing a model, including a trained machine-learning program 1510, for a particular speaker and be trained to identify that the speaker characteristically stands up (or performs some other motion) when making a key point or wishing to particularly engage with an audience. The portion of the video where the speaker is then standing may be delimited (e.g., the begin and end timestamps recorded) and tagged as being important based on this analysis of the video. In a similar way, the video analyzer 304 may analyze an expression 314 of a speaker, based on a model of that speaker (or a more generalized model) to identify and delimit key portions of a presentation, and tag or generate other metadata pertaining to those key portions.

For these reasons, the term “non-verbal cues” is interpreted to be any cue that is not verbal and that can be identified from the video content.
In this case, Yun discloses video analysis, including analyzing video content using any suitable video analysis method such as video fingerprint analysis, scene analysis, face recognition analysis, and the like (see ¶ 0037, 0049, 0051, 0070).  Yun further discloses that face recognition may be performed on the video contents of the channel programs to identify Actor A and Actor B for Channel 1. The face recognition extracts facial features of Actors A and B in the video contents and compares the extracted facial features with reference facial features of faces stored in the reference content database 214 (see ¶ 0063, 0080). The examiner’s position is that “facial features extracted from the face recognition or video content analysis” read on the non-verbal cues because they are visual cues.  If the applicants intended the term “non-verbal cues” to be limited to “behavioral cues such as gestures, posture changes, or facial expressions that convey emphasis or meaning”, the applicants are advised to amend the claims to recite such limitations.
The examiner further notes that the cited reference Kim discloses multiple examples of non-verbal cues such as data indicating the presence of “rolled eyes” or smiles in an image (see ¶ 0067, 0089, 0091), facial expression and gestures (see ¶ 0074).
For at least these reasons, the examiner maintains that the combined teaching of Yun and Kim renders obvious the disputed feature “non-verbal cues”.


Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

The prior art made of record on form PTO-892 and not relied upon is considered pertinent to applicant's disclosure.  Applicant is required under 37 C.F.R. § 1.111(c) to consider these references fully when responding to this action. For example:
Keller (US 8542205 B1) – outputting different content on a touch-sensitive display of a device based at least in part on an amount of force applied to the touch-sensitive display. For instance, when a user reads an electronic book (eBook) on a device having a touch-sensitive display, the user may make a selection of a word or phrase within the eBook by touching the display at a location of the word or phrase. In response, the techniques may output information associated with the selected word. For instance, the device may output, in response, a dictionary definition of the selected word, a picture associated with the selected word, synonyms of the selected word, or the like. Thereafter, the user may apply a greater or lesser amount of force to the selected word and, in response, the device may output other instances or uses of the selected word.
It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way.  A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art.  In re Heck, 699 F.2d 1331, 1332-33, 216 USPQ 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 USPQ 275, 277 (CCPA 1968)).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to TUYETLIEN T TRAN whose telephone number is (571)270-1033. The examiner can normally be reached on Monday-Friday from 8:00 AM to 5:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Stephen Hong, can be reached at telephone number 571-272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from Patent Center and the Private Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from Patent Center or Private PAIR. Status information for unpublished applications is available through Patent Center and Private PAIR for authorized users only. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) Form at https://www.uspto.gov/patents/uspto-automated-interview-request-air-form.


/TUYETLIEN T TRAN/Primary Examiner, Art Unit 2179
Prosecution Timeline

Oct 11, 2024: Final Rejection mailed (§103)
Dec 10, 2024: Response after Non-Final Action
Jan 07, 2025: Request for Continued Examination
Jan 13, 2025: Response after Non-Final Action
Jun 24, 2025: Non-Final Rejection mailed (§103)
Sep 24, 2025: Response Filed
Oct 23, 2025: Final Rejection mailed (§103)
Dec 29, 2025: Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

18/518,488
Patent 12602153
SIGNAL TRACKING AND OBSERVATION SYSTEM AND METHOD
2y 4m to grant Granted Apr 14, 2026
18/310,939
Patent 12586104
OBJECT DISPLAY METHOD AND APPARATUS, ELECTRONIC DEVICE, AND COMPUTER READABLE STORAGE MEDIUM
2y 10m to grant Granted Mar 24, 2026
18/335,978
Patent 12585376
SYSTEMS AND METHODS OF REDUCING OBSTRUCTION BY THREE-DIMENSIONAL CONTENT
2y 9m to grant Granted Mar 24, 2026
18/594,017
Patent 12585377
SYSTEM AND METHOD FOR HANDLING OVERLAPPING OBJECTS IN VISUAL EDITING SYSTEMS
2y 0m to grant Granted Mar 24, 2026
17/976,206
Patent 12573257
DIGITAL JUKEBOX DEVICE WITH IMPROVED USER INTERFACES, AND ASSOCIATED METHODS
3y 4m to grant Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 4-5
Grant Probability: 68%
With Interview (+33.5%): 99%
Median Time to Grant: 3y 10m (~8m remaining)
PTA Risk: High
Based on 642 resolved cases by this examiner. Grant probability derived from career allowance rate.