DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Applicant’s Amendments to the Claims, filed August 6, 2025, have been entered. Claims 1-10 have been amended, claims 12-17 have been added, and claims 1-17 are currently pending. The Amendments to the Specification, filed August 6, 2025, have been entered.
Response to Arguments
Applicant's arguments filed August 6, 2025 have been fully considered but they are not persuasive. Applicant argues that cited prior art Liu (Pub. No. US 2017/0270203 A1, hereinafter “Liu”) does not teach a streaming composite media file comprising a plurality of component media streams, as recited in claim 1 (Remarks pp. 13-14). Specifically, Applicant argues that Liu only teaches searching segments of multiplexed audio/video content (Remarks p. 14). In response, examiner respectfully submits that Liu teaches a streaming composite media file. Claim 1 recites “a method of searching for one or more search words in at least one streaming composite media file, each at least one streaming composite media file comprising a plurality of component media streams” (emphasis added), and Applicant’s Specification [0016] (US 2023/0325433 A1) provides an example of a composite, which includes an audio/video stream, a secondary video stream (i.e. a video stream), a screen capture or a presentation slide display stream, and a secondary audio stream (i.e. an audio stream). Liu teaches that a video segment can include video frames, audio scenes, and video scenes, and that the audio content can be extracted from the video to create the transcript (Liu [0048]-[0049]). In other words, the video and audio content are utilized as separate streams (see Applicant’s Specification [0037] and Fig. 3, where “Audio to Text”, i.e. a transcript, is utilized as a separate stream). Examiner interprets that the video segment discloses a streaming composite media file, comprising a video component media stream and an audio component media stream, i.e. a plurality of component media streams. Examiner notes that the streams listed in Applicant’s Specification [0016] are exemplary, and that Applicant’s Specification [0015] provides that “…a large portion of multimedia content comprises multiple streams of different data formats”.
Applicant argues that Liu does not teach “determining whether or not any of the one or more sought-after search words are present…on a time segment-by-time segment basis,” as recited in claim 1 (Remarks pp. 15-16). Specifically, Applicant argues that Liu teaches searching for video segments, and not time segments (Remarks p. 15). In response, examiner respectfully submits that Liu teaches searching for time segments in the audio component media stream and the video component media stream. Liu teaches that a transcript is generated based on the audio content (Liu [0048]), and a time-aligned transcript is generated (Liu [0051]). Liu also teaches that the timestamped video content that matches the search query is identified based on the timestamped transcript (Liu [0084]), where a portion of the transcript can be associated with one or more timestamps of a video segment in the video that corresponds to the portion of the transcript, such as a timestamp that indicates the start or the end of the presentation of the video segment (Liu [0049]). Examiner interprets that the audio component media stream is segmented by time through the timestamps in the transcript, and the video component media stream is segmented by time through the timestamps, which include timestamps that indicate the start or end of the video segment.
Applicant argues that Liu does not teach a significance value because Liu teaches only one kind of data stream (Remarks pp. 16-17). In response, examiner respectfully directs Applicant to the discussion above regarding where Liu discloses the video component media stream and the audio component media stream.
Applicant argues that Liu does not teach a match intensity value, specifically citing that Liu’s matching score is not a sum of only relevancy scores, but includes a combination of relevancy score plus popularity and recency scores (Remarks pp. 17-18). In response, examiner respectfully directs Applicant to Liu [0089], which indicates that a matching score can be calculated based on any suitable criteria, and as an example can be based solely on the relevancy score. Further, claim 1 recites “totaling the significance values assigned to the matches…”, which is taught by Liu even where the matching score is based on the sum of relevancy scores (i.e. totaling the significance values) plus other values.
Applicant argues there is no motivation to combine the teachings of Liu with the teachings of Shekhar et al. (Patent No. US 10,311,913 B1, hereinafter “Shekhar”) because Shekhar summarizes video content based on consolidating memorable segments of the video content, and that there is generally a possibility of partitioning an input video into segments (Remarks pp. 18-19). Applicant further argues that there is no motivation to combine Liu’s focus on video content searching with the methods of summarizing video content (Remarks p. 19). In response, examiner respectfully submits that Shekhar was cited as teaching “dividing the streaming composite media files into a plurality of time segments of equal length”, and the motivation to combine Liu and Shekhar is to summarize video content in a way that distinguishes the video content (or its summary) from other videos competing for viewers’ attention (Non-Final Office Action pp. 15-16). Shekhar distinguishes the video content in part by explicitly partitioning the input video into segments of equal length. Liu is directed to searching video content (Liu, Abstract), and does so by segmenting the video content to distinguish said content prior to searching.
Applicant argues that Liu does not teach the different data formats from each other set forth in amended claims 3 and 9 (Remarks p. 20). In response, examiner respectfully submits that Liu teaches that a video segment can include one or more video frames that correspond to one or more speech utterances (phrases, sentences, etc.), audio scenes, video scenes, and/or any other suitable portions of the video [0049]. A transcript can be generated by extracting audio content from the video, which can include transcoding [0048]. Also see [0051], where an internal video database can include one or more videos and metadata for each video, such as the title, the description entered by the video owner, and a number of formats and locations where the video is available. Also, the audio signal is extracted from one or more videos. Examiner interprets that the video and audio content disclose at least one streaming composite media file, where the audio content discloses a component media stream and the video discloses a component media stream.
Applicant argues that Phillips et al. (Pub. No. US 2021/0193187 A1, hereinafter “Phillips”) does not teach the component media stream recited in claim 3 because the vector type in Phillips does not equate with component media streams (Remarks pp. 20-22). In response, examiner respectfully submits that Phillips teaches that a video is obtained, and video scene vectors are obtained based on the obtained video. The video scene vectors are aligned with the video scenes, and the vectors are semantic representations of the video scenes (Phillips [0046]).
Based on the amendments, the previous rejections have been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Suri et al. (Pub. No. US 2016/0034786 A1, hereinafter “Suri”).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Shekhar.
Regarding claim 1, Liu teaches:
searching for one or more search words in at least one streaming composite media file, each at least one streaming composite media file comprising a plurality of component media streams, the method of searching comprising: (Liu – Fig. 12 is a process for searching for video content. Process 1200 can begin by receiving a search query at 1202 [0080-0081]. A video segment can include one or more video frames that correspond to one or more speech utterances (phrases, sentences, etc.), audio scenes, video scenes, and/or any other suitable portions of the video [0049]. A transcript can be generated by extracting audio content from the video [0048]. Examiner interprets that the video and audio content disclose at least one streaming composite media file, where the audio content discloses a component media stream and the video discloses a component media stream.)
aligning each plurality of component media streams relative to one another and with respect to the same respective time domain, (Liu – Next, at 1204, process 1200 can search for video segments that match the search query. For example, process 1200 can access a database storing transcripts associated with video content and identify transcripts that match the search query. Process 1200 can then identify video segments that match the search query based on the identified transcripts. A video segment that matches a search query can be identified by searching for a portion of a transcript that contains one or more keywords associated with the search query, identifying one or more time stamps associated with the portion of the transcript (i.e. time domain), and identifying a segment of a video that corresponds to the portion of the transcript (i.e. aligning) based on the timestamps [0084]. A transcript associated with the video can be obtained by transcribing audio content associated with the video. The transcript can be generated by extracting audio content from the video (i.e. audio component media stream and video component media stream), processing the audio content (e.g., by segmenting, transcoding, filtering, etc. the audio content), converting the processed audio content to text using a suitable speech recognition technique, and generating a transcript based on the text [0048]. Also see [0051], where a time-aligned transcript is generated.)
determining whether or not any of the one or more sought-after search words are present in the plurality of component media streams on a time segment-by-time segment basis (Liu – Next, at 1204, process 1200 can search for video segments (i.e. time segment) that match the search query (i.e. sought-after search words). For example, process 1200 can access a database storing transcripts associated with video content and identify transcripts that match the search query. Process 1200 can then identify video segments that match the search query based on the identified transcripts. A video segment that matches a search query can be identified by searching for a portion of a transcript that contains one or more keywords associated with the search query, identifying one or more time stamps associated with the portion of the transcript, and identifying a segment of a video that corresponds to the portion of the transcript based on the timestamps [0084]. A segment of the video that matches the search query based on the time stamps associated with the portion of the transcript can be identified. Process 1300 can identify a first frame and a second frame of the video corresponding to a time stamp representative of a start time of the portion of the transcript and a time stamp representative of an end time of the portion of the transcript. The process can then identify a segment of the video including the first frame and the second frame as a video segment corresponding to the portion of the transcript. In a particular example, the boundaries of the segment of the video can be defined by the first frame and the second frame (i.e. time segment-by-time segment basis) [0108].)
in a time segment in which one or more of the sought-after search words are present, assigning a significance value to each match in that time segment based on a closeness of the match between the one or more sought-after search words and the corresponding one or more search words present in that time segment (Liu – process 1200 can calculate a matching score for each of the matching video segments that are identified at 1204 and can then select a set of matching video segments based on the matching scores [0088]. A matching score can be calculated based on any suitable criteria, including based on a relevancy score (i.e. significance value) indicative of a degree to which the video segment matches the search query. In a more particular example, a video segment that includes a greater number of search terms and/or keywords associated with the search query can be regarded as more relevant than a video segment that includes a fewer number of the search terms and/or keywords and can thus be assigned a higher relevance score [0089]. Also see matching score calculated based on a popularity score (i.e. significance value) and based on a recency score (i.e. significance value), where the matching score may be determined as a weighted sum of any combination of the relevancy score, popularity score, and recency score (i.e. significance values) [0090-0092].)
and totaling the significance values assigned to the matches in each respective time segment to obtain a match intensity value for that time segment, (Liu - process 1200 can calculate a matching score (i.e. match intensity value) for each of the matching video segments that are identified at 1204 and can then select a set of matching video segments based on the matching scores [0088]. A matching score can be calculated based on any suitable criteria, including based on a relevancy score, a popularity score, and a recency score. A matching score associated with a video segment can be a weighted sum (i.e. total), weighted average, and/or any other suitable combination of a relevancy score, a popularity score, a recency score, etc. associated with the video segment [0089-0092].)
the match intensity values for respective time segments can be compared to identify comparative relevance to the sought-after search words (Liu – the subset of the video segments can be selected by ranking the matching video segments by matching score (i.e. comparing match intensity values) and selecting multiple matching video segments that are associated with the top N highest matching scores [0088].)
Liu does not appear to teach:
wherein the time domain is divided into a plurality of time segments of equal length
However, Shekhar teaches:
wherein the time domain is divided into a plurality of time segments of equal length (Shekhar – in block 202, the process 200 involves accessing segments of an input video. The summarization engine partitions the input video into segments of equal length [Col. 6 lines 36-49].)
Accordingly, it would have been obvious to a person of ordinary skill in the art at the time the invention was effectively filed, having the teachings of Liu and Shekhar before them, to modify the system of Liu with the teachings of Shekhar as shown above. One would have been motivated to make such a modification to summarize video content in a way that distinguishes the video content (or its summary) from other videos competing for viewers’ attention (Shekhar - [Col. 1 lines 35-41]).
Regarding claim 2, Liu teaches:
wherein the plurality of component media streams includes a mix of data formats including one or more of video, audio, synchronized video and audio, and live computer screen capture streams (Liu – a video segment can include one or more video frames that correspond to one or more speech utterances (phrases, sentences, etc.), audio scenes, video scenes, and/or any other suitable portions of the video [0049]. A transcript can be generated by extracting audio content from the video [0048]. Also see [0051], where an internal video database can include one or more videos and metadata for each video, such as the title, the description entered by the video owner, and a number of formats and locations where the video is available. Also, the audio signal is extracted from one or more videos. Examiner interprets that the video and audio content disclose at least one streaming composite media file, where the audio content discloses a component media stream and the video discloses a component media stream.)
Regarding claim 6, Liu teaches:
searching for one or more sought-after search words in a plurality of streaming composite media files, each streaming composite media file comprising a plurality of component media streams, the method of searching comprising: (Liu – Fig. 12 is a process for searching for video content. Process 1200 can begin by receiving a search query at 1202 [0080-0081]. A video segment can include one or more video frames that correspond to one or more speech utterances (phrases, sentences, etc.), audio scenes, video scenes, and/or any other suitable portions of the video [0049]. A transcript can be generated by extracting audio content from the video [0048]. Examiner interprets that the video and audio content disclose at least one streaming composite media file, where the audio content discloses a component media stream and the video discloses a component media stream.)
for each of the plurality of composite media files, aligning the corresponding plurality of component media streams relative to one another and with respect to the same respective time domain, (Liu – Next, at 1204, process 1200 can search for video segments that match the search query. For example, process 1200 can access a database storing transcripts associated with video content and identify transcripts that match the search query. Process 1200 can then identify video segments that match the search query based on the identified transcripts. A video segment that matches a search query can be identified by searching for a portion of a transcript that contains one or more keywords associated with the search query, identifying one or more time stamps associated with the portion of the transcript (i.e. time domain), and identifying a segment of a video that corresponds to the portion of the transcript (i.e. aligning) based on the timestamps [0084]. A transcript associated with the video can be obtained by transcribing audio content associated with the video. The transcript can be generated by extracting audio content from the video (i.e. audio component media stream and video component media stream), processing the audio content (e.g., by segmenting, transcoding, filtering, etc. the audio content), converting the processed audio content to text using a suitable speech recognition technique, and generating a transcript based on the text [0048]. Also see [0051], where a time-aligned transcript is generated.)
determining whether or not any of the sought-after search words are present in each of the plurality of component media streams on a time segment-by-time segment basis (Liu – Next, at 1204, process 1200 can search for video segments (i.e. time segment) that match the search query (i.e. sought-after search words). For example, process 1200 can access a database storing transcripts associated with video content and identify transcripts that match the search query. Process 1200 can then identify video segments that match the search query based on the identified transcripts. A video segment that matches a search query can be identified by searching for a portion of a transcript that contains one or more keywords associated with the search query, identifying one or more time stamps associated with the portion of the transcript, and identifying a segment of a video that corresponds to the portion of the transcript based on the timestamps [0084]. A segment of the video that matches the search query based on the time stamps associated with the portion of the transcript can be identified. Process 1300 can identify a first frame and a second frame of the video corresponding to a time stamp representative of a start time of the portion of the transcript and a time stamp representative of an end time of the portion of the transcript. The process can then identify a segment of the video including the first frame and the second frame as a video segment corresponding to the portion of the transcript. In a particular example, the boundaries of the segment of the video can be defined by the first frame and the second frame (i.e. time segment-by-time segment basis) [0108].)
in a time segment where one or more of the sought-after search words are present, assigning a significance value to each match in that time segment based on a closeness of the match between the one or more sought-after search words and the corresponding one or more search words present in that time segment (Liu – process 1200 can calculate a matching score for each of the matching video segments that are identified at 1204 and can then select a set of matching video segments based on the matching scores [0088]. A matching score can be calculated based on any suitable criteria, including based on a relevancy score (i.e. significance value) indicative of a degree to which the video segment matches the search query. In a more particular example, a video segment that includes a greater number of search terms and/or keywords associated with the search query can be regarded as more relevant than a video segment that includes a fewer number of the search terms and/or keywords and can thus be assigned a higher relevance score [0089]. Also see matching score calculated based on a popularity score (i.e. significance value) and based on a recency score (i.e. significance value), where the matching score may be determined as a weighted sum of any combination of the relevancy score, popularity score, and recency score (i.e. significance values) [0090-0092].)
totaling the significance values in each respective time segment to obtain a match intensity value for that time segment (Liu - process 1200 can calculate a matching score (i.e. match intensity value) for each of the matching video segments that are identified at 1204 and can then select a set of matching video segments based on the matching scores [0088]. A matching score can be calculated based on any suitable criteria, including based on a relevancy score, a popularity score, and a recency score. A matching score associated with a video segment can be a weighted sum (i.e. total), weighted average, and/or any other suitable combination of a relevancy score, a popularity score, a recency score, etc. associated with the video segment [0089-0092].)
sorting pair values of [Match Intensity, Match Location] in descending order of match intensity, wherein Match Location is specified in terms of an Nth streaming composite media file and the time segment in that streaming composite media file corresponding to that match intensity, and wherein a higher match intensity value is indicative of a comparatively greater relevance to the sought-after search words in the corresponding time segment (Liu – in a more particular example, the subset of the video segments can be selected by ranking (i.e. sorting) the matching video segments by matching score and selecting multiple matching video segments that are associated with the top N highest matching scores [0088].)
Liu does not appear to teach:
wherein the time domain across which the plurality of component media streams extends is divided into a plurality of time segments of equal length
However, Shekhar teaches:
wherein the time domain across which the plurality of component media streams extends is divided into a plurality of time segments of equal length (Shekhar – in block 202, the process 200 involves accessing segments of an input video. The summarization engine partitions the input video into segments of equal length [Col. 6 lines 36-49].)
Accordingly, it would have been obvious to a person of ordinary skill in the art at the time the invention was effectively filed, having the teachings of Liu and Shekhar before them, to modify the system of Liu with the teachings of Shekhar as shown above. One would have been motivated to make such a modification to summarize video content in a way that distinguishes the video content (or its summary) from other videos competing for viewers’ attention (Shekhar - [Col. 1 lines 35-41]).
Regarding claim 16, Liu teaches:
wherein at least some of the component media streams have different data formats from each other (Liu – a video segment can include one or more video frames that correspond to one or more speech utterances (phrases, sentences, etc.), audio scenes, video scenes, and/or any other suitable portions of the video [0049]. A transcript can be generated by extracting audio content from the video [0048]. Also see [0051], where an internal video database can include one or more videos and metadata for each video, such as the title, the description entered by the video owner, and a number of formats and locations where the video is available. Also, the audio signal is extracted from one or more videos. Examiner interprets that the video and audio content disclose at least one streaming composite media file, where the audio content discloses a component media stream and the video discloses a component media stream.)
Claims 3, 4, 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Shekhar, further in view of Phillips.
Regarding claim 3, Liu teaches:
wherein at least some of the component media streams have different data formats from each other, wherein assigning a significance value corresponding to each match in a given time segment comprises assigning a significance value to each match in a given time segment based on a combination of 1) a closeness of the match between the one or more sought-after search words and the one or more search words determined to be present in that time segment, (Liu – process 1200 can calculate a matching score (i.e. match intensity value) for each of the matching video segments that are identified at 1204 and can then select a set of matching video segments based on the matching scores [0088]. A matching score can be calculated based on any suitable criteria, including based on a relevancy score (i.e. significance value) indicative of a degree to which the video segment matches the search query. In a more particular example, a video segment that includes a greater number of search terms and/or keywords associated with the search query (i.e. a closeness of the match) can be regarded as more relevant than a video segment that includes a fewer number of the search terms and/or keywords and can thus be assigned a higher relevance score [0089]. Also see matching score calculated based on a popularity score (i.e. significance value) and based on a recency score (i.e. significance value), where the matching score may be determined as a weighted sum of any combination of the relevancy score, popularity score, and recency score (i.e. significance values) [0090-0092]. Also see [0051], where an internal video database can include one or more videos and metadata for each video, such as the title, the description entered by the video owner, and a number of formats and locations where the video is available. Also, the audio signal is extracted from one or more videos.)
Liu modified by Shekhar does not appear to teach:
and 2) in which component media stream each one or more search word is found
However, Phillips teaches:
and 2) in which component media stream each one or more search word is found (Phillips – the video scene vectors of each of the scenes may include vector types (i.e. component media stream) including an action, an object and a caption, and the processor may be further configured to execute the instructions to obtain vector type weights to be respectively applied based on vector types, based on the user query, and adjust the obtained vector score of each of the scenes, based on the obtained vector type weights [0009]. The video scene vectors may include video scenes (i.e., sequences of successive frames) that are segmented from the obtained video. The video scene vectors may further include videos that are aligned with the video scenes, and vector types of the vectors. The vectors are semantic representations of the video scenes. A vector type may be, for example, an action, an object, or a caption that describes what a corresponding vector is representing in a video scene [0046].)
Accordingly, it would have been obvious to a person of ordinary skill in the art at the time the invention was effectively filed, having the teachings of Liu, Shekhar and Phillips before them, to modify the system of Liu and Shekhar with the teachings of Phillips as shown above. One would have been motivated to make such a modification to find a right portion of a video that answers a query (Phillips [0001-0002]).
Regarding claim 4, Liu teaches:
wherein significance values are determined using a mathematical weighting system which uses: a relatively high weighting factor for an exact match in the same order found in a given time segment; a relatively intermediate weighting factor if all search words match but not in the same order; and a variable, relatively low weighting factor if only some of the sought-after search words match, (Liu – process 1200 can calculate a matching score (i.e. match intensity value) for each of the matching video segments that are identified at 1204 and can then select a set of matching video segments based on the matching scores [0088]. A matching score can be calculated based on any suitable criteria, including based on a relevancy score (i.e. significance value) indicative of a degree to which the video segment matches the search query. In a more particular example, a video segment that includes a greater number of search terms and/or keywords associated with the search query can be regarded as more relevant than a video segment that includes a fewer number of the search terms and/or keywords and can thus be assigned a higher relevance score [0089]. A matching score associated with a video segment can be a weighted sum, a weighted average, and/or any other suitable combination of a relevancy score, a popularity score, a recency score, etc. associated with the video segment (i.e. significance values determined using weighting system) [0090-0092].)
wherein the mathematical weighting system additionally: 1) uses weighting factors depending on the data format of the component media stream in which each one or more search word is found, based on an expectation that search word matches in some component media streams are comparatively more indicative of relevance than matches in other component media streams (Liu - a matching score can be calculated based on any suitable criteria, including based on a relevancy score (i.e. significance value) indicative of a degree to which the video segment matches the search query. In a more particular example, a video segment that includes a greater number (i.e. weight) of search terms and/or keywords associated with the search query (i.e. a closeness of the match) can be regarded as more relevant than a video segment that includes a fewer number of the search terms and/or keywords and can thus be assigned a higher relevance score [0089].)
Liu modified by Shekhar does not appear to teach:
and 2) uses weighting factors higher than the weighting factor for an exact match for search word matches when the sought-after search words are present in ambient data associated with the at least one streaming composite media file
However, Phillips teaches:
and 2) uses weighting factors higher than the weighting factor for an exact match for search word matches when the sought-after search words are present in ambient data associated with the at least one streaming composite media file (Phillips – the video scene vectors of each of the scenes may include vector types (i.e. component media stream) including an action, an object and a caption, and the processor may be further configured to execute the instructions to obtain vector type weights to be respectively applied based on vector types, based on the user query, and adjust the obtained vector score of each of the scenes, based on the obtained vector type weights [0009]. The video scene vectors may include video scenes (i.e., sequences of successive frames) that are segmented from the obtained video. The video scene vectors may further include videos that are aligned with the video scenes, and vector types of the vectors. The vectors are semantic representations of the video scenes. A vector type may be, for example, an action, an object, or a caption that describes what a corresponding vector is representing in a video scene [0046]. Also see [0086], where the process can compare the mood associated with the search query with metadata (i.e. ambient data) associated with the video and determine whether the video matches the mood associated with the search query.)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Liu, Shekhar and Phillips before them, to modify the system of Liu and Shekhar with the teachings of Phillips as shown above. One would have been motivated to make such a modification to find a right portion of a video that answers a query (Phillips [0001-0002]).
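Examiner notes, solely for illustration of the claim 4 weighting scheme discussed above, that the recited three-tier weighting can be sketched as follows. The weight constants and function names below are hypothetical and do not appear in the claims or in any cited reference:

```python
# Illustrative sketch of claim 4's weighting scheme (hypothetical weights).
HIGH, INTERMEDIATE = 1.0, 0.6  # exact in-order match; all words, any order

def significance(search_words, segment_words):
    """Assign a significance value to a search-word match in a time segment."""
    matched = [w for w in search_words if w in segment_words]
    if matched == search_words:
        # All sought-after words are present; check whether they also
        # appear in the same order within the time segment.
        idx = [segment_words.index(w) for w in search_words]
        if idx == sorted(idx):
            return HIGH            # relatively high: exact match, same order
        return INTERMEDIATE        # intermediate: all words match, different order
    # Variable, relatively low weight scaled by the fraction of words matched.
    return 0.3 * len(matched) / len(search_words)
```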
Regarding claim 5, Liu modified by Shekhar does not appear to teach:
generating searchable dynamic text characterizations or descriptions of the content of each component media stream, aligned in time with each respective component media stream
However, Phillips teaches:
generating searchable dynamic text characterizations or descriptions of the content of each component media stream, aligned in time with each respective component media stream (Phillips – the aligning module obtains the video scenes aligned with the captions and the descriptors (i.e. dynamic text characterizations or descriptions of content), and aligns the obtained video scenes with the captions and the obtained descriptors so that the obtained video scenes are respectively matched in time with the actions and the obtained descriptors. The obtained video scenes may be aligned with the captions and the obtained descriptors, using timestamps at which the video scenes, captions and descriptors are respectively obtained [0073].)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Liu, Shekhar and Phillips before them, to modify the system of Liu and Shekhar with the teachings of Phillips as shown above. One would have been motivated to make such a modification to find a right portion of a video that answers a query (Phillips [0001-0002]).
Regarding claim 14, Liu teaches:
wherein the ambient data associated with the at least one streaming composite media file includes at least one of: 1) a title of the at least one streaming composite media file; and 2) a title of a continually displayed document or presentation slide in one of the component media streams (Liu – an internal video database can include one or more videos and metadata (i.e. ambient data) for each video, such as the title, the description entered by the video owner, and a number of formats and locations where the video is available [0051].)
Claims 7, 8, 12, 13 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Shekhar, and further in view of Suri.
Regarding claim 7, Liu teaches:
searching for one or more sought-after search words in a plurality of streaming composite media files, each streaming composite media file comprising a plurality of component media streams, the method of searching comprising: (Liu – Fig. 12 is a process for searching for video content. Process 1200 can begin by receiving a search query at 1202 [0080-0081]. A video segment can include one or more video frames that correspond to one or more speech utterances (phrases, sentences, etc.), audio scenes, video scenes, and/or any other suitable portions of the video [0049]. A transcript can be generated by extracting audio content from the video [0048]. Examiner interprets that the video and audio content disclose at least one streaming composite media file, where the audio content discloses a component media stream and the video discloses a component media stream.)
for each of the plurality of composite media files, aligning the corresponding plurality of component media streams relative to one another and with respect to the same respective time domain, (Liu – Next, at 1204, process 1200 can search for video segments that match the search query. For example, process 1200 can access a database storing transcripts associated with video content and identify transcripts that match the search query. Process 1200 can then identify video segments that match the search query based on the identified transcripts. A video segment that matches a search query can be identified by searching for a portion of a transcript that contains one or more keywords associated with the search query, identifying one or more time stamps associated with the portion of the transcript (i.e. time domain), and identifying a segment of a video that corresponds to the portion of the transcript (i.e. aligning) based on the timestamps [0084]. A transcript associated with the video can be obtained by transcribing audio content associated with the video. The transcript can be generated by extracting audio content from the video (i.e. audio component media stream and video component media stream), processing the audio content (e.g., by segmenting, transcoding, filtering, etc. the audio content), converting the processed audio content to text using a suitable speech recognition technique, and generating a transcript based on the text [0048]. Also see [0051], where a time-aligned transcript is generated.)
determining whether or not any of the sought-after search words are present in each of the plurality of component media streams in each streaming composite media file on a time segment-by-time segment basis (Liu – next, at 1204, process 1200 can search for video segments (i.e. time segment) that match the search query (i.e. sought-after search words). For example, process 1200 can access a database storing transcripts associated with video content and identify transcripts that match the search query. Process 1200 can then identify video segments that match the search query based on the identified transcripts. A video segment that matches a search query can be identified by searching for a portion of a transcript that contains one or more keywords associated with the search query, identifying one or more time stamps associated with the portion of the transcript, and identifying a segment of a video that corresponds to the portion of the transcript based on the timestamps [0084]. A segment of the video that matches the search query based on the time stamps associated with the portion of the transcript can be identified. Process 1300 can identify a first frame and a second frame of the video corresponding to a time stamp representative of a start time of the portion of the transcript and a time stamp representative of an end time of the portion of the transcript. The process can then identify a segment of the video including the first frame and the second frame as a video segment corresponding to the portion of the transcript. In a particular example, the boundaries of the segment of the video can be defined by the first frame and the second frame (i.e. time segment-by-time segment basis) [0108].)
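Examiner notes, solely for illustration, that the transcript-timestamp lookup described in Liu [0084] amounts to the following kind of operation. The function and variable names are hypothetical and are not drawn from Liu:

```python
# Illustrative sketch of locating a video segment from a time-aligned
# transcript, as in Liu [0084] (identifiers are hypothetical).
def segment_for_keyword(transcript, keyword):
    """transcript: list of (start_s, end_s, text) entries aligned to the video.

    Returns the (start, end) boundaries of the first segment whose transcript
    text contains the keyword, or None if no segment matches.
    """
    for start, end, text in transcript:
        if keyword in text:
            # Segment boundaries correspond to the start and end time stamps
            # of the matching portion of the transcript.
            return (start, end)
    return None
```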
in a time segment in which one or more of the sought-after search words are present, assigning a significance value to each search word match in that time segment based at least partly on a closeness of the match between the one or more sought-after search words and the corresponding one or more search words present in that time segment (Liu – process 1200 can calculate a matching score for each of the matching video segments that are identified at 1204 and can then select a set of matching video segments based on the matching scores [0088]. A matching score can be calculated based on any suitable criteria, including based on a relevancy score (i.e. significance value) indicative of a degree to which the video segment matches the search query. In a more particular example, a video segment that includes a greater number of search terms and/or keywords associated with the search query can be regarded as more relevant than a video segment that includes a fewer number of the search terms and/or keywords and can thus be assigned a higher relevance score [0089]. Also see matching score calculated based on a popularity score (i.e. significance value) and based on a recency score (i.e. significance value), where the matching score may be determined as a weighted sum of any combination of the relevancy score, popularity score, and recency score (i.e. significance values) [0090-0092].)
totaling the significance values assigned to the matches in each respective time segment to obtain a match intensity value for that time segment (Liu - process 1200 can calculate a matching score (i.e. match intensity value) for each of the matching video segments that are identified at 1204 and can then select a set of matching video segments based on the matching scores [0088]. A matching score can be calculated based on any suitable criteria, including based on a relevancy score, a popularity score, and a recency score. A matching score associated with a video segment can be a weighted sum (i.e. total), weighted average, and/or any other suitable combination of a relevancy score, a popularity score, a recency score, etc. associated with the video segment [0089-0092].)
storing the streaming composite media files having non-zero match density values in a separate set MATCHES; and sorting the streaming composite media files in set MATCHES by match intensities to obtain an ordered set of [Match Intensity, Location], where Location corresponds to a unique combination of one of the streaming composite media files in MATCHES and a time segment therein (Liu – in a more particular example, the subset of the video segments can be selected by ranking (i.e. sorting) the matching video segments by matching score and selecting multiple matching videos segments that are associated with the top N highest matching scores [0088]. Video database can include any suitable device that can store videos, metadata associated with each of the videos (e.g., a description of a video, a title of a video, a tag associated with a video, an author of a video, and/or any other suitable metadata associated with a video), and/or any other suitable video data [0045].)
Liu does not appear to teach:
wherein the time domain across which the plurality of component media streams extends is divided into a plurality of time segments of equal length
totaling the match intensity values over all of the time segments to obtain a match weight value for that streaming composite media file; dividing the match weight value by the number of time segments into which the time domain is divided to obtain a match density value for the corresponding streaming composite media file, the match density value being comparable against a match density value obtained for a different one of the streaming composite media files relative to the same sought-after one or more search words, wherein a comparatively higher match density value indicates a greater relevance to the sought-after one or more search words than a different streaming composite media file with a lower match density value
However, Shekhar teaches:
wherein the time domain across which the plurality of component media streams extends is divided into a plurality of time segments of equal length (Shekhar – in block 202, the process 200 involves accessing segments of an input video. The summarization engine partitions the input video into segments of equal length [Col. 6 lines 36-49].)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Liu and Shekhar before them, to modify the system of Liu with the teachings of Shekhar as shown above. One would have been motivated to make such a modification to summarize video content in a way that distinguishes the video content (or its summary) from other videos competing for viewers’ attention (Shekhar - [Col. 1 lines 35-41]).
Liu modified by Shekhar does not appear to teach:
totaling the match intensity values over all of the time segments to obtain a match weight value for that streaming composite media file; dividing the match weight value by the number of time segments into which the time domain is divided to obtain a match density value for the corresponding streaming composite media file, the match density value being comparable against a match density value obtained for a different one of the streaming composite media files relative to the same sought-after one or more search words, wherein a comparatively higher match density value indicates a greater relevance to the sought-after one or more search words than a different streaming composite media file with a lower match density value
However, Suri teaches:
totaling the match intensity values over all of the time segments to obtain a match weight value for that streaming composite media file; dividing the match weight value by the number of time segments into which the time domain is divided to obtain a match density value for the corresponding streaming composite media file, the match density value being comparable against a match density value obtained for a different one of the streaming composite media files relative to the same sought-after one or more search words, wherein a comparatively higher match density value indicates a greater relevance to the sought-after one or more search words than a different streaming composite media file with a lower match density value (Suri – the scoring module may be used to determine a desirability score for video frames, video segments, video files, and/or video collections [0060]. The scoring module may determine a desirability score for a video segment, video file and/or a video collection by adding together the desirability scores (i.e. match weight value) associated with the video frames in the video segment, video file, and/or video collection, and finding an average desirability score (i.e. dividing) based on the number of video frames in the video segment, video file, and/or video collection [0065].)
Accordingly, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention, having the teachings of Liu, Shekhar and Suri before them, to modify the system of Liu and Shekhar with the teachings of Suri as shown above. One would have been motivated to make such a modification to identify desirable portions of video content, which may include audio data, in video data (Suri [0004, 0023]).
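Examiner notes, solely for illustration of the match weight and match density limitations of claim 7 discussed above, that the recited arithmetic can be sketched as follows. The function names and numeric values are hypothetical and do not appear in the claims or in any cited reference:

```python
# Illustrative sketch of claim 7's match-weight / match-density arithmetic.
def match_weight(intensity_per_segment):
    """Total the per-segment match intensity values for one composite file."""
    return sum(intensity_per_segment)

def match_density(intensity_per_segment):
    """Divide the match weight by the number of equal-length time segments."""
    return match_weight(intensity_per_segment) / len(intensity_per_segment)

# Two files with the same total match weight: the file whose matches are
# spread over fewer time segments has the higher match density and thus
# indicates greater relevance to the sought-after search words.
file_a = [2.0, 0.0, 1.0, 1.0]                        # weight 4.0, density 1.0
file_b = [2.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]    # weight 4.0, density 0.5
```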
Regarding claim 8, Liu teaches:
wherein the plurality of component media streams includes a mix of data formats including one or more of video, audio, synchronized video and audio, and live computer screen capture streams (Liu – a video segment can include one or more video frames that correspond to one or more speech utterances (phrases, sentences, etc.), audio scenes, video scenes, and/or any other suitable portions of the video [0049]. A transcript can be generated by extracting audio content from the video [0048]. Also see [0051], where an internal video database can include one or more videos and metadata for each video, such as the title, the description entered by the video owner, and a number of formats and locations where the video is available. Also, the audio signal is extract