DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The lengthy specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant’s cooperation is requested in correcting any errors of which applicant may become aware in the specification.
Claim Objections
Claim 1 is objected to because of the following informalities: Line 5 of claim 1 recites “determining, based on the plurality of shot boundaries, the frames into a plurality of shots;” which appears to contain a grammatical error, inconsistent claim terminology and/or minor informalities. The Examiner suggests amending line 5 of claim 1 to --dividing, based on the plurality of shot boundaries, the sequence of frames into a plurality of shots;-- in order to maintain consistency with line 2 of claim 1 and to improve the clarity and precision of the claim. Appropriate correction is required.
Claim 1 is objected to because of the following informalities: Lines 10 - 11 of claim 1 recite, in part, “plurality of shots; determining, based on the information” which appears to contain a grammatical error and/or a minor informality. The Examiner suggests amending the claim to --plurality of shots; and determining, based on the information-- in order to improve the clarity and precision of the claim. Appropriate correction is required.
Claim 1 is objected to because of the following informalities: Lines 11 - 12 of claim 1 recite, in part, “that one or more of the shot boundaries” which appears to contain inconsistent claim terminology and/or a minor informality. The Examiner suggests amending the claim to --that one or more of the plurality of shot boundaries-- in order to maintain consistency with lines 3 - 4 of claim 1 and to improve the clarity and precision of the claim. Appropriate correction is required.
Claim 7 is objected to because of the following informalities: Line 2 and line 4 of claim 7 recite, in part, “the different first models” which appears to contain inconsistent claim terminology and/or minor informalities. The Examiner suggests amending line 2 and line 4 of claim 7 to --the plurality of different first models-- in order to maintain consistency with line 1 of claim 7 and to improve the clarity and precision of the claim. Appropriate correction is required.
Claim 9 is objected to because of the following informalities: Line 2 of claim 9 recites, in part, “segments of the content item” which appears to contain inconsistent claim terminology and/or a minor informality. The Examiner suggests amending the claim to --segments of the primary content item-- in order to maintain consistency with line 2 of claim 1 and to improve the clarity and precision of the claim. Appropriate correction is required.
Claim 11 is objected to because of the following informalities: Lines 3 - 4 of claim 11 recite, in part, “determining the frames into a plurality of shots based on shot boundaries; generating information,” which appears to contain grammatical errors, inconsistent claim terminology and/or minor informalities. The Examiner suggests amending the claim to --dividing the sequence of frames into a plurality of shots based on shot boundaries; and generating information,-- in order to maintain consistency with line 2 of claim 11 and to improve the clarity and precision of the claim. Appropriate correction is required.
Claim 11 is objected to because of the following informalities: Line 5 of claim 11 recites, in part, “wherein each model pair comprises:” which appears to contain a grammatical error, inconsistent claim terminology and/or a minor informality. The Examiner suggests amending the claim to --wherein each model pair of the plurality of model pairs comprises:-- in order to improve the clarity and precision of the claim. Appropriate correction is required.
Claim 16 is objected to because of the following informalities: Line 4 of claim 16 recites, in part, “of the model pairs;” which appears to contain a grammatical error, inconsistent claim terminology and/or a minor informality. The Examiner suggests amending the claim to --of the plurality of model pairs;-- in order to improve the clarity and precision of the claim. Appropriate correction is required.
Claim 19 is objected to because of the following informalities: Line 2 and line 4 of claim 19 recite, in part, “the different self-attention models” which appears to contain inconsistent claim terminology and/or minor informalities. The Examiner suggests amending line 2 and line 4 of claim 19 to --the plurality of different self-attention models-- in order to maintain consistency with lines 1 - 2 of claim 19 and to improve the clarity and precision of the claim. Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 2 - 5 and 7 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding each of claims 2 - 5 (line 1 of each claim) and claim 7 (line 2), it is unclear which shot “the shot” is referencing, at least because line 5 of claim 1 recites, in part, “a plurality of shots;”, making clear that there are a plurality of shots. The Examiner therefore asserts that it is unclear which shot of the plurality of shots, if any, “the shot” recited in each of these claims is referencing. Clarification and appropriate correction are required. For purposes of examination, the Examiner will treat each of these claims as referencing any shot of the plurality of shots.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1 - 9, 11 - 15 and 17 - 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception, an abstract idea, without significantly more. The claims are directed towards generating information indicating inter-shot relationships between different shots of a primary content item and/or determining scene boundaries of the primary content item.
The claims recite, at a high level of generality, identifying, based on comparing images of sequential frames of the sequence of frames, a plurality of shot boundaries in the sequence of frames, determining [dividing], based on the plurality of shot boundaries, the frames into a plurality of shots, generating information indicating areas of attention in the frames of each of the plurality of shots, generating information indicating inter-shot relationships between frames of different shots of the plurality of shots, and determining, based on the information indicating inter-shot relationships, that one or more of the shot boundaries are scene boundaries in the primary content item.
The limitation of “identifying, based on comparing images of sequential frames of the sequence of frames, a plurality of shot boundaries in the sequence of frames”, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind using observation, evaluation, judgment, and opinion. For example, the claimed identifying, based on comparing images of sequential frames of the sequence of frames, a plurality of shot boundaries in the sequence of frames encompasses a user viewing a sequence of frames of a content item and performing an evaluation by mentally detecting (identifying) transitions (shot boundaries) between shots in the sequence of frames. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, with or without the use of a physical aid such as pen and paper, then it falls within the “Mental Processes” grouping of abstract ideas. See MPEP § 2106.04(a)(2)(III).
Similarly, the limitation of “determining [dividing], based on the plurality of shot boundaries, the frames into a plurality of shots”, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind using observation, evaluation, judgment, and opinion. For example, the claimed determining [dividing], based on the plurality of shot boundaries, the frames into a plurality of shots encompasses a user viewing a sequence of frames of a content item and identifying which frames of the content item belong to each of one or more shots based on detected transitions (shot boundaries) between shots in the sequence of frames. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, with or without the use of a physical aid such as pen and paper, then it falls within the “Mental Processes” grouping of abstract ideas. See MPEP § 2106.04(a)(2)(III).
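The two limitations characterized above describe a conventional frame-comparison pipeline. Purely as a generic illustration (the function names, the flat pixel-list frame representation, and the threshold value below are all hypothetical and are not drawn from the claims or the cited art), the identifying and dividing steps might be sketched as:

```python
# Hypothetical sketch: identifying shot boundaries by comparing images of
# sequential frames, then dividing the frame sequence into shots.

def frame_difference(frame_a, frame_b):
    """Sum of absolute pixel differences between two equal-size frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b))

def identify_shot_boundaries(frames, threshold):
    """Declare a boundary between frame i-1 and frame i when the
    inter-frame difference exceeds the threshold."""
    return [i for i in range(1, len(frames))
            if frame_difference(frames[i - 1], frames[i]) > threshold]

def divide_into_shots(frames, boundaries):
    """Split the frame sequence into contiguous shots at the boundaries."""
    shots, start = [], 0
    for b in boundaries:
        shots.append(frames[start:b])
        start = b
    shots.append(frames[start:])
    return shots

# Two "shots": dark frames followed by bright frames (frames are flat
# pixel lists here purely for simplicity).
frames = [[0, 0, 0]] * 3 + [[9, 9, 9]] * 2
boundaries = identify_shot_boundaries(frames, threshold=10)  # -> [3]
shots = divide_into_shots(frames, boundaries)                # two shots
```

As the sketch suggests, such a comparison could equally be performed mentally by a viewer noting abrupt visual changes, which is the point of the mental-process characterization above.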
Relatedly, the limitation of “generating, based on applying a first model to frames of each of the plurality of shots, information indicating areas of attention in the frames of each of the plurality of shots”, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind using observation, evaluation, judgment, and opinion. For example, the claimed generating, based on applying a first model to frames of each of the plurality of shots, information indicating areas of attention in the frames of each of the plurality of shots encompasses a user observing frames of each of the plurality of shots and identifying one or more regions of the frames that are of interest. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, with or without the use of a physical aid such as pen and paper, then it falls within the “Mental Processes” grouping of abstract ideas. See MPEP § 2106.04(a)(2)(III).
Likewise, the limitation of “generating, based on applying a second model to the information indicating the areas of attention, information indicating inter-shot relationships between frames of different shots of the plurality of shots”, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind using observation, evaluation, judgment, and opinion. For example, the claimed generating, based on applying a second model to the information indicating the areas of attention, information indicating inter-shot relationships between frames of different shots of the plurality of shots encompasses a user observing the regions of the frames that they identified as being of interest for each of the shots and determining similar identified regions of interest that appear in different shots. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, with or without the use of a physical aid such as pen and paper, then it falls within the “Mental Processes” grouping of abstract ideas. See MPEP § 2106.04(a)(2)(III).
Lastly, the limitations of “determining, based on the information indicating inter-shot relationships, that one or more of the shot boundaries are scene boundaries in the primary content item” and/or “sending the inter-shot information to a prediction model for identifying scene boundaries within the content item”, as drafted, are processes that, under their broadest reasonable interpretations, cover performance of the limitations in the mind using observation, evaluation, judgment, and opinion. For example, the claimed determining, based on the information indicating inter-shot relationships, that one or more of the shot boundaries are scene boundaries in the primary content item and the claimed sending the inter-shot information to a prediction model for identifying scene boundaries within the content item encompass a user evaluating the regions of interest identified in different shots that they determined to be similar to each other and deciding whether or not the different shots belong to the same scene or different scenes based on the regions of interest identified in the different shots. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, with or without the use of a physical aid such as pen and paper, then it falls within the “Mental Processes” grouping of abstract ideas. See MPEP § 2106.04(a)(2)(III).
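Purely as a generic illustration of the kind of evaluation the scene-boundary limitation encompasses (the feature vectors, similarity measure and cutoff below are hypothetical and not drawn from the claims or the cited art), a shot boundary might be kept as a scene boundary only when the shots on either side of it are dissimilar:

```python
# Hypothetical sketch: keeping a shot boundary as a scene boundary when
# inter-shot similarity across that boundary is low.

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

def scene_boundaries(shot_features, shot_boundaries, min_similarity=0.5):
    """A shot boundary survives as a scene boundary when the adjacent
    shots' feature vectors are dissimilar (low inter-shot similarity)."""
    return [b for k, b in enumerate(shot_boundaries)
            if cosine_similarity(shot_features[k], shot_features[k + 1]) < min_similarity]

# Three shots; the first two are similar (same scene), the third is not.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
bounds = [120, 250]  # frame indices of the two shot boundaries
kept = scene_boundaries(features, bounds)  # only the boundary at frame 250
```

The same judgment, as noted above, could be reached by a viewer mentally comparing shots and deciding which transitions mark new scenes.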
This judicial exception is not integrated into a practical application. In particular, the claims recite additional elements of: “receiving… a sequence of frames of a primary content item”, “receiving… intra-shot information indicating areas of attention in frames of each of a plurality of shots of a content item” and “a computing device”.
The limitations of “receiving… a sequence of frames of a primary content item” and “receiving… intra-shot information indicating areas of attention in frames of each of a plurality of shots of a content item” are mere pre-solution activity, data gathering, recited at a high level of generality, and thus are insignificant extra-solution activity. See MPEP § 2106.05(g). In addition, all uses of the recited judicial exception require such data gathering, and, as such, these limitations do not impose any meaningful limits on the claims. These limitations amount to necessary data gathering. See MPEP § 2106.05.
Further, the limitation of “a computing device” is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components. Furthermore, the claims as a whole merely describe how to generally “apply” the concept(s) of generating information indicating inter-shot relationships between different shots of a primary content item and/or determining scene boundaries of the primary content item in a computer environment. Simply implementing the abstract idea on a generic computer is not a practical application of the abstract idea. See MPEP § 2106.05(f).
Even when viewed in combination, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. Accordingly, the claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements of: “receiving… a sequence of frames of a primary content item” and “receiving… intra-shot information indicating areas of attention in frames of each of a plurality of shots of a content item” are mere pre-solution activity, data gathering, recited at a high level of generality, and are thus insignificant extra-solution activity. Furthermore, the additional element of: “a computing device” amounts to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Even when considered in combination, these additional elements represent mere instructions to implement an abstract idea or other exception on a computer and insignificant extra-solution activity, which do not provide an inventive concept. The claims are not patent eligible.
Furthermore, with regard to dependent claims 18 and 19, the limitations of “applying a self-attention model to the frames of the content item”, “providing output from the self-attention model to a gated state space model”, “applying a plurality of different self-attention models to the frames of the content item, wherein the different self-attention models are configured to focus on different types of visual features” and “wherein each of the different self-attention models is configured to provide output to a corresponding gated state space model”, as drafted, are processes that, under their broadest reasonable interpretations, cover performance of the limitations in the mind but for the recitation of generic computer components, a self-attention model(s) and a gated state space model(s). That is, other than reciting a self-attention model(s) and a gated state space model(s), nothing in the claim elements precludes the steps from practically being performed in the mind. The Examiner asserts that the claims neither provide any details of nor limit how the models operate or how their functions are performed, and the plain meanings of applying, providing and output[ting] encompass mental observations, evaluations, judgments, and/or opinions, e.g., a user observing frames of the content item and identifying one or more regions of interest. Under their broadest reasonable interpretations when read in light of the specification, applying, providing and output[ting] therefore encompass mental processes practically performed in the human mind by observation(s), evaluation(s), judgment(s) and/or opinion(s). See MPEP § 2106.04(a)(2)(I) and § 2106.04(a)(2)(III).
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. The claims are not patent eligible.
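For context only, the data flow these limitations recite, i.e. self-attention over frame features whose output feeds a gated state space model, can be sketched generically. The toy implementation below uses identity Q/K/V projections, a single scalar gate, and hypothetical names and values throughout; it illustrates the kind of computation involved and does not purport to reproduce the claimed models or any model from the cited art:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(frame_feats):
    """Single-head self-attention: each frame attends to every frame.
    Identity Q/K/V projections are used purely to keep the sketch short."""
    out = []
    for q in frame_feats:
        scores = softmax([sum(a * b for a, b in zip(q, k)) for k in frame_feats])
        out.append([sum(w * v[d] for w, v in zip(scores, frame_feats))
                    for d in range(len(q))])
    return out

def gated_recurrence(seq, gate=0.8):
    """Greatly simplified gated state update over a sequence:
    state = gate * state + (1 - gate) * input."""
    state = [0.0] * len(seq[0])
    states = []
    for x in seq:
        state = [gate * s + (1 - gate) * xi for s, xi in zip(state, x)]
        states.append(state)
    return states

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical per-frame features
attended = self_attention(feats)       # attention output per frame
states = gated_recurrence(attended)    # sequence of gated states
```

Note that the claims place no such implementation details on the recited models; the sketch simply makes concrete the observation that “applying” a model and “providing output” to another model are recited at this level of generality.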
Moreover, with regard to dependent claims 18 and 19, the limitations of “applying a self-attention model to the frames of the content item”, “providing output from the self-attention model to a gated state space model”, “applying a plurality of different self-attention models to the frames of the content item, wherein the different self-attention models are configured to focus on different types of visual features” and “wherein each of the different self-attention models is configured to provide output to a corresponding gated state space model” provide nothing more than mere instructions to implement an abstract idea on a generic computer. See MPEP § 2106.05(f). MPEP § 2106.05(f) provides the following considerations for determining whether a claim simply recites a judicial exception with the words “apply it” (or an equivalent), such as mere instructions to implement an abstract idea on a computer: (1) whether the claim recites only the idea of a solution or outcome, i.e., the claim fails to recite details of how a solution to a problem is accomplished; (2) whether the claim invokes computers or other machinery merely as a tool to perform an existing process; and (3) the particularity or generality of the application of the judicial exception. Moreover, the aforementioned models are used to generally apply the abstract idea without placing any limits on how the aforementioned models function. See MPEP § 2106.05(f). Additionally, the recitations of a self-attention model(s) and a gated state space model(s) merely indicate a field of use or technological environment in which the judicial exception is performed.
Although the additional elements of a self-attention model(s) and a gated state space model(s) limit the identified judicial exception of generating information indicating inter-shot relationships between different shots of a primary content item and/or determining scene boundaries of the primary content item, these types of limitations merely confine the use of the abstract idea to a particular technological environment (machine learning) and thus fail to add an inventive concept to the claims. See MPEP § 2106.05(h). The claims are not patent eligible.
In addition, with regard to dependent claims 2 - 9, 12 - 15 and 18 - 20, the Examiner asserts that these claims are also directed to the abstract idea(s) of generating information indicating inter-shot relationships between different shots of a primary content item and/or determining scene boundaries of the primary content item, and that they merely further limit the abstract idea(s) claimed in independent claims 1, 11 and 17, for example by further identifying visual features to consider when generating the information indicating areas of attention and/or inter-shot information, and/or by reciting outputting at a high level of generality corresponding to insignificant extra-solution activity. However, the Examiner asserts that a more detailed abstract idea remains an abstract idea and that none of the limitations of dependent claims 2 - 9, 12 - 15 and 18 - 20, considered as an ordered combination, provide eligibility because, taken as a whole, the claims merely instruct the practitioner to apply the abstract idea using generic computer components. The claims are not eligible.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 7 - 12 and 14 - 16 are rejected under 35 U.S.C. 103 as being unpatentable over Mun et al., “Boundary-Aware Self-Supervised Learning For Video Scene Segmentation”, arXiv:2201.05277v1, 14 Jan. 2022, pages 1 - 20, herein referred to as “Mun et al.”, in view of Chu et al., U.S. Publication No. 2019/0147105 A1.
- With regard to claim 1, Mun et al. disclose a method (Mun et al., Pg. 1 Abstract, Pg. 3 § 3 - Pg. 4 § 3.2, Pg. 10 § 5) comprising: receiving a sequence of frames of a primary content item; (Mun et al., Pg. 1 Abstract, Pg. 1 § 1 - Pg. 2 Third-Full Paragraph, Pg. 3 § 3.1 - Pg. 4 § 3.2, Pg. 7 § 4.1) identifying a plurality of shot boundaries in the sequence of frames; (Mun et al., Pg. 1 Abstract, Pg. 1 § 1 - Pg. 2 Third-Full Paragraph, Pg. 3 § 3.1 - Pg. 4 § 3.2, Pg. 7 § 4.1) determining the frames into a plurality of shots; (Mun et al., Pg. 1 Abstract, Pg. 1 § 1 - Pg. 2 Third-Full Paragraph, Pg. 3 § 3.1 - Pg. 5 § 3.3, Pg. 7 § 3.5 - § 4.1) generating, based on applying a first model to frames of each of the plurality of shots, information indicating areas of attention in the frames of each of the plurality of shots; (Mun et al., Pgs. 4 - 5 § 3.2, Pg. 4 Figs. 2 & 3, Pg. 6 § 3.4, Pgs. 7 - 8 § 4.1, Pg. 14 § A) generating, based on applying a second model to the information indicating the areas of attention, information indicating inter-shot relationships between frames of different shots of the plurality of shots; (Mun et al., Pg. 3 § 3.1 - Pg. 4 § 3.2, Pg. 4 Figs. 2 & 3, Pg. 6 § 3.4 - Pg. 7 § 4.1) determining, based on the information indicating inter-shot relationships, that one or more of the shot boundaries are scene boundaries in the primary content item. (Mun et al., Pg. 1 Abstract, Pgs. 1 - 2 § 1, Pg. 3 § 3 - Pg. 4 § 3.2, Pg. 6 § 3.4 - Pg. 7 § 3.5) Mun et al. fail to disclose explicitly a computing device; and identifying, based on comparing images of sequential frames of the sequence of frames, a plurality of shot boundaries. Pertaining to analogous art, Chu et al. disclose a method (Chu et al., Abstract) comprising: receiving, by a computing device, (Chu et al., Abstract, Fig. 1A, Pg. 1 ¶ 0022 - 0023, Pg. 2 ¶ 0038 - Pg. 3 ¶ 0040, Pg. 3 ¶ 0048, Pg. 4 ¶ 0056, Pg. 5 ¶ 0068 - Pg. 6 ¶ 0075) a sequence of frames of a primary content item; (Chu et al., Pg.
1 ¶ 0012, Pg. 2 ¶ 0035, Pg. 2 ¶ 0039 - Pg. 3 ¶ 0040, Pg. 3 ¶ 0042 - 0046 and 0049) identifying, based on comparing images of sequential frames of the sequence of frames, a plurality of shot boundaries in the sequence of frames; (Chu et al., Figs. 1A - 2, Pg. 1 ¶ 0005 and 0008 - 0009, Pg. 3 ¶ 0042 and 0049) determining, based on the plurality of shot boundaries, the frames into a plurality of shots; (Chu et al., Figs. 1A - 2, Pg. 1 ¶ 0005 and 0008 - 0009, Pg. 3 ¶ 0042 and 0049) generating, based on applying a first model to frames of each of the plurality of shots, information indicating areas of attention in the frames of each of the plurality of shots; (Chu et al., Abstract, Figs. 1A, 2 & 3, Pg. 1 ¶ 0010 - 0015 and 0019, Pg. 4 ¶ 0057 - Pg. 5 ¶ 0062) generating, based on applying a second model to the information indicating the areas of attention, information indicating inter-shot relationships between frames of different shots of the plurality of shots; (Chu et al., Abstract, Figs. 1A & 2, Pg. 1 ¶ 0005 and 0020 - 0021, Pg. 2 ¶ 0026 and 0036, Pg. 3 ¶ 0044 - 0046, Pg. 4 ¶ 0051 - 0055, Pg. 5 ¶ 0066 - 0067) determining, based on the information indicating inter-shot relationships, that one or more of the shot boundaries are scene boundaries in the primary content item. (Chu et al., Abstract, Figs. 1A & 2, Pg. 1 ¶ 0005 and 0020 - 0021, Pg. 2 ¶ 0026 and 0036, Pg. 3 ¶ 0044 - 0046, Pg. 4 ¶ 0051 - 0055, Pg. 5 ¶ 0066 - 0067) Mun et al. and Chu et al. are combinable because they are both directed towards image processing methods for video scene segmentation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Mun et al. with the teachings of Chu et al. A first modification would have been prompted in order to enhance the base device of Mun et al. with the well-known and applicable technique Chu et al. applied to a comparable device. 
Utilizing a computing device to implement a method, as taught by Chu et al., would enhance the base device of Mun et al. by allowing for it to be implemented accurately and efficiently at high computational speed on computer architecture. Additionally, a second modification would have been prompted in order to enhance the base device of Mun et al. with the well-known and applicable technique Chu et al. applied to a comparable device. Comparing images of sequential frames of the sequence of frames to identify a plurality of shot boundaries, as taught by Chu et al., would enhance the base device of Mun et al. by allowing for it to be utilized on videos that have not been previously partitioned into a plurality of shots so as to enable the base device of Mun et al. to be applied to an increased number of videos and thereby enhancing the overall appeal and usefulness of the base device of Mun et al. to potential end-users. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that a computing device would be utilized to perform the operations of the base device of Mun et al. so as to ensure that its operations are carried out accurately, efficiently and at high computational speed on computer architecture and in that images of sequential frames of a sequence of frames of a primary content item would be compared to identify a plurality of shot boundaries so as to allow for the base device of Mun et al. to be utilized on an increased number of videos, such as videos that have not been previously partitioned into a plurality of shots, and thereby enhance the overall appeal and usefulness of the base device of Mun et al. to potential end-users. Therefore, it would have been obvious to combine Mun et al. with Chu et al. to obtain the invention as specified in claim 1.
- With regards to claim 7, Mun et al. in view of Chu et al. disclose the method of claim 1, further comprising applying a plurality of different first models to the frames of the shot, (Mun et al., Pg. 1 Abstract, Pgs. 4 - 5 § 3.2, Pg. 4 Fig. 3, Pgs. 7 - 8 § 4.1) wherein the different first models are configured to focus on different types of visual features; (Mun et al., Pg. 1 Abstract, Pgs. 4 - 5 § 3.2, Pg. 4 Fig. 3, Pgs. 7 - 8 § 4.1) and wherein each of the different first models is configured to provide output to a corresponding second model. (Mun et al., Pg. 1 Abstract, Pg. 4 Fig. 3, Pgs. 6 - 7 § 3.4) In addition, analogous art Chu et al. disclose applying a plurality of different first models to the frames of the shot, (Chu et al., Pg. 1 ¶ 0005 and 0010 - 0016, Pg. 4 ¶ 0058 - Pg. 5 ¶ 0062) wherein the different first models are configured to focus on different types of visual features; (Chu et al., Pg. 1 ¶ 0005 and 0010 - 0016, Pg. 4 ¶ 0058 - Pg. 5 ¶ 0062) and wherein each of the different first models is configured to provide output to a corresponding second model. (Chu et al., Abstract, Figs. 1A & 2, Pg. 1 ¶ 0005 and 0020 - 0021, Pg. 2 ¶ 0026 and 0036, Pg. 3 ¶ 0044 - 0046, Pg. 4 ¶ 0051 - 0055, Pg. 5 ¶ 0066 - 0067)
- With regards to claim 8, Mun et al. in view of Chu et al. disclose the method of claim 1, further comprising: using a first pair of a first model and a corresponding second model to focus on spatio-temporal patterns; (Mun et al., Pg. 1 Abstract, Pg. 4 § 3.2 - Pg. 6 § 3.4, Pg. 4 Fig. 3, Pgs. 7 - 8 § 4.1) and using a second pair of a first model and a corresponding second model to focus on contextualized features. (Mun et al., Pg. 1 Abstract, Pgs. 4 - 5 § 3.2, Pg. 4 Fig. 3, Pg. 6 § 3.4, Pgs. 7 - 8 § 4.1) Mun et al. fail to disclose explicitly using a first pair of models to focus on faces; and using a second pair of models to focus on objects. Pertaining to analogous art, Chu et al. disclose using a first pair of a first model and a corresponding second model to focus on faces; (Chu et al., Abstract, Figs. 1A & 2, Pg. 1 ¶ 0005 and 0019 - 0021, Pg. 2 ¶ 0026 and 0036, Pg. 3 ¶ 0044 - 0046, Pg. 4 ¶ 0051 - 0055, Pg. 5 ¶ 0061 and 0066 - 0067) and using a second pair of a first model and a corresponding second model to focus on objects. (Chu et al., Abstract, Figs. 1A & 2, Pg. 1 ¶ 0005, 0015 and 0020 - 0021, Pg. 2 ¶ 0026 and 0036, Pg. 3 ¶ 0044 - 0046, Pg. 4 ¶ 0051 - 0055 and 0058, Pg. 5 ¶ 0061 and 0066 - 0067) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with additional teachings of Chu et al. This modification would have been prompted in order to enhance the combined base device of Mun et al. in view of Chu et al. with the well-known and applicable technique Chu et al. applied to a comparable device. Utilizing different models to focus on faces and objects, respectively, as taught by Chu et al., would enhance the combined base device by improving its ability to identify and analyze various semantic content of interest in a video when attempting to locate moments with a maximum semantic transition. 
Furthermore, this modification would have been prompted by the teachings and suggestions of Mun et al. that their method attempts to divide a sequence so that semantics, such as places and characters, maximally change, see at least page 2 figure 1, page 5 section 3.3 and page 5 figure 4 of Mun et al. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that different models focusing on faces and objects, respectively, would be utilized to improve the ability of the combined base device to locate moments in a video wherein semantics maximally change. Therefore, it would have been obvious to combine Mun et al. in view of Chu et al. with additional teachings of Chu et al. to obtain the invention as specified in claim 8.
- With regards to claim 9, Mun et al. in view of Chu et al. disclose the method of claim 1, further comprising using the scene boundaries to generate different video segments of the content item. (Mun et al., Pg. 1 Abstract, Pg. 1 § 1 ¶ 1, Pg. 5 § 3.3, Pg. 5 Fig. 4, Pg. 7 § 3.5, Pg. 10 § 5, Pg. 20 Fig. 8) In addition, analogous art Chu et al. disclose using the scene boundaries to generate different video segments of the content item. (Chu et al., Abstract, Figs. 1A - 2, Pg. 1 ¶ 0005 - 0006 and 0020 - 0021, Pg. 2 ¶ 0024 - 0026, 0035 - 0036 and 0039, Pg. 3 ¶ 0044 - 0047, Pg. 4 ¶ 0052 - 0055, Pg. 5 ¶ 0065 - 0067)
- With regards to claim 10, Mun et al. in view of Chu et al. disclose the method of claim 1. Mun et al. fail to disclose explicitly adding a secondary content item to the primary content item at a location that is based on one of the scene boundaries; and causing transmission of a modified primary content item comprising the added secondary content item. Pertaining to analogous art, Chu et al. disclose adding a secondary content item to the primary content item at a location that is based on one of the scene boundaries; (Chu et al., Pg. 2 ¶ 0024) and a modified primary content item comprising the added secondary content item. (Chu et al., Pg. 2 ¶ 0024) Chu et al. fail to disclose explicitly causing transmission of the modified primary content item. However, the Examiner takes official notice of the fact that causing transmission of a modified primary content item is notoriously well-known in the art. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with additional teachings of Chu et al. This modification would have been prompted in order to enhance the combined base device of Mun et al. in view of Chu et al. with the well-known and applicable technique Chu et al. applied to a comparable device. Adding a secondary content item to the primary content item at a location that is based on one of the scene boundaries to produce a modified primary content item, as taught by Chu et al., would enhance the combined base device by allowing for it to utilize the scene boundaries it determined in a number of related video processing functions and/or applications, such as video advertisement and/or information insertion applications, so as to expand the number and variety of applications in which the combined base device may be utilized and increase its overall appeal and usefulness to potential end-users. 
This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that a secondary content item would be added to the primary content item at a location that is based on one of the scene boundaries to produce a modified primary content item so as to expand the number and variety of applications in which the combined base device may be utilized and increase its overall appeal and usefulness to potential end-users. In addition, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. to include causing transmission of the modified primary content item. This modification would have been prompted in order to enhance the combined base device of Mun et al. in view of Chu et al. with the notoriously well-known technique of transmitting a modified primary content item. Causing transmission of the modified primary content item would enhance the combined base device by enabling the modified primary content item to be easily and conveniently distributed to potential end-users for viewing thereby improving the overall appeal, usefulness and usability of the combined base device. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that the combined base device would be caused to transmit the modified primary content item so as to enable the modified primary content item to be easily and conveniently distributed to potential end-users for viewing. Therefore, it would have been obvious to combine Mun et al. in view of Chu et al. with additional teachings of Chu et al. and the notoriously well-known technique of causing transmission of a modified primary content item to obtain the invention as specified in claim 10.
- With regards to claim 11, Mun et al. disclose a method (Mun et al., Pg. 1 Abstract, Pg. 3 § 3 - Pg. 4 § 3.2, Pg. 10 § 5) comprising: receiving a sequence of frames of a primary content item; (Mun et al., Pg. 1 Abstract, Pg. 1 § 1 - Pg. 2 Third-Full Paragraph, Pg. 3 § 3.1 - Pg. 4 § 3.2, Pg. 7 § 4.1) generating information, based on applying a plurality of model pairs to frames of each of the plurality of shots, (Mun et al., Pgs. 4 - 5 § 3.2, Pg. 4 Figs. 2 & 3, Pg. 6 § 3.4, Pgs. 7 - 8 § 4.1, Pg. 14 § A) wherein each model pair comprises: a first model configured to identify areas of attention within a frame; (Mun et al., Pgs. 4 - 5 § 3.2, Pg. 4 Figs. 2 & 3, Pg. 6 § 3.4, Pgs. 7 - 8 § 4.1, Pg. 14 § A) and a second model configured to determine, based on the areas of attention, inter-shot relationships between frames of different shots. (Mun et al., Pg. 3 § 3.1 - Pg. 4 § 3.2, Pg. 4 Figs. 2 & 3, Pg. 6 § 3.4 - Pg. 7 § 4.1) Mun et al. fail to disclose explicitly a computing device. Pertaining to analogous art, Chu et al. disclose a method (Chu et al., Abstract) comprising: receiving, by a computing device, (Chu et al., Abstract, Fig. 1A, Pg. 1 ¶ 0022 - 0023, Pg. 2 ¶ 0038 - Pg. 3 ¶ 0040, Pg. 3 ¶ 0048, Pg. 4 ¶ 0056, Pg. 5 ¶ 0068 - Pg. 6 ¶ 0075) a sequence of frames of a primary content item; (Chu et al., Pg. 1 ¶ 0012, Pg. 2 ¶ 0035, Pg. 2 ¶ 0039 - Pg. 3 ¶ 0040, Pg. 3 ¶ 0042 - 0046 and 0049) determining the frames into a plurality of shots based on shot boundaries; (Chu et al., Figs. 1A - 2, Pg. 1 ¶ 0005 and 0008 - 0009, Pg. 3 ¶ 0042 and 0049) generating information, based on applying a plurality of model pairs to frames of each of the plurality of shots, (Chu et al., Abstract, Figs. 1A & 2, Pg. 1 ¶ 0005, 0015 and 0019 - 0021, Pg. 2 ¶ 0026 and 0036, Pg. 3 ¶ 0044 - 0046, Pg. 4 ¶ 0051 - 0055 and 0058 - 0059, Pg. 5 ¶ 0061 and 0066 - 0067) wherein each model pair comprises: a first model configured to identify areas of attention within a frame; (Chu et al., Abstract, Figs. 
1A, 2 & 3, Pg. 1 ¶ 0010 - 0015 and 0019, Pg. 4 ¶ 0057 - Pg. 5 ¶ 0062) and a second model configured to determine, based on the areas of attention, inter-shot relationships between frames of different shots. (Chu et al., Abstract, Figs. 1A & 2, Pg. 1 ¶ 0005 and 0020 - 0021, Pg. 2 ¶ 0026 and 0036, Pg. 3 ¶ 0044 - 0046, Pg. 4 ¶ 0051 - 0055, Pg. 5 ¶ 0066 - 0067) Mun et al. and Chu et al. are combinable because they are both directed towards image processing methods for video scene segmentation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Mun et al. with the teachings of Chu et al. This modification would have been prompted in order to enhance the base device of Mun et al. with the well-known and applicable technique Chu et al. applied to a comparable device. Utilizing a computing device to implement a method, as taught by Chu et al., would enhance the base device of Mun et al. by allowing for it to be implemented accurately and efficiently at high computational speed on computer architecture. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that a computing device would be utilized to perform the operations of the base device of Mun et al. so as to ensure that its operations are carried out accurately, efficiently and at high computational speed on computer architecture. Therefore, it would have been obvious to combine Mun et al. with Chu et al. to obtain the invention as specified in claim 11.
- With regards to claim 12, Mun et al. in view of Chu et al. disclose the method of claim 11. Mun et al. fail to disclose explicitly wherein the first model is configured to identify areas of attention among a plurality of patches divided from the frame. Pertaining to analogous art, Chu et al. disclose wherein the first model is configured to identify areas of attention among a plurality of patches divided from the frame. (Chu et al., Pg. 1 ¶ 0019, Pg. 5 ¶ 0060 - 0062) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with additional teachings of Chu et al. This modification would have been prompted in order to enhance the combined base device of Mun et al. in view of Chu et al. with the well-known and applicable technique Chu et al. applied to a comparable device. Identifying areas of attention among a plurality of patches divided from the frame, as taught by Chu et al., would enhance the combined base device by improving its ability to accurately and reliably identify similarities and differences between different shots of video since, in addition to the semantic content of the frames, the locations of the semantic content within the frames would be analyzed when evaluating the different shots for scene boundaries. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that areas of attention would be identified among a plurality of patches divided from the frame so as to improve the ability of the combined base device to accurately and reliably identify similarities and differences between different shots of video since, in addition to the semantic content of the frames, the locations of the semantic content within the frames would be analyzed when evaluating the different shots for scene boundaries. 
Therefore, it would have been obvious to combine Mun et al. in view of Chu et al. with additional teachings of Chu et al. to obtain the invention as specified in claim 12.
- With regards to claim 14, Mun et al. in view of Chu et al. disclose the method of claim 11, wherein the plurality of model pairs are configured to focus on different types of visual features. (Mun et al., Pg. 1 Abstract, Pg. 4 § 3.2 - Pg. 6 § 3.4, Pg. 4 Fig. 3, Pgs. 7 - 8 § 4.1) In addition, analogous art Chu et al. disclose wherein the plurality of model pairs are configured to focus on different types of visual features. (Chu et al., Abstract, Figs. 1A & 2, Pg. 1 ¶ 0005, 0015 and 0019 - 0021, Pg. 2 ¶ 0026 and 0036, Pg. 3 ¶ 0044 - 0046, Pg. 4 ¶ 0051 - 0055 and 0058 - 0059, Pg. 5 ¶ 0061 and 0066 - 0067)
- With regards to claim 15, Mun et al. in view of Chu et al. disclose the method of claim 11, further comprising: using a first model pair to focus on spatio-temporal patterns; (Mun et al., Pg. 1 Abstract, Pg. 4 § 3.2 - Pg. 6 § 3.4, Pg. 4 Fig. 3, Pgs. 7 - 8 § 4.1) and using a second model pair to focus on contextualized features. (Mun et al., Pg. 1 Abstract, Pgs. 4 - 5 § 3.2, Pg. 4 Fig. 3, Pg. 6 § 3.4, Pgs. 7 - 8 § 4.1) Mun et al. fail to disclose explicitly using a first model pair to focus on faces; and using a second model pair to focus on objects. Pertaining to analogous art, Chu et al. disclose using a first model pair to focus on faces; (Chu et al., Abstract, Figs. 1A & 2, Pg. 1 ¶ 0005 and 0019 - 0021, Pg. 2 ¶ 0026 and 0036, Pg. 3 ¶ 0044 - 0046, Pg. 4 ¶ 0051 - 0055, Pg. 5 ¶ 0061 and 0066 - 0067) and using a second model pair to focus on objects. (Chu et al., Abstract, Figs. 1A & 2, Pg. 1 ¶ 0005, 0015 and 0020 - 0021, Pg. 2 ¶ 0026 and 0036, Pg. 3 ¶ 0044 - 0046, Pg. 4 ¶ 0051 - 0055 and 0058, Pg. 5 ¶ 0061 and 0066 - 0067) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with additional teachings of Chu et al. This modification would have been prompted in order to enhance the combined base device of Mun et al. in view of Chu et al. with the well-known and applicable technique Chu et al. applied to a comparable device. Utilizing different models to focus on faces and objects, respectively, as taught by Chu et al., would enhance the combined base device by improving its ability to identify and analyze various semantic content of interest in a video when attempting to locate moments with a maximum semantic transition. Furthermore, this modification would have been prompted by the teachings and suggestions of Mun et al. 
that their method attempts to divide a sequence so that semantics, such as places and characters, maximally change, see at least page 2 figure 1, page 5 section 3.3 and page 5 figure 4 of Mun et al. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that different models focusing on faces and objects, respectively, would be utilized to improve the ability of the combined base device to locate moments in a video wherein semantics maximally change. Therefore, it would have been obvious to combine Mun et al. in view of Chu et al. with additional teachings of Chu et al. to obtain the invention as specified in claim 15.
- With regards to claim 16, Mun et al. in view of Chu et al. disclose the method of claim 11. Mun et al. fail to disclose explicitly adding a secondary content item to the primary content item at a location that is based on a scene boundary that is determined based on the inter-shot relationships determined by second models of the model pairs; and causing transmission of a modified primary content item comprising the added secondary content item. Pertaining to analogous art, Chu et al. disclose adding a secondary content item to the primary content item at a location that is based on a scene boundary that is determined based on the inter-shot relationships determined by second models of the model pairs; (Chu et al., Pg. 2 ¶ 0024) and a modified primary content item comprising the added secondary content item. (Chu et al., Pg. 2 ¶ 0024) Chu et al. fail to disclose explicitly causing transmission of the modified primary content item. However, the Examiner takes official notice of the fact that causing transmission of a modified primary content item is notoriously well-known in the art. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with additional teachings of Chu et al. This modification would have been prompted in order to enhance the combined base device of Mun et al. in view of Chu et al. with the well-known and applicable technique Chu et al. applied to a comparable device. 
Adding a secondary content item to the primary content item at a location that is based on one of the scene boundaries to produce a modified primary content item, as taught by Chu et al., would enhance the combined base device by allowing for it to utilize the scene boundaries it determined in a number of related video processing functions and/or applications, such as video advertisement and/or information insertion applications, so as to expand the number and variety of applications in which the combined base device may be utilized and increase its overall appeal and usefulness to potential end-users. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that a secondary content item would be added to the primary content item at a location that is based on one of the scene boundaries to produce a modified primary content item so as to expand the number and variety of applications in which the combined base device may be utilized and increase its overall appeal and usefulness to potential end-users. In addition, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. to include causing transmission of the modified primary content item. This modification would have been prompted in order to enhance the combined base device of Mun et al. in view of Chu et al. with the notoriously well-known technique of transmitting a modified primary content item. Causing transmission of the modified primary content item would enhance the combined base device by enabling the modified primary content item to be easily and conveniently distributed to potential end-users for viewing thereby improving the overall appeal, usefulness and usability of the combined base device. 
This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that the combined base device would be caused to transmit the modified primary content item so as to enable the modified primary content item to be easily and conveniently distributed to potential end-users for viewing. Therefore, it would have been obvious to combine Mun et al. in view of Chu et al. with additional teachings of Chu et al. and the notoriously well-known technique of causing transmission of a modified primary content item to obtain the invention as specified in claim 16.
Claims 2 - 5 are rejected under 35 U.S.C. 103 as being unpatentable over Mun et al., “Boundary-Aware Self-Supervised Learning For Video Scene Segmentation”, arXiv, arXiv:2201.05277v1, 14 Jan. 2022, pages 1 - 20, herein referred to as “Mun et al.”, in view of Chu et al. U.S. Publication No. 2019/0147105 A1 as applied to claim 1 above, and further in view of Li et al. U.S. Patent No. 12,008,788.
- With regards to claim 2, Mun et al. in view of Chu et al. disclose the method of claim 1, wherein the applying the first model to frames of the shot comprises: identifying visual similarities. (Mun et al., Pg. 1 Abstract, Pgs. 4 - 5 § 3.2, Pg. 6 § 3.4, Pgs. 7 - 8 § 4.1) Mun et al. fail to disclose explicitly dividing a frame into a plurality of patches; and identifying visual similarities between the patches of the frame. Pertaining to analogous art, Li et al. disclose dividing a frame into a plurality of patches; (Li et al., Abstract, Col. 2 Lines 31 - 64, Col. 5 Lines 22 - 51, Col. 9 Lines 55 - 67) and identifying visual similarities between the patches of the frame. (Li et al., Col. 3 Lines 29 - 65, Col. 5 Lines 52 - 63, Col. 8 Line 44 - Col. 9 Line 7, Col. 10 Lines 10 - 42, Col. 11 Line 27 - Col. 12 Line 11) Mun et al. in view of Chu et al. and Li et al. are combinable because they are all directed towards image processing methods for video scene understanding applications. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with the teachings of Li et al. This modification would have been prompted in order to substitute the vision transformer model of Li et al. for the first model of Mun et al. The vision transformer model of Li et al. could be substituted in place of the first model of Mun et al. utilizing well-known techniques in the art and would likely yield predictable results, in that in the combination the vision transformer model of Li et al. would be utilized to generate the representations of the shots. Therefore, it would have been obvious to combine Mun et al. 
in view of Chu et al. with Li et al. to obtain the invention as specified in claim 2.
- With regards to claim 3, Mun et al. in view of Chu et al. disclose the method of claim 1, wherein the applying the first model to frames of the shot comprises: identifying visual similarities. (Mun et al., Pg. 1 Abstract, Pgs. 4 - 5 § 3.2, Pg. 6 § 3.4, Pgs. 7 - 8 § 4.1) Mun et al. fail to disclose explicitly dividing a frame into a plurality of patches; and identifying positional relationships between patches, of the frame, that comprise visual similarities. Pertaining to analogous art, Li et al. disclose dividing a frame into a plurality of patches; (Li et al., Abstract, Col. 2 Lines 31 - 64, Col. 5 Lines 22 - 51, Col. 9 Lines 55 - 67) and identifying positional relationships between patches, of the frame, that comprise visual similarities. (Li et al., Col. 3 Lines 29 - 65, Col. 5 Lines 52 - 63, Col. 6 Lines 54 - 63, Col. 7 Lines 50 - 63, Col. 8 Line 44 - Col. 9 Line 7, Col. 10 Lines 10 - 42, Col. 11 Line 27 - Col. 12 Line 11) Mun et al. in view of Chu et al. and Li et al. are combinable because they are all directed towards image processing methods for video scene understanding applications. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with the teachings of Li et al. This modification would have been prompted in order to substitute the vision transformer model of Li et al. for the first model of Mun et al. The vision transformer model of Li et al. could be substituted in place of the first model of Mun et al. utilizing well-known techniques in the art and would likely yield predictable results, in that in the combination the vision transformer model of Li et al. would be utilized to generate the representations of the shots. Therefore, it would have been obvious to combine Mun et al. in view of Chu et al. with Li et al. to obtain the invention as specified in claim 3.
- With regards to claim 4, Mun et al. in view of Chu et al. disclose the method of claim 1, wherein the applying the first model to frames of the shot comprises: identifying visual similarities between different frames in the first shot. (Mun et al., Pg. 1 Abstract, Pgs. 4 - 5 § 3.2, Pg. 6 § 3.4, Pgs. 7 - 8 § 4.1) Mun et al. fail to disclose explicitly dividing each frame, of a plurality of frames in a first shot, into a plurality of patches; and identifying visual similarities between patches of different frames. Pertaining to analogous art, Li et al. disclose dividing each frame, of a plurality of frames in a first shot, into a plurality of patches; (Li et al., Abstract, Col. 2 Lines 31 - 64, Col. 3 Lines 29 - 51, Col. 5 Lines 22 - 51, Col. 9 Lines 55 - 67) and identifying visual similarities between patches of different frames in the first shot. (Li et al., Col. 3 Lines 29 - 65, Col. 5 Lines 52 - 63, Col. 6 Lines 54 - 63, Col. 7 Lines 50 - 63, Col. 8 Line 44 - Col. 9 Line 7, Col. 10 Lines 10 - 42, Col. 11 Line 27 - Col. 12 Line 11) Mun et al. in view of Chu et al. and Li et al. are combinable because they are all directed towards image processing methods for video scene understanding applications. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with the teachings of Li et al. This modification would have been prompted in order to substitute the vision transformer model of Li et al. for the first model of Mun et al. The vision transformer model of Li et al. could be substituted in place of the first model of Mun et al. utilizing well-known techniques in the art and would likely yield predictable results, in that in the combination the vision transformer model of Li et al. would be utilized to generate the representations of the shots. 
Therefore, it would have been obvious to combine Mun et al. in view of Chu et al. with Li et al. to obtain the invention as specified in claim 4.
- With regards to claim 5, Mun et al. in view of Chu et al. disclose the method of claim 1, wherein the applying the first model to frames of the shot comprises: identifying representations that: are visually similar; (Mun et al., Pg. 1 Abstract, Pgs. 4 - 5 § 3.2, Pg. 6 § 3.4, Pgs. 7 - 8 § 4.1) and are of different frames in the first shot. (Mun et al., Pg. 1 Abstract, Pgs. 4 - 5 § 3.2, Pg. 6 § 3.4, Pgs. 7 - 8 § 4.1) Mun et al. fail to disclose explicitly dividing each frame, of a plurality of frames in a first shot, into a plurality of patches; and identifying positional relationships between patches that: are visually similar; and are of different frames in the first shot. Pertaining to analogous art, Li et al. disclose dividing each frame, of a plurality of frames in a first shot, into a plurality of patches; (Li et al., Abstract, Col. 2 Lines 31 - 64, Col. 3 Lines 29 - 51, Col. 5 Lines 22 - 51, Col. 9 Lines 55 - 67) and identifying positional relationships between patches that: are visually similar; and are of different frames in the first shot. (Li et al., Col. 3 Lines 29 - 65, Col. 5 Lines 52 - 63, Col. 6 Lines 54 - 63, Col. 7 Lines 50 - 63, Col. 8 Line 44 - Col. 9 Line 7, Col. 10 Lines 10 - 42, Col. 11 Line 27 - Col. 12 Line 11) Mun et al. in view of Chu et al. and Li et al. are combinable because they are all directed towards image processing methods for video scene understanding applications. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with the teachings of Li et al. This modification would have been prompted in order to substitute the vision transformer model of Li et al. for the first model of Mun et al. The vision transformer model of Li et al. could be substituted in place of the first model of Mun et al. 
utilizing well-known techniques in the art and would likely yield predictable results, in that in the combination the vision transformer model of Li et al. would be utilized to generate the representations of the shots. Therefore, it would have been obvious to combine Mun et al. in view of Chu et al. with Li et al. to obtain the invention as specified in claim 5.
Claims 6 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Mun et al., “Boundary-Aware Self-Supervised Learning For Video Scene Segmentation”, arXiv, arXiv:2201.05277v1, 14 Jan. 2022, pages 1 - 20, herein referred to as “Mun et al.”, in view of Chu et al. U.S. Publication No. 2019/0147105 A1 as applied to claims 1 and 11 above, and further in view of Aoki et al. U.S. Publication No. 2009/0052783 A1.
- With regards to claim 6, Mun et al. in view of Chu et al. disclose the method of claim 1. Mun et al. fail to disclose explicitly using the second model to generate information indicating a positional relationship of a common object found in sequential frames of different shots. Pertaining to analogous art, Aoki et al. disclose using the second model to generate information indicating a positional relationship of a common object found in sequential frames of different shots. (Aoki et al., Abstract, Figs. 10 - 13, Pg. 2 ¶ 0032 - 0033, Pg. 3 ¶ 0038 - 0046, Pg. 4 ¶ 0051 - 0060, Pg. 5 ¶ 0072 - 0075) Mun et al. in view of Chu et al. and Aoki et al. are combinable because they are all directed towards image processing methods for video scene segmentation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with the teachings of Aoki et al. This modification would have been prompted in order to enhance the combined base device of Mun et al. in view of Chu et al. with the well-known and applicable technique Aoki et al. applied to a comparable device. Generating information indicating a positional relationship of a common object found in sequential frames of different shots, as taught by Aoki et al., would enhance the combined base device by improving its ability to accurately and reliably identify similarities and differences between different shots of video since, in addition to the semantic content of the frames, the locations of the semantic content within the frames of the different shots would be analyzed when evaluating the different shots for scene boundaries. 
This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that information indicating a positional relationship of a common object found in sequential frames of different shots would be generated, thereby improving the ability of the combined base device to accurately and reliably identify similarities and differences between different shots of video. Therefore, it would have been obvious to combine Mun et al. in view of Chu et al. with Aoki et al. to obtain the invention as specified in claim 6.
- With regards to claim 13, Mun et al. in view of Chu et al. disclose the method of claim 11. Mun et al. fail to disclose explicitly wherein the second model is configured to generate information indicating a positional relationship of a common object found in sequential frames of different shots. Pertaining to analogous art, Aoki et al. disclose wherein the second model is configured to generate information indicating a positional relationship of a common object found in sequential frames of different shots. (Aoki et al., Abstract, Figs. 10 - 13, Pg. 2 ¶ 0032 - 0033, Pg. 3 ¶ 0038 - 0046, Pg. 4 ¶ 0051 - 0060, Pg. 5 ¶ 0072 - 0075) Mun et al. in view of Chu et al. and Aoki et al. are combinable because they are all directed towards image processing methods for video scene segmentation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Mun et al. in view of Chu et al. with the teachings of Aoki et al. This modification would have been prompted in order to enhance the combined base device of Mun et al. in view of Chu et al. with the well-known and applicable technique Aoki et al. applied to a comparable device. Generating information indicating a positional relationship of a common object found in sequential frames of different shots, as taught by Aoki et al., would enhance the combined base device by improving its ability to accurately and reliably identify similarities and differences between different shots of video since, in addition to the semantic content of the frames, the locations of the semantic content within the frames of the different shots would be analyzed when evaluating the different shots for scene boundaries. 
This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that information indicating a positional relationship of a common object found in sequential frames of different shots would be generated, thereby improving the ability of the combined base device to accurately and reliably identify similarities and differences between different shots of video. Therefore, it would have been obvious to combine Mun et al. in view of Chu et al. with Aoki et al. to obtain the invention as specified in claim 13.
Claims 17 - 19 are rejected under 35 U.S.C. 103 as being unpatentable over Rao et al., “A Local-to-Global Approach to Multi-modal Movie Scene Segmentation”, arXiv, arXiv:2004.02678v3, 28 Apr. 2020, pages 1 - 10, herein referred to as “Rao et al.”, in view of Houlsby et al. U.S. Publication No. 2022/0108478 A1.
- With regards to claim 17, Rao et al. disclose a method (Rao et al., Pg. 1 Abstract, Pg. 4 § 4, Pg. 8 § 6) comprising: receiving intra-shot information indicating areas of attention in frames of each of a plurality of shots of a content item; (Rao et al., Pg. 2 Left-Hand Column First-Full Paragraph - Second-Full Paragraph, Pg. 4 § 4 - § 4.3) generating, based on the intra-shot information, inter-shot information indicating visual relationships between frames of different shots of a same content item; (Rao et al., Pg. 4 § 4.2 - § 4.4) and sending the inter-shot information to a prediction model for identifying scene boundaries within the content item. (Rao et al., Pg. 4 § 4 and § 4.2 - § 4.4, Pg. 5 Fig. 3) Rao et al. fail to disclose explicitly a computing device. Pertaining to analogous art, Houlsby et al. disclose receiving, by a computing device, (Houlsby et al., Abstract, Pg. 1 ¶ 0017 - Pg. 2 ¶ 0018, Pg. 5 ¶ 0063 - 0064, Pg. 8 ¶ 0101, Pg. 11 ¶ 0143 - 0146, Pg. 12 ¶ 0148 - 0151) information indicating areas of attention in frames of a content item. (Houlsby et al., Abstract, Pg. 2 ¶ 0022 - 0025, Pg. 4 ¶ 0054 - 008, Pg. 5 ¶ 0066 - Pg. 6 ¶ 0072, Pg. 6 ¶ 0078 - Pg. 7 ¶ 0083, Pg. 8 ¶ 0097 - 0099) Rao et al. and Houlsby et al. are combinable because they are both directed towards image and/or video processing methods that generate one or more representations that characterize an image or video. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Rao et al. with the teachings of Houlsby et al. This modification would have been prompted in order to enhance the base device of Rao et al. with the well-known and applicable technique Houlsby et al. applied to a similar device. Utilizing a computing device to implement a method, as taught by Houlsby et al., would enhance the base device of Rao et al. 
by allowing it to be implemented accurately and efficiently at high computational speed on computer architecture. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that a computing device would be utilized to perform the operations of the base device of Rao et al., ensuring that those operations are carried out accurately and efficiently. Therefore, it would have been obvious to combine Rao et al. with Houlsby et al. to obtain the invention as specified in claim 17.
- With regards to claim 18, Rao et al. in view of Houlsby et al. disclose the method of claim 17, further comprising applying a model to the frames of the content item, (Rao et al., Pg. 4 § 4 - § 4.1) and providing output from the model to a gated state space model. (Rao et al., Pg. 4 § 4.2 - § 4.3, Pg. 5 Fig. 3 [“a Bi-LSTM”]) Rao et al. fail to disclose explicitly a self-attention model. Pertaining to analogous art, Houlsby et al. disclose applying a self-attention model to the frames of the content item, (Houlsby et al., Abstract, Pg. 2 ¶ 0022 - 0025, Pg. 4 ¶ 0054 - 008, Pg. 5 ¶ 0066 - Pg. 6 ¶ 0072, Pg. 6 ¶ 0078 - Pg. 7 ¶ 0083, Pg. 8 ¶ 0097 - 0099) and providing output from the self-attention model to a model. (Houlsby et al., Figs. 1 & 4, Pg. 4 ¶ 0056 - 0059, Pg. 5 ¶ 0060 - 0063, Pg. 7 ¶ 0090, Pg. 8 ¶ 0105 - 0106) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Rao et al. in view of Houlsby et al. with additional teachings of Houlsby et al. This modification would have been prompted in order to substitute the self-attention model(s) of Houlsby et al. for the model of Rao et al. The self-attention model(s) of Houlsby et al. could be substituted in place of the model of Rao et al. utilizing well-known techniques in the art and would likely yield predictable results, in that in the combination the self-attention model(s) of Houlsby et al. would be utilized to generate the representation(s) characterizing the frame(s) of the content item. Therefore, it would have been obvious to combine Rao et al. in view of Houlsby et al. with additional teachings of Houlsby et al. to obtain the invention as specified in claim 18.
- With regards to claim 19, Rao et al. in view of Houlsby et al. disclose the method of claim 17, further comprising applying a plurality of different models to the frames of the content item, (Rao et al., Pg. 4 § 4 - § 4.1, Pg. 5 Fig. 3) wherein the different models are configured to focus on different types of visual features; (Rao et al., Pg. 4 § 4 - § 4.1, Pg. 5 Fig. 3) and wherein each of the different models is configured to provide output to a corresponding gated state space model. (Rao et al., Pg. 4 § 4.2 - § 4.3, Pg. 5 Fig. 3 [“a Bi-LSTM”]) Rao et al. fail to disclose explicitly different self-attention models. Pertaining to analogous art, Houlsby et al. disclose applying a plurality of different self-attention models to the frames of the content item, (Houlsby et al., Abstract, Pg. 2 ¶ 0022 - 0025, Pg. 4 ¶ 0054 - 008, Pg. 5 ¶ 0066 - Pg. 6 ¶ 0072, Pg. 6 ¶ 0078 - Pg. 7 ¶ 0083, Pg. 8 ¶ 0097 - 0099) wherein the different self-attention models are configured to focus on different types of visual features; (Houlsby et al., Abstract, Pg. 2 ¶ 0022 - 0025, Pg. 4 ¶ 0054 - 008, Pg. 5 ¶ 0066 - Pg. 6 ¶ 0072, Pg. 6 ¶ 0078 - Pg. 7 ¶ 0083, Pg. 8 ¶ 0097 - 0099) and wherein each of the different self-attention models is configured to provide output to a corresponding model. (Houlsby et al., Figs. 1 & 4, Pg. 4 ¶ 0056 - 0059, Pg. 5 ¶ 0060 - 0063, Pg. 6 ¶ 0079, Pg. 7 ¶ 0090, Pg. 8 ¶ 0105 - 0106) It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Rao et al. in view of Houlsby et al. with additional teachings of Houlsby et al. This modification would have been prompted in order to substitute the plurality of different self-attention models of Houlsby et al. for the plurality of different models of Rao et al. The plurality of different self-attention models of Houlsby et al. could be substituted in place of the plurality of different models of Rao et al. utilizing well-known techniques in the art and would likely yield predictable results, in that in the combination the plurality of different self-attention models of Houlsby et al. would be utilized to generate the representation(s) characterizing the frame(s) of the content item. Therefore, it would have been obvious to combine Rao et al. in view of Houlsby et al. with additional teachings of Houlsby et al. to obtain the invention as specified in claim 19.
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Rao et al., “A Local-to-Global Approach to Multi-modal Movie Scene Segmentation”, arXiv, arXiv:2004.02678v3, 28 Apr. 2020, pages 1 - 10, herein referred to as “Rao et al.”, in view of Houlsby et al. U.S. Publication No. 2022/0108478 A1 as applied to claim 17 above, and further in view of Chu et al. U.S. Publication No. 2019/0147105 A1.
- With regards to claim 20, Rao et al. in view of Houlsby et al. disclose the method of claim 17, further comprising executing the prediction model to use the scene boundaries to generate segments of the content item. (Rao et al., Pg. 1 Abstract, Pg. 2 Left-Hand Column First-Full Paragraph - Third-Full Paragraph, Pg. 2 Right-Hand Column Third-Full Paragraph, Pgs. 4 - 5 § 4.3 - § 4.4, Pg. 5 Fig. 3, Pg. 8 § 6) Rao et al. fail to disclose explicitly the computing device and controlling playback of the content item based on the segments. Pertaining to analogous art, Chu et al. disclose executing, by the computing device, (Chu et al., Abstract, Fig. 1A, Pg. 1 ¶ 0022 - 0023, Pg. 2 ¶ 0038 - Pg. 3 ¶ 0040, Pg. 3 ¶ 0048, Pg. 4 ¶ 0056, Pg. 5 ¶ 0068 - Pg. 6 ¶ 0075) the prediction model to use the scene boundaries to generate segments of the content item, (Chu et al., Abstract, Figs. 1A - 2, Pg. 1 ¶ 0005 - 0006 and 0020 - 0021, Pg. 2 ¶ 0024 - 0026, 0035 - 0036 and 0039, Pg. 3 ¶ 0044 - 0047, Pg. 4 ¶ 0052 - 0055, Pg. 5 ¶ 0065 - 0067) and to control playback of the content item based on the segments. (Chu et al., Pg. 2 ¶ 0024) Rao et al. in view of Houlsby et al. and Chu et al. are combinable because they are all directed towards image and/or video processing methods that generate one or more representations that characterize an image or video and, similar to Rao et al., Chu et al. is also directed towards video scene segmentation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Rao et al. in view of Houlsby et al. with the teachings of Chu et al. This modification would have been prompted in order to enhance the combined base device of Rao et al. in view of Houlsby et al. with the well-known and applicable technique Chu et al. applied to a comparable device. 
Controlling playback of the content item based on the segments, as taught by Chu et al., would enhance the combined base device by allowing it to utilize the scene boundaries it determined in a number of related video processing functions and/or applications, such as smart fast-forwarding applications, so as to expand the number and variety of applications in which the combined base device may be utilized and increase its overall appeal and usefulness to potential end-users. This combination could be completed according to well-known techniques in the art and would likely yield predictable results, in that playback of the content item would be controlled based on the segments. Therefore, it would have been obvious to combine Rao et al. in view of Houlsby et al. with Chu et al. to obtain the invention as specified in claim 20.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Carlson et al. U.S. Publication No. 2018/0082127 A1; which is directed towards video segmentation systems and methods, wherein features extracted from adjacent frames of a video are compared to determine shot boundaries and key frames extracted from different shots of the video are utilized to group the different shots of the video into scenes.
Chen et al. U.S. Patent No. 11,776,273; which is directed towards systems and methods for automatic scene change detection, wherein a video is partitioned into a plurality of shots and an ensemble of machine learning models is utilized to infer one or more scene changes in the video based on the plurality of shots.
Huang et al. U.S. Publication No. 2012/0242900 A1; which is directed towards methods and devices for inserting secondary content into media streams, wherein a media stream is divided into a plurality of shots, the plurality of shots are grouped into a plurality of scenes and secondary content is inserted into the media stream at a boundary point between consecutive scenes.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC RUSH whose telephone number is (571) 270-3017. The examiner can normally be reached 9am - 5pm Monday - Friday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Bee, can be reached at (571) 270-5183. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ERIC RUSH/Primary Examiner, Art Unit 2677