DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
This Final Office Action is in response to the applicant’s remarks and arguments filed on 12/30/2025.
Claims 1-20, filed on 12/30/2025, are being considered on the merits.
Claims 1-20 remain pending in the application.
In response to the last Office Action:
No claims were amended.
The rejection of claims 1-20 under 35 U.S.C. § 101 as being directed to an abstract idea, previously set forth in the Non-Final Office Action mailed on 10/01/2025, is maintained and updated below for reference.
Response to Arguments
The applicant’s remarks and/or arguments, filed on 12/30/2025, have been fully considered.
The examiner is entitled to give claim limitations their broadest reasonable interpretation in light of the specification. See MPEP 2111 [R-1], Interpretation of Claims – Broadest Reasonable Interpretation. The applicant always has the opportunity to amend the claims during prosecution, and broad interpretation by the examiner reduces the possibility that the claim, once issued, will be interpreted more broadly than is justified. In re Prater, 162 USPQ 541, 550-51 (CCPA 1969).
Regarding the rejection of the claims under 35 U.S.C. § 101, the applicant’s arguments below, found on pages 6-7 of the remarks filed on 12/30/2025, have been fully considered but are not persuasive.
Applicant stated: “Applicant respectfully asserts that any abstract idea in the claims is integrated with a practical application, for example in an improvement to a technology. MPEP 2106.04(d)(1) sets out two requirements for showing an improvement to a technology.”, … “The present specification describes an improvement at least in paragraph 19, which states: Rather using a fixed set of modes to extract all textual information from the video up-front, a subset of the methods may be used initially…” …, “The claims reflect this improvement. For example, claim 1 recites, “selecting a subset of the plurality of clips that are relevant to a query based on the first textual descriptions,” and “generating additional textual descriptions for the selected subset using a second vision model.” These elements reflect the use of a second model on a subset of the video clips based on relevance to a query. The present specification and claims therefore meet both requirements of MPEP 2106.04(d)(1). Notably these are the only requirements set out for showing an improvement to a technology, and thus no further analysis is needed. Because the claims as a whole reflect the improvement to a technology that is set out in the specification, they integrate any abstract idea therein with a practical application.”
The examiner respectfully disagrees with Applicant’s aforementioned remarks. The examiner points out that, under the analysis of independent claim 1 (and similarly claim 11) with respect to the rejection under 35 U.S.C. § 101 as being directed to an abstract idea, other than the claims’ recitation of a “system”, “computer”, “hardware processor” and/or “memory”, nothing in the claim elements precludes the steps from practically being performed in the human mind.
Applicant argued that the aforementioned claim(s) recite the following step: “selecting a subset of the plurality of clips that are relevant to a query based on the first textual descriptions.” This step describes a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind, but for the recitation of the generic computer components mentioned above. For example, given some information at hand, a person is mentally (or with the aid of pen and paper) capable of evaluating that information, comparing it to another set of information (i.e., the query), and then selecting the relevant information among the two sets, which again describes a mental process.
Furthermore, the aforementioned claim(s) recite the following argued limitation: “generating additional textual descriptions for the selected subset using a second vision model.” Again, this step describes a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind, but for the recitation of generic computer components. That is, other than reciting a “system”, “computer”, “hardware processor” and/or “memory”, nothing in the claim elements precludes the steps from practically being performed in the human mind. For example, given some video clips/pictures at hand, a person is mentally (or with the aid of pen and paper) capable of evaluating this visible media material and providing a textual description of it according to certain criteria or yet another vision model (recited at a high level), which is a mental process.
The Applicant further argued that: “The present specification describes an improvement at least in paragraph 19, which states: Rather using a fixed set of modes to extract all textual information from the video up-front, a subset of the methods may be used initially…” …, “These elements reflect the use of a second model on a subset of the video clips based on relevance to a query.”
The examiner respectfully disagrees with Applicant’s aforementioned remarks. The examiner points out that MPEP 2106.04(a)(2), under “(D) Both Product and Process Claims May Recite a Mental Process”, provides examples of product claims reciting mental processes, including: an application program interface for extracting and processing information from a diversity of types of hard copy documents – Content Extraction, 776 F.3d at 1345, 113 USPQ2d at 1356.
Further, the Applicant argues that “Because the claims as a whole reflect the improvement to a technology that is set out in the specification, they integrate any abstract idea therein with a practical application.” …, “Notably these are the only requirements set out for showing an improvement to a technology, and thus no further analysis is needed.” The examiner respectfully disagrees with Applicant’s conclusion that Prong Two is satisfied because the claim recites specific structural, procedural, and technical features that constitute a practical application of any alleged judicial exception. Rather, Step 2A, Prong Two focuses on whether the combination of additional elements with the abstract idea integrates the idea into a practical application. The MPEP explains that “because a judicial exception is not eligible subject matter, ... if there are no additional claim elements besides the judicial exception, or if the additional claim elements merely recite another judicial exception, that is insufficient to integrate the judicial exception into a practical application.” MPEP § 2106.04(II)(A)(2); see, e.g., RecogniCorp, LLC v. Nintendo Co., 855 F.3d 1322, 1327 (Fed. Cir. 2017) (“Adding one abstract idea ... to another abstract idea ... does not render the claim non-abstract.”) (“[M]erely combining abstract ideas or adding conventional computer elements to an abstract idea does not confer patent eligibility.”).
Additionally, the Applicant references paragraph [0019] of the instant application’s specification, asserting that “The present specification describes an improvement at least in paragraph 19…” The examiner notes that the referenced paragraph recites steps at a high level of generality, and the courts have consistently held that merely combining abstract ideas or adding conventional computer elements to an abstract idea does not confer patent eligibility. See Alice, 573 U.S. 208; Mayo, 566 U.S. 66; Bilski v. Kappos, 561 U.S. 593; Dealertrack, Inc. v. Huber, 674 F.3d 1315 (Fed. Cir. 2012).
Accordingly, the claim(s) are not patent eligible.
Please see the detailed analysis set forth under the 35 U.S.C. § 101 rejection below.
The applicant’s arguments below regarding claims 1, 7, 11, and 17, found on pages 8-9 of the remarks filed on 12/30/2025, have been fully considered but are not persuasive.
Applicant stated: “The rejection cites Zhao as teaching the selection in its discussion of analyzing visual features of video frames 202a-202c. However, there does not appear to be any selectivity in Zhao. The rejection cites the context-dependent layers 306 as reading on the first model and the posterior layers 309 as reading on the second model. As shown in Zhao's FIG. 3, the context-dependent layers 306 feed directly into the posterior layers 309. There is nothing in the reference which suggests any determination of relevance between the video clips, and nothing that suggests a selection of a subset of video clips on the basis of relevance.”
Regarding the aforementioned claim limitations, the examiner respectfully disagrees. The examiner asserts that the aforementioned limitation of independent claims 1 and 11, as drafted and given its broadest reasonable interpretation, is disclosed by the cited prior art to Zhao. In particular, and as cited in the last Office Action mailed on 10/01/2025, Zhao discloses in Para. [0038]: “…, the query-response-neural network architecture of the query-response system increases the amount and format of data representing context for answering a question during a video. In particular, the query-response system analyzes both visual features (e.g., uncaptioned visual context) corresponding to a particular video segment and textual features (e.g., textual context) for the video segment and the question itself.”
The examiner asserts that the reference’s disclosure of “for answering a question during a video. In particular, the query-response system analyzes both visual features (e.g., uncaptioned visual context) corresponding to a particular video segment and textual features (e.g., textual context) …” corresponds to the claimed language “selecting a subset of the plurality of clips that are relevant to a query based on the first textual descriptions.” According to Zhao, the system is “answering a question during a video”, wherein “the query-response system analyzes both visual features (e.g., uncaptioned visual context) corresponding to a particular video segment …”; and in Para. [0041]: “…, the present disclosure utilizes a variety of terms to describe features and benefits of the query-response system. For example, as used herein, the term “video segment” refers to a group of video frames from a digital video. In particular, a video segment can include a set of video frames that corresponds to a particular portion of a video. Such video frames may correspond to a time before, during, and/or after the query-response system receives a question from a user regarding the video.”
Regarding the rejection of claim 7 and claim 17, Applicant stated: “The rejection asserts that Zhao teaches this feature in its discussion of using response-network to generate candidate-response vectors, further citing a discussion of similarity to match textual-feature embeddings and a given training-sample-textual feature embedding. …, neither the textual-feature embeddings nor the training-sample-textual feature embedding can be reasonably interpreted as reading on an embedded query.”
Regarding the aforementioned claim limitations, the examiner respectfully disagrees. The examiner asserts that the aforementioned limitation of dependent claims 7 and 17, as drafted and given its broadest reasonable interpretation, is disclosed by the cited prior art to Zhao. In particular, and as cited in the last Office Action mailed on 10/01/2025, Zhao discloses in Para. [0046]: “…, the response-network layers generate the candidate-response vectors utilizing pre-trained vectors based on external domain knowledge to modify, weight, and/or filter candidate responses to a question. …, the response-network layers can learn embeddings of the candidate responses to generate the candidate-response vectors.”; and in Para. [0051]: “…, the term “similarity” as used herein refers to a relationship or likeness between vectors, embeddings, or other features. In particular, the query-response system can determine, based on a comparison, a similarity between textual-feature embeddings and training-sample-textual-feature embeddings …, the query-response system can use a threshold similarity to determine whether textual-feature embedding(s) and a given training-sample-textual feature embedding are a match.”
Additionally, Zhao discloses early in the reference, in Para. [0025]: “…, the query-response system can utilize various network layers of a query-response-neural network to select a response to a user question received during playback of a video segment. In question-network layers of the query-response-neural network, for instance, the query-response system can apply an encoder to analyze and represent features of the question. …, the query-response system extracts a query vector from a transcribed or written version of the question as a vector representation of the question. To do so, the query-response system may convert the question from uttered speech to digital text (e.g., via a speech-to-text mechanism) or receive an electronic message from a client device. Based on the digital text of the question, the question-network layers can transform the question into one or more word embeddings or other formats as query vectors.”
The examiner asserts that the reference’s disclosure that “the question-network layers can transform the question into one or more word embeddings or other formats as query vectors” corresponds to the claimed language of claim 7 and, similarly, claim 17.
Please see the 35 U.S.C. § 102 and 35 U.S.C. § 103 rejections set forth below for further details.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1: The claims are directed to a method and a system, wherein the claimed process recites steps for pre-processing clips of an input video using a first vision model to generate respective first textual descriptions for the clips. A subset of the clips that are relevant to a query is selected based on the first textual descriptions. Additional textual descriptions are generated for the selected subset using a second vision model. The query is answered using the additional textual descriptions.
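For illustration of the examiner’s understanding only, the claimed process summarized above may be sketched as follows; all function, variable, and model names (e.g., first_model, second_model, language_model, is_relevant) are hypothetical and are not drawn from the claims or the specification:

```python
# Non-limiting sketch of the examiner's understanding of the claimed process.
# All names are hypothetical; this is not the applicant's implementation.

def answer_video_query(clips, query, first_model, second_model, language_model):
    # Pre-process every clip with the first vision model to generate
    # respective first textual descriptions.
    first_descriptions = {cid: first_model.describe(clip)
                          for cid, clip in clips.items()}

    # Select the subset of clips relevant to the query based on the first
    # textual descriptions (hypothetical keyword-overlap relevance test).
    def is_relevant(text):
        return any(word in text.lower() for word in query.lower().split())

    subset = [cid for cid, text in first_descriptions.items() if is_relevant(text)]

    # Generate additional textual descriptions for the selected subset
    # using the second vision model.
    additional = [second_model.describe(clips[cid]) for cid in subset]

    # Answer the query using the additional textual descriptions.
    return language_model.answer(query, additional)
```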
Step 2A – Prong One – The claims recite an abstract idea
Independent claims 1 and 11 are directed to an abstract idea without significantly more.
Independent claim 1 (and similarly claim 11) recites: “pre-processing a plurality of clips of an input video using a first vision model to generate respective first textual descriptions for the plurality of clips”, which is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind, but for the recitation of generic computer components. That is, other than reciting a “system”, “computer”, “hardware processor” and/or “memory”, nothing in the claim elements precludes the steps from practically being performed in the human mind. For example, given some video clips/pictures at hand, a person is mentally (or with the aid of pen and paper) capable of evaluating this visible media material and providing a textual description of it according to certain criteria or a vision model (recited at a high level), which is a mental process.
Further, the aforementioned claim(s) recite the following step: “selecting a subset of the plurality of clips that are relevant to a query based on the first textual descriptions.” This step describes a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind, but for the recitation of generic computer components. That is, other than reciting a “system”, “computer”, “hardware processor” and/or “memory”, nothing in the claim elements precludes the steps from practically being performed in the human mind. For example, given some information at hand, a person is mentally (or with the aid of pen and paper) capable of evaluating that information, comparing it to another set of information (i.e., the query), and then selecting the relevant information among the two sets, which again describes a mental process.
Additionally, the aforementioned claim(s) recite the following step: “generating additional textual descriptions for the selected subset using a second vision model.” Again, this step describes a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind, but for the recitation of generic computer components. That is, other than reciting a “system”, “computer”, “hardware processor” and/or “memory”, nothing in the claim elements precludes the steps from practically being performed in the human mind. For example, given some video clips/pictures at hand, a person is mentally (or with the aid of pen and paper) capable of evaluating this visible media material and providing a textual description of it according to certain criteria or yet another vision model (recited at a high level), which is a mental process.
As explained above, the processes of “pre-processing a plurality of clips …, to generate respective first textual descriptions …”, “selecting a subset of the plurality of …” and “generating additional textual descriptions for the selected subset …” are nothing more than an abstract idea.
Consequently, if a claim limitation, under its broadest reasonable interpretation, covers an abstract idea comprising a series of mental steps, but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of “Abstract Ideas”. Accordingly, the aforementioned claim(s) recite an abstract idea.
Step 2A – Prong Two - The abstract idea is not integrated into a practical application
This judicial exception is not integrated into a practical application. In particular, the aforementioned claims recite the additional limitation of “answering the query using the additional textual descriptions.” The process of “answering the query …” is considered an insignificant extra-solution activity of mere data transmission, where extra-solution activity includes both pre-solution and post-solution activity. For example, the courts have held that the use of a computer or other machinery in its ordinary capacity for economic or other tasks (e.g., to receive, store, or transmit data), or simply adding a general-purpose computer or computer components after the fact to an abstract idea, does not integrate a judicial exception into a practical application or provide significantly more. See MPEP 2106.05(f) and MPEP 2106.05(g).
The additional elements recited in the aforementioned claim(s) are: “system”, “computer”, “hardware processor” and/or “memory”. The additional elements of using a computer, storage device(s), and processor(s) to obtain information, analyze information, and manipulate information amount to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.
The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception. See MPEP 2106.05(f).
Step 2B:
The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The insignificant extra-solution activity identified above, i.e., the data-transmission activity (“answering the query using the additional textual descriptions”), is recognized by the courts as a well-understood, routine, and conventional activity when claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity. See MPEP 2106.05(d)(II): (i) receiving or transmitting data over a network, e.g., using the Internet to gather data, buySAFE, Inc. v. Google, Inc., 765 F.3d 1350, 1355, 112 USPQ2d 1093, 1096 (Fed. Cir. 2014) (computer receives and sends information over a network); (v) presenting offers and gathering statistics, OIP Techs., 788 F.3d at 1362-63, 115 USPQ2d at 1092-93.
Additionally, the system, processor, and storage device(s) are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer component and cannot provide an inventive concept.
Thus, there are no additional elements that amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that any combination of elements improves the functioning of a computer or improves any other technology.
The claim(s) are not patent eligible.
Claim 2 is dependent on claim 1 and includes all the limitations of claim 1. The claim recites the additional limitation of “wherein the second vision model has more parameters than the first model.” This additional limitation elaborates on the abstract idea described above, where a person can apply additional criteria or parameters when evaluating the image being examined/processed, and does not amount to significantly more than the abstract idea.
Claim 3 is dependent on claim 1 and includes all the limitations of claim 1. The claim recites the additional limitation of “wherein generating the additional textual descriptions is repeated until a language model is able to answer the query based on the additional textual descriptions.” This additional limitation elaborates on the abstract idea described above, wherein a person continues the image/clip evaluation process to produce a textual description of the media at hand, compares this information to another set of information (i.e., the query), and then arrives at a conclusion relevant to the two sets of information (the textual description against the query request), which does not amount to significantly more than the abstract idea.
Claim 4 is dependent on claim 3 and includes all the limitations of claim 3. Further, the claim recites the additional limitation of “wherein each repetition of generating the additional textual descriptions instructs the second vision model to generate additional tokens that continue a previous output.” This additional limitation elaborates further on the abstract idea described above, wherein a person continues the image/clip evaluation process to produce additional descriptive information about the media, which does not amount to significantly more than the abstract idea.
Claim 5 is dependent on claim 4 and includes all the limitations of claim 4. Further, the claim recites the additional limitation of “wherein the instruction to generate additional tokens includes increasing a maximum output token limit.” This additional limitation elaborates further on the abstract idea described above, wherein a person continues the image/clip evaluation process to produce additional descriptive information about the media, subject to some limiting criteria, which does not amount to significantly more than the abstract idea.
Claim 6 is dependent on claim 3 and includes all the limitations of claim 3. Further, the claim recites the additional limitation of “prompting the language model to determine whether the language model can answer the query using the first textual descriptions and the additional textual descriptions.” This additional limitation elaborates on the abstract idea described above, wherein a person continues the evaluation process to determine whether the textual description of the media at hand provides a response comparable to the request (i.e., the query), and then adjusts the process accordingly (prompting), which does not amount to significantly more than the abstract idea.
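For illustration of the examiner’s understanding only, the iterative behavior recited in claims 3-6 may be sketched as follows; the interfaces shown (describe, can_answer, answer, continue_from, max_output_tokens) are hypothetical and do not appear in the claims or the specification:

```python
# Non-limiting sketch of the iteration recited in claims 3-6;
# all interfaces are hypothetical.

def refine_until_answerable(subset, query, second_model, language_model,
                            max_tokens=256, max_rounds=5):
    descriptions = []
    for _ in range(max_rounds):
        # Claim 4: each repetition instructs the second vision model to
        # generate additional tokens that continue the previous output.
        text = second_model.describe(subset,
                                     continue_from=descriptions,
                                     max_output_tokens=max_tokens)
        descriptions.append(text)

        # Claim 5: the instruction includes increasing the maximum
        # output token limit.
        max_tokens *= 2

        # Claim 6: prompt the language model to determine whether it can
        # answer the query using the accumulated descriptions.
        if language_model.can_answer(query, descriptions):
            # Claim 3: repetition stops once the language model is able
            # to answer the query based on the additional descriptions.
            return language_model.answer(query, descriptions)
    return language_model.answer(query, descriptions)
```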
Claim 7 is dependent on claim 1 and includes all the limitations of claim 1. Further, the claim recites the additional limitations of “wherein selecting the subset of the plurality of clips includes embedding the query and the first textual descriptions in a latent space and selecting the subset of the plurality of clips according to a similarity between the embedded query and the embedded first textual descriptions.” The recited language of “wherein selecting the subset of the plurality of clips includes embedding the query and the first textual descriptions in a latent space” merely recites an insignificant extra-solution activity of data transmission/storage of information. Further, the claim recites “selecting the subset of the plurality of clips according to a similarity between the embedded query and the embedded first textual descriptions”, which describes the process of evaluating the information describing the media/image/clip at hand and comparing it to another set of information, which again describes a mental process.
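For illustration of the examiner’s understanding only, the embedding-and-similarity selection recited in claim 7 may be sketched as follows; the embedding function and top-k cutoff are hypothetical and are not drawn from the claims or the specification:

```python
# Non-limiting sketch of claim 7's latent-space selection;
# embed() and top_k are hypothetical.
import numpy as np

def select_by_similarity(query, first_descriptions, embed, top_k=3):
    # Embed the query and the first textual descriptions in a latent space.
    q = embed(query)
    vectors = {cid: embed(text) for cid, text in first_descriptions.items()}

    def cosine(a, b):
        # Cosine similarity between two embedding vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Select the subset according to similarity between the embedded query
    # and the embedded first textual descriptions.
    scores = {cid: cosine(q, v) for cid, v in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```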
Claim 8 is dependent on claim 1 and includes all the limitations of claim 1. The claim recites the additional limitations of “grouping the plurality of clips according to image complexity.” This additional limitation elaborates on the abstract idea described above where a person can include additional criteria to group/cluster the images being examined/processed, which does not amount to significantly more than the abstract idea.
Claim 9 is dependent on claim 1 and includes all the limitations of claim 1. The claim recites the additional limitations of “wherein pre-processing the plurality of clips includes identifying a domain of the input video and selecting the first vision model according to the domain.” This additional limitation elaborates on the abstract idea described above where a person can include additional criteria to identify a particular domain/sector/scope of the images being examined/processed, which does not amount to significantly more than the abstract idea.
Claim 10 is dependent on claim 9 and includes all the limitations of claim 9. The claim recites the additional limitations of “updating the domain based on the query.” This additional limitation elaborates on the abstract idea described above where a person can evaluate the domain/sector/scope of the image being examined/processed based on the user’s request/query, which does not amount to significantly more than the abstract idea.
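For illustration of the examiner’s understanding only, the grouping and domain-selection features of claims 8-10 may be sketched as follows; the complexity proxy, domain classifier, and model registry are all hypothetical:

```python
# Non-limiting sketch of claims 8-10; the complexity proxy, domain
# classifier, and model registry are all hypothetical.
import numpy as np

def image_complexity(frame: np.ndarray) -> float:
    # Hypothetical complexity proxy: fraction of distinct grayscale values.
    return len(np.unique(frame)) / 256.0

def preprocess_with_domain(clips, query, domain_classifier, model_registry):
    # Claim 9: identify a domain of the input video and select the
    # first vision model according to the domain.
    domain = domain_classifier.identify(clips)
    # Claim 10: update the domain based on the query.
    domain = domain_classifier.update(domain, query)
    first_model = model_registry[domain]

    # Claim 8: group the plurality of clips according to image complexity.
    groups = {"simple": [], "complex": []}
    for clip in clips:
        label = "complex" if image_complexity(clip) > 0.5 else "simple"
        groups[label].append(clip)
    return first_model, groups
```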
Independent claim 11 recites similar limitations to claim 1 and is therefore rejected for similar reasons as explained above.
Dependent claims 12-20 recite similar limitations to claims 2-10 and are therefore rejected for similar reasons as detailed above.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-3, 7-13 and 17-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US Patent Application Publication (US 2022/0122357 A1) to Zhao et al. (hereinafter “ZHAO”).
Regarding claim 1 (Original), ZHAO teaches a computer-implemented method for video analysis, comprising:
pre-processing a plurality of clips of an input video using a first vision model to generate respective first textual descriptions for the plurality of clips (ZHAO Para. [0041]: “…, the present disclosure utilizes a variety of terms to describe features and benefits of the query-response system. For example, as used herein, the term “video segment” refers to a group of video frames from a digital video. In particular, a video segment can include a set of video frames that corresponds to a particular portion of a video. Such video frames may correspond to a time before, during, and/or after the query-response system receives a question from a user regarding the video.”; and
Fig. 3, Para. [0065]: “…, the query-response system 106 analyzes the video frames 316 using the context-network layers 306. In FIG. 3, the video frames 316 include visual features 318 and transcript text 320… the query-response system 106 applies (i) one or more detection neural networks to the visual features 318 and (ii) a graphical-object-matching engine to the visual features 318 to generate visual-context vectors.”; and
Para. [0066]: “…, at the context-network layers 306, the query-response system 106 can extract from the video frames 316 dual contextual modalities from a video segment and thereby generate vectors representing both the visual features 318 and the transcript text 320 from the video segment.”,
the examiner notes that the reference discloses a system that uses detection neural networks and context-network layers, i.e., vision models, to process a group of video frames, i.e., video clips, and generate visual-context vectors, which corresponds to pre-processing an input video using a vision model to generate textual descriptions);
selecting a subset of the plurality of clips that are relevant to a query based on the first textual descriptions (ZHAO Para. [0038]: “…, the query-response-neural network architecture of the query-response system increases the amount and format of data representing context for answering a question during a video. In particular, the query-response system analyzes both visual features (e.g., uncaptioned visual context) corresponding to a particular video segment and textual features (e.g., textual context) for the video segment and the question itself.”; and
Fig. 2, Para. [0060]: “…, FIG. 2 illustrates video frames 202a-202c with reference to which the query-response system 106 can provide a response 212 to a question 210. As shown in FIG. 2, the video frames 202a-202c correspond to instructions 208 regarding how to perform a first operation using a software application (e.g., “How to perform ‘File-Save’”). Further, during playback of a video segment comprising the video frames 202a-202c, the query-response system 106 receives the question 210 regarding a second operation using the software application (e.g., “How do I open a new file?”).”; and
Fig. 2, Para. [0061]: “…, the query-response system 106 can also analyze visual features depicted in one or more of the video frames 202a-202c. For example, the query-response system 106 can analyze visual features representing a menu bar 204 in each of the video frames 202a-202c and visual features depicting a drop-down menu 206 in the video frames 202b-202c. Thus, even though the instructions 208 correspond to a first operation (and not the second operation), the query-response system 106 can perform acts and algorithms disclosed herein to provide the response 212 based on the visual features of the video frames 202a-202c.”,
the examiner asserts that the reference’s disclosure of “for answering a question during a video. In particular, the query-response system analyzes both visual features (e.g., uncaptioned visual context) corresponding to a particular video segment and textual features (e.g., textual context) …” corresponds to the claimed language “selecting a subset of the plurality of clips that are relevant to a query based on the first textual descriptions.”);
generating additional textual descriptions for the selected subset using a second vision model (ZHAO Para. [0038]: “By applying context-network layers to generate context vectors representing both visual features and transcript text, the query-response system can account for visual information shown in the video segment and auditory information spoken during such a video segment. When the textual context does not sufficiently account for such visual information, the query-response system uses a neural network that captures another mode of visual information.”; and
Para. [0041]: “…, the present disclosure utilizes a variety of terms to describe features and benefits of the query-response system. For example, as used herein, the term “video segment” refers to a group of video frames from a digital video. In particular, a video segment can include a set of video frames that corresponds to a particular portion of a video. Such video frames may correspond to a time before, during, and/or after the query-response system receives a question from a user regarding the video.”; and
Fig. 3, Para. [0067]: “After the context-network layers 306 generates visual-context vectors and textual-context vectors, the query-response-neural network 302 can apply the posterior layers 309 to one or more of such context vectors—including one or both of the recurrent-neural-network layers 310 and the attention mechanism 312.”,
the examiner notes that after the context-network layers 306 generate visual-context vectors and textual-context vectors, the query-response-neural network 302 can apply the posterior layers 309 to one or more of such context vectors, which corresponds to generating additional textual descriptions for the selected subset using a second vision model); and
answering the query using the additional textual descriptions (ZHAO Fig. 3, Para. [0062]: “As mentioned above, the query-response system 106 can utilize a query-response-neural network to analyze multiple contextual modalities of a video segment and therefore provide improved responses to questions. In accordance with one or more embodiments of the present disclosure, FIG. 3 illustrates a schematic diagram of the query-response system 106 utilizing a query-response-neural network 302 to provide a response 326 based on a question 314. As shown, the query-response-neural network 302 comprises question-network layers 304, context-network layers 306, response-network layers 308, and posterior layers 309. As further shown, the posterior layers 309 comprise recurrent-neural-network layers 310 and optionally (as denoted by the dotted lines) an attention mechanism 312. As described below, the query-response system 106 analyzes (i) the question 314 using the question-network layers 304, (ii) video frames 316 using the context-network layers 306, and (iii) candidate responses 324 using the response-network layers 308. In so doing, the query-response system 106 can generate the response 326 in real-time (or approximately real-time) after receiving the question 314”).
Regarding claim 2 (Original), ZHAO teaches the limitations of claim 1. Further, ZHAO teaches wherein the second vision model has more parameters than the first model (ZHAO Para. [0032]: “In addition to the question-network layers and the context-network layers, the query-response system can utilize posterior layers to analyze a query vector and corresponding context vectors. For instance, in some cases, the query-response system uses a recurrent neural network (“RNN”) to analyze one or both of visual-context vectors and textual-context vectors. In some implementations, RNN layers comprise bi-directional recurrent layers of one or more gated recurrent units (“GRUs”). By using GRUs or other RNN layers, the query-response system can capture visual cues or transcript cues from different frames of a video segment.”).
Regarding claim 3 (Original), ZHAO teaches the limitations of claim 1. Further, ZHAO teaches wherein generating the additional textual descriptions is repeated until a language model is able to answer the query based on the additional textual descriptions (ZHAO Fig. 1, Fig. 6A/B, Para. [0100]: “…, the query-response system 106 can utilize the detection neural network 604 to detect visual features within video frames of a video segment. FIG. 6B illustrates the query-response system 106 training the detection neural network 604 to detect visual features in accordance with one or more embodiments of the present disclosure. Although shown as a single training cycle, the query-response system 106 may perform the training acts and/or algorithms of FIG. 6B in an iterative manner, for example, until a point of convergence. As shown, the query-response system 106 generates synthetic-training images 626 based on various combinations of a background image 620 (e.g., a random background image) and a graphical object 622 (e.g., a software-user-interface component). Each such synthetic-training image can include an object superimposed on a background image. In some embodiments, the query-response system 106 utilizes objects from the external domain knowledge (e.g., the knowledge base 430 of FIG. 4) as a graphical object for generating the synthetic-training images 626.”).
Regarding claim 7 (Original), ZHAO teaches the limitations of claim 1. Further, ZHAO teaches wherein selecting the subset of the plurality of clips includes embedding the query and the first textual descriptions in a latent space and selecting the subset of the plurality of clips according to a similarity between the embedded query and the embedded first textual descriptions (ZHAO Para. [0046]: “…, in some embodiments, the response-network layers generate the candidate-response vectors utilizing pre-trained vectors based on external domain knowledge to modify, weight, and/or filter candidate responses to a question. Additionally or alternatively, the response-network layers can learn embeddings of the candidate responses to generate the candidate-response vectors.”; and
Para. [0051]: “…, the term “similarity” as used herein refers to a relationship or likeness between vectors, embeddings, or other features. In particular, the query-response system can determine, based on a comparison, a similarity between textual-feature embeddings and training-sample-textual-feature embeddings …, the query-response system can use a threshold similarity to determine whether textual-feature embedding(s) and a given training-sample-textual feature embedding are a match.”).
Regarding claim 8 (Original), ZHAO teaches the limitations of claim 1. Further, ZHAO teaches grouping the plurality of clips according to image complexity (ZHAO Fig. 3, Para. [0032]: “In addition to the question-network layers and the context-network layers, the query-response system can utilize posterior layers to analyze a query vector and corresponding context vectors. For instance, in some cases, the query-response system uses a recurrent neural network (“RNN”) to analyze one or both of visual-context vectors and textual-context vectors. In some implementations, RNN layers comprise bi-directional recurrent layers of one or more gated recurrent units (“GRUs”). By using GRUs or other RNN layers, the query-response system can capture visual cues or transcript cues from different frames of a video segment.”; and
Para. [0049]: “…, the term “attention mechanism” refers to specific types of network layers that identify features of importance or emphasis. In particular, an attention mechanism can generate or modify a vector to identify weighted portions or features of one or both of a query vector or context vectors. For example, the attention mechanism can include a spatial-attention mechanism, a temporal-attention mechanism, or both in a dual-attention mechanism. …, the spatial-attention mechanism analyzes visual features and corresponding spatial information from one or more frames of a video segment to generate a specific attention-weighted vector referred to as a “precursor query-context vector.”).
Regarding claim 9 (Original), ZHAO teaches the limitations of claim 1. Further, ZHAO teaches wherein pre-processing the plurality of clips includes identifying a domain of the input video and selecting the first vision model according to the domain (ZHAO Fig. 3, Para. [0069]: “…, the knowledge base 322 includes responses appropriate to any video on the domain or subject, even if one of those videos was not used to generate responses that are included in the knowledge base 322 as one of the candidate responses 324.”).
Regarding claim 10 (Original), ZHAO teaches the limitations of claim 9. Further, ZHAO teaches updating the domain based on the query (ZHAO Fig. 3, Para. [0069]: “To generate the candidate-response vectors just mentioned, the response-network layers 308 can identify responses from the candidate responses 324 based on a knowledge base 322. The knowledge base 322 can include responses for a plurality of questions, not just potential responses to the question 314. In one example, the knowledge base 322 includes responses to questions gathered for a plurality of videos on a domain or subject, not just a single video on that subject”).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 4-6 and 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over US Patent Application Publication (US 2022/0122357 A1) to Zhao et al. (hereinafter “ZHAO”) in view of US Patent (US 11,947,923 B1) to Jain et al. (hereinafter “JAIN”).
Regarding claim 4 (Original), ZHAO teaches the limitations of claim 3. However, ZHAO does not explicitly teach wherein each repetition of generating the additional textual descriptions instructs the second vision model to generate additional tokens that continue a previous output.
But JAIN teaches wherein each repetition of generating the additional textual descriptions instructs the second vision model to generate additional tokens that continue a previous output (JAIN Fig. 1, Col. 15, line (49): “This process can be iteratively repeated for N iterations (where N is a positive integer that may be fixed (e.g., defined by a developer associated with the multimedia content management system 120) or dynamic (e.g., based on one or more of: a token limit for the LLM, a temporal constraint for the LLM, or a computational constraint for the LLM)).”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of ZHAO (disclosing methods for generating responses to queries about videos utilizing multi-modal neural networks) to include the teachings of JAIN (disclosing methods for multimedia content management for large language model(s) and/or other generative model(s)) and arrive at a method that enables users of a large language model (LLM) to fine-tune the LLM for accuracy of query responses. One of ordinary skill in the art would have been motivated to make this combination because enabling the LLM to be fine-tuned according to users’ needs enables the LLM to generate LLM output including the sequence of tokens over the non-generative multimedia content tags and/or the generative multimedia content prompts, as recognized by JAIN (Abstract, Cols. 11-12). In addition, the references of ZHAO and JAIN teach features that are analogous art and are directed to the same field of endeavor of management of large language models.
Regarding claim 5 (Original), the combination of ZHAO and JAIN teaches the limitations of claim 4.
Further, JAIN teaches wherein the instruction to generate additional tokens includes increasing a maximum output token limit (JAIN Fig. 1, Col. 15, line (49): “This process can be iteratively repeated for N iterations (where N is a positive integer that may be fixed (e.g., defined by a developer associated with the multimedia content management system 120) or dynamic (e.g., based on one or more of: a token limit for the LLM, a temporal constraint for the LLM, or a computational constraint for the LLM)).”).
Regarding claim 6 (Original), ZHAO teaches the limitations of claim 3.
However, ZHAO does not explicitly teach prompting the language model to determine whether the language model can answer the query using the first textual descriptions and the additional textual descriptions.
But JAIN teaches prompting the language model to determine whether the language model can answer the query using the first textual descriptions and the additional textual descriptions (JAIN Fig. 2, Col. 14, line (5): “In implementations where the LLM output 204 itself and/or in the textual content 205 includes the non-generative multimedia content tag, the multimedia content retrieval engine 164 can determine a multimedia content query based on the non-generative multimedia content tag and submit the multimedia content query to a multimedia content search system (e.g., to one or more of the search system(s) 180 and over one or more of the networks 199). In response to the multimedia content query being submitted, the multimedia content retrieval engine 164 can receive the non-generative multimedia content item as the multimedia content 206 that is to be included in the response 208. Further, in implementations where the LLM output 204 itself and/or in the textual content 205 includes the generative multimedia content prompt, the multimedia content retrieval engine 164 can cause the non-generative multimedia content tag to be submitted to the given generative multimedia content model (e.g., via the generative system(s) 190 and over one or more of the networks 199). In response to the generative multimedia content prompt being submitted to the given generative multimedia content model, the multimedia content retrieval engine 164 can receive the generative multimedia content item as the multimedia content 206 that is to be included in the response 208.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of ZHAO (disclosing methods for generating responses to queries about videos utilizing multi-modal neural networks) to include the teachings of JAIN (disclosing methods for multimedia content management for large language model(s) and/or other generative model(s)) and arrive at a method that enables users of a large language model (LLM) to fine-tune the LLM for accuracy of query responses. One of ordinary skill in the art would have been motivated to make this combination because enabling the LLM to be fine-tuned according to users’ needs enables the LLM to generate LLM output including the sequence of tokens over the non-generative multimedia content tags and/or the generative multimedia content prompts, as recognized by JAIN (Abstract, Cols. 11-12). In addition, the references of ZHAO and JAIN teach features that are analogous art and are directed to the same field of endeavor of management of large language models.
Regarding independent claim 11, the aforementioned claim recites similar limitations to claim 1 and is therefore rejected for similar reasons as mentioned above.
Regarding dependent claims 12-20, the aforementioned claims recite similar limitations to claims 2-10 and are therefore rejected for similar reasons as mentioned above.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Terrell (US 2005/0044105 A1): “Methods for delivery of content-specific video clips, wherein a database is maintained which stores multiple video "clips" or segments in video data files, extracted from full-length films and videos, including feature films. Each of the clips is associated with multiple keywords, including keywords related to the concepts presented or illustrated by the clip.”
Najdenkoska et al. (US 2025/0094482 A1): “Methods for image categorization using a visual language model, wherein a query includes a request to categorize a query image of a class represented in the context and in response to the query, the visual language model performs an open-ended generative categorization of the query image.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Zuheir A Mheir whose telephone number is (571)272-4151. The examiner can normally be reached Monday - Friday 9:00 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Vital can be reached at (571) 272-4215. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
3/18/2026
/ZUHEIR A MHEIR/Patent Examiner, Art Unit 2198
/PIERRE VITAL/Supervisory Patent Examiner, Art Unit 2198