Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
This Office action is sent in response to Applicant's communication received on 6/28/2024 for application number 18/759,129. The Office hereby acknowledges receipt of the following items placed of record in the file: Specification, Abstract, Oath/Declaration, and Claims.
Priority
This application claims priority to Chinese Patent Application No. 202310813786.8, filed on July 4, 2023, the disclosure of which is incorporated herein by reference.
Status of the Claims
Claims 1-20 are presented for examination.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-2, 8-9 and 15-16 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Claim 8 recites:
An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, causes the at least one processor to perform a method for generating video dialog question answering data, which comprises: determining target video description information corresponding to a target video; determining a target prompt used for a target question answering model, wherein the target question answering model is pre-configured based on a large language model, and the target prompt is capable of guiding the target question answering model to a desired dialog question answering effect based on the target video description information when the target question answering model executes a dialog question answering generation task; and outputting, using the target question answering model and based on the target video description information and the target prompt, dialog question answering data associated with the target video.
Claim 8 can be broken down into the following steps:
determining target video description information corresponding to a target video – a human can look at the video and determine the video description, i.e., what the video is about
determining a target prompt used for a target question answering model, wherein the target question answering model is pre-configured based on a large language model – a human can decide what prompt to write to invoke a particular model. The target question answering model is an additional element
the target prompt is capable of guiding the target question answering model to a desired dialog question answering effect based on the target video description information when the target question answering model executes a dialog question answering generation task – a human can decide how to write a prompt so as to arrive at an answer that is present in the video
outputting, using the target question answering model and based on the target video description information and the target prompt, dialog question answering data associated with the target video – a human can determine and write down the answer based on the video or on the prompt that indicates the answer
Step 1: This part of the eligibility analysis evaluates whether the claim falls within any statutory category. See MPEP 2106.03. The claim recites an electronic device, hence a machine, which is one of the statutory categories of invention. (Step 1: YES).
Step 2A, Prong One: This part of the eligibility analysis evaluates whether the claim recites a judicial exception. As explained in MPEP 2106.04, subsection II, a claim “recites” a judicial exception when the judicial exception is “set forth” or “described” in the claim. As discussed above, the broadest reasonable interpretation of steps (a)-(d) recites a mental process: for example, a human can watch a video, decide how to prompt the language model to guide the user to the portion where the answer is present, and output (write down) a specific answer. Hence the claim encompasses mental processes practically performed in the human mind by observation, evaluation, judgment, and opinion. See MPEP 2106.04(a)(2), subsection III. (Step 2A, Prong One: YES).
Step 2A, Prong Two: This part of the eligibility analysis evaluates whether the claim as a whole integrates the recited judicial exception into a practical application of the exception or whether the claim is “directed to” the judicial exception. This evaluation is performed by (1) identifying whether there are any additional elements recited in the claim beyond the judicial exception, and (2) evaluating those additional elements individually and in combination to determine whether the claim as a whole integrates the exception into a practical application. See MPEP 2106.04(d). The claim recites that the question answering model is pre-configured based on a large language model, and recites prompting that model. The large language model, and the prompt that supplies instructions to it, are recited at a high level of generality: the prompt pertains to any LLM, and the instructions in the prompt are simply executed by the LLM; hence these are mere instructions to apply the exception using a generic computer component. The other additional elements, such as the processor, memory, and/or non-transitory computer-readable medium, are likewise recited at a high level of generality and are mere generic computer components. Accordingly, these additional elements do not integrate the abstract idea of a human viewing a video and arriving at an answer into a practical application because they do not impose any meaningful limits on practicing the abstract idea. (Step 2A, Prong Two: NO), and the claim is directed to the judicial exception. (Step 2A: YES).
Step 2B: As discussed with respect to Step 2A, Prong Two, the additional elements in the claim amount to no more than mere instructions to apply the exception using a generic computer component. The same analysis applies here at Step 2B; i.e., mere instructions to apply an exception using a generic computer component cannot integrate a judicial exception into a practical application at Step 2A or provide an inventive concept at Step 2B. The claim is ineligible.
Regarding claims 1 and 15, analysis analogous to that of claim 8 is applicable.
Regarding claims 2, 9 and 16, the limitation that the video description information comprises a description of a video title, a description of a subject and a local detail event between subjects in a single frame of video picture, a description of a global detail event between subjects expressed sequentially in a plurality of consecutive frames of video pictures, a position of a subject in a video picture, and content of dialog text of the subject in the video picture amounts to mere data gathering activity. This activity does not integrate the abstract idea into a practical application; therefore, the analysis under Step 2A, Prong Two and Step 2B (as discussed regarding claim 8) applies to this situation as well.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1, 5, 8, 12, 15 and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wang (Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners).
Regarding claim 1, Wang teaches a method for generating video dialog question answering data (video-to-text generation tasks, such as video captioning and video question answering; Section 3.2, Video Level: Temporal-Aware Few-shot Prompting), the method comprising: determining target video description information corresponding to a target video (each video instance is represented by the aggregated visual tokens, e.g., "Objects: First, bath toy. Then,...", the frame captions, such as "Frame Captions: First, a toddler playing in a bathtub filled with toys. Then,...", and the ASR inputs if available, e.g., "Subtitle:"; Fig. 2; Section 3, Method); determining a target prompt used for a target question answering model, wherein the target question answering model is pre-configured based on a large language model ("The few-shot prompt consists of three parts: instruction, few-shot context, and task query. The instruction is a concise description of the generation task, e.g., 'Generate a video caption based on the objects, events, attributes and frame captions. Example:', which is proved to be effective in zero-shot and few-shot settings [6, 59]. The few-shot context contains the selected in-context examples as well as the test video instance. Each video instance is represented by the aggregated visual tokens, e.g., 'Objects: First, bath toy. Then,...', the frame captions, such as 'Frame Captions: First, a toddler playing in a bathtub filled with toys. Then,...', and the ASR inputs if available, e.g., 'Subtitle:'. Finally, the task query is a task-specific suffix indicating the target text format, e.g., 'Video Caption:'. For in-context examples (omitted here for simplicity), the task query is followed by ground truth annotation, while for the test instance, the generation starts at the end of the task query." Section 3.3), and the target prompt is capable of guiding the target question answering model to a desired dialog question answering effect based on the target video description information when the target question answering model executes a dialog question answering generation task (the prompt template, including the answer in response to a question, is an input to a language model; Fig. 2-3); and outputting, using the target question answering model and based on the target video description information and the target prompt, dialog question answering data associated with the target video (e.g., "bathtub" is the dialog answer data; Section 3, Method, Fig. 2; or Fig. 3 using temporal prompting: "Temporal-aware prompt successfully distinguishes the Sunset and Sunrise scenarios based on the temporal ordering change of objects and frame captions, while the static prompt fails.").
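Examiner's note: for clarity of the record, the three-part few-shot temporal-aware prompt construction quoted above from Wang's Section 3.3 may be illustrated with the following minimal Python sketch. The function names, data layout, and exact string formatting are the examiner's assumptions for illustration only and are not taken from Wang's implementation.

TEMPORAL_MARKERS = ["First,", "Then,", "Finally,"]

def build_video_context(objects, frame_captions, subtitle=None):
    """Represent one video instance as aggregated visual tokens, temporally
    ordered frame captions, and an optional ASR subtitle (per Wang, Fig. 2).
    Assumes at most three temporally ordered entries, as in the Fig. 2 example."""
    lines = [
        "Objects: " + " ".join(
            f"{marker} {obj}." for marker, obj in zip(TEMPORAL_MARKERS, objects)),
        "Frame Captions: " + " ".join(
            f"{marker} {cap}." for marker, cap in zip(TEMPORAL_MARKERS, frame_captions)),
    ]
    if subtitle:
        lines.append("Subtitle: " + subtitle)
    return "\n".join(lines)

def build_prompt(instruction, in_context_examples, test_video, task_query):
    """Instruction, then few-shot context (each example followed by its ground
    truth), then the test instance ending at the task query, where the
    language model's generation begins (per Wang, Section 3.3)."""
    parts = [instruction]
    for video, ground_truth in in_context_examples:
        parts.append(build_video_context(**video))
        parts.append(task_query + " " + ground_truth)
    parts.append(build_video_context(**test_video))
    parts.append(task_query)  # generation starts here for the test instance
    return "\n".join(parts)

For example, calling build_prompt with the instruction "Generate a video caption based on the objects, events, attributes and frame captions. Example:" and the task query "Video Caption:" would yield a prompt of the general form shown in Wang's Figure 2.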
Regarding claim 5, Wang, as applied above in claim 1, teaches wherein the outputting, using the target question answering model and based on the target video description information and the target prompt, video dialog question answering data associated with the target video comprises: adding the target video description information to a preset position indicated by the target prompt, to obtain target input information for the target question answering model (Fig. 10, prompts); and controlling, based on the target input information, the target question answering model to execute the dialog question answering generation task, and outputting the video dialog question answering data of the target video based on the execution of the dialog question answering generation task (answering based on the prompt, Fig. 10).
Regarding claim 8, arguments analogous to those for claim 1 are applicable. In addition, Wang teaches an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, causes the at least one processor to perform the method of claim 1 (Fig. 1-3).
Regarding claim 12, arguments analogous to those for claim 5 are applicable.
Regarding claim 15, arguments analogous to those for claim 1 are applicable. In addition, Wang teaches a non-transitory computer-readable medium, storing computer instructions that, when executed by a processor, cause a method for generating video dialog question answering data to be implemented, as in claim 1 (Fig. 1-3).
Regarding claim 19, arguments analogous to those for claim 5 are applicable.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
See KSR, 550 U.S. at 418, 82 USPQ2d at 1396. Exemplary rationales that may support a conclusion of obviousness include:
(A) Combining prior art elements according to known methods to yield predictable results;
(B) Simple substitution of one known element for another to obtain predictable results;
(C) Use of known technique to improve similar devices (methods, or products) in the same way;
(D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results;
(E) "Obvious to try" – choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success;
(F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art;
(G) Some teaching, suggestion, or motivation in the prior art that would have led one of ordinary skill to modify the prior art reference or to combine prior art reference teachings to arrive at the claimed invention.
See MPEP § 2143 for a discussion of the rationales listed above along with examples illustrating how the cited rationales may be used to support a finding of obviousness. See also MPEP § 2144 - § 2144.09 for additional guidance regarding support for obviousness determinations.
Claims 2, 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners) in view of Rose (US 20220139383).
Regarding claim 2, Wang, as applied above in claim 1, teaches a description of a subject and a local detail event between subjects in a single frame of video picture (object and event, Fig. 2), a description of a global detail event between subjects expressed sequentially in a plurality of consecutive frames of video pictures (as shown in the few-shot context in Figure 2, each visual token and frame caption is prefixed with a natural language phrase indicating its temporal ordering, e.g., "First,", "Then,", and "Finally,"; Section 3.3), a position of a subject in a video picture (e.g., the toddler in the bathtub, Fig. 2), and content of dialog text of the subject in the video picture (video level, e.g., what the toddler is doing, Fig. 2, and ASR for video captioning).
Wang does not explicitly teach wherein the video description information comprises a description of a video title.
However, Rose teaches that the video description information comprises a description of a video title (the topic determiner 108 may determine the topic of discussion by determining one or more keywords associated with a topic from the transcript, applying data representative of the transcript to a neural network to compute data indicative of the topic, by employing computer vision to analyze frames of video included in the stream data 104 to determine the topic, and/or by analyzing metadata (e.g., title, description, host, creator, and/or author) associated with the stream data 104; Para 0026).
It would have been obvious to one of ordinary skill in the art before the effective filing date, having the teachings of Wang, to further include the concept of Rose so as to determine information related to the video, e.g., the topic (Para 0026, Rose).
Regarding claim 9, arguments analogous to those for claim 2 are applicable.
Regarding claim 16, arguments analogous to those for claim 2 are applicable.
Claims 3, 10 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners) in view of Rose (US 20220139383), and further in view of Pujari (US 20240048839).
Regarding claim 3, Wang, as applied above in claim 2, teaches wherein the determining target video description information corresponding to a target video comprises: detecting a subject appearing in at least two frames of video pictures extracted from the target video (multiple frame indices), and determining, based on a frame, a position of the subject in the video picture ("we consider the frame index from which they are extracted from as their temporal indicator. If a visual token occurs in multiple frames, we use the averaged frame index as its temporal indicator"; Appendix C, Additional Experimental Details; e.g., where a subject is and what it is doing, Fig. 2, Fig. 3); determining, upon detecting that there is a subject appearing in the video picture and that a text subtitle appears in the video picture, the appearing text subtitle as content of dialog text of the subject in the video picture ("Generate a video caption based on the objects, events, attributes, frame captions and subtitle"; see also answering questions: "Answer the question based on the objects, events, attributes and frame captions. Example"; Pages 20-22); and upon detecting that there is a subject appearing in the video picture and that there is a matching audio in the video picture, converting the matching audio into text, and then determining the text as content of dialog text of the subject in the video picture (the subtitle is an ASR transcript, Pages 21-22).
Wang modified by Rose does not explicitly teach detecting whether there is a subject appearing in at least two frames of video pictures extracted from the target video, and determining, upon detecting that there is a subject appearing, a position of the subject in the video picture.
However, Pujari teaches detecting whether there is a subject appearing in at least two frames of video pictures extracted from the target video, and determining, upon detecting that there is a subject appearing, a position of the subject in the video picture (object detection systems generally include (i) an object detection component (e.g., a first machine learning model) that identifies the presence and location of one or more objects in the frames of the video and (ii) an object tracking component (e.g., a second machine learning model) that tracks the movement of each detected object over time across the frames of the video; Para 0047).
It would have been obvious, having the teachings of Wang, which mentions the concept of using at least four frames, to include the concept of Pujari of tracking an object across at least two frames in order to improve accuracy in detecting the object/subject.
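Examiner's note: the averaged-frame-index temporal indicator quoted in the rejection of claim 3 above (Wang, Appendix C) may be illustrated with the following minimal sketch; the function name and input format are the examiner's hypothetical assumptions for illustration only.

from collections import defaultdict

def temporal_indicators(detections):
    """detections: (frame_index, visual_token) pairs gathered from frames
    extracted from the target video. A token detected in several frames is
    assigned the averaged frame index as its temporal indicator (Wang, App. C)."""
    frames_per_token = defaultdict(list)
    for frame_index, token in detections:
        frames_per_token[token].append(frame_index)
    return {token: sum(indices) / len(indices)
            for token, indices in frames_per_token.items()}

# e.g., temporal_indicators([(0, "toddler"), (2, "toddler"), (1, "bath toy")])
# -> {"toddler": 1.0, "bath toy": 1.0}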
Regarding claim 10, arguments analogous to those for claim 3 are applicable.
Regarding claim 17, arguments analogous to those for claim 3 are applicable.
Claims 4, 11 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners) in view of Agarwal (US 11468675).
Regarding claim 4, Wang, as applied above in claim 1, teaches wherein the determining a target prompt used for a target question answering model comprises: determining an application scenario of the dialog question answering data corresponding to the target video (objects and attributes of the frames, Fig. 2; Sections 3.2 and 3.3); and determining, from candidate prompts associated with the target question answering model, a target prompt matching the application scenario of the dialog question answering data corresponding to the target video ("At video level, we construct video representation by aggregating the visual tokens, frame captions and other text modalities such as ASR, using a few-shot temporal aware prompt. We then feed the prompt to a pre-trained language model together with task-specific instructions to generate target text for a variety of video-language tasks. Examples of the full prompt for different tasks can be found in Appendix B."; Fig. 2).
Wang does not explicitly teach determining, in response to a select operation for the target video, an application scenario of the dialog question answering data corresponding to the target video.
However, Agarwal teaches determining, in response to a select operation for the target video, an application scenario of the dialog question answering data corresponding to the target video (the content selection module 708 may select a single instance of video content or more than one instance of video content; the selection of video content may be in response to received user input (e.g., the manual selection process 206 of FIG. 2) or the video content may be selected according to a predetermined rules set; further, once the video is selected, the scene is determined, Fig. 8: "At 804, a plurality of scenes within video content may be identified", Col. 19, lines 1-15).
It would have been obvious for Wang to further include the concept of Agarwal that a scene is detected after the selection of a video, so that, in the case of multiple videos, scene detection happens when the selection takes place; it is known in the art that a user interface may present multiple videos and that processing takes place when the selection happens (Fig. 8, Agarwal).
Regarding claim 11, arguments analogous to those for claim 4 are applicable.
Regarding claim 18, arguments analogous to those for claim 4 are applicable.
Claims 6, 13 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners) in view of Muhammad (Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models).
Regarding claim 6, Wang, as applied above in claim 1, teaches wherein the target prompt is configured with first prompt information, second prompt information, third prompt information, fourth prompt information, and fifth prompt information, the first prompt information is used to instruct the target question answering model to simulate watching of the target video to execute the dialog question answering generation task of creating and answering a question (each frame is fed to a language model which generates frame captions, object attributes, etc., Fig. 2), the second prompt information is used to instruct the target question answering model to use details comprised in the target video description information when the target question answering model executes the dialog question answering generation task so that an answer fits the target video description information (details, e.g., objects/attributes, Fig. 10; since the model is a large-language-style model, these instructions are called prompts), the third prompt information is used to instruct the target question answering model to give a definite answer when the target question answering model executes the dialog question answering generation task (the model answers questions/attributes, Fig. 10, Fig. 2), the fourth prompt information is used to instruct the target question answering model to ask a question of a preset type when the target question answering model executes the dialog question answering generation task (prompt question, Fig. 2, Fig. 10), and the fifth prompt information is used to instruct the target question answering model to give an answer comprising a detailed reasoning process when the target question answering model executes the dialog question answering generation task (answering, Fig. 10-12).
While Wang's design takes video frames as an input and passes them through an image-language model, Wang does not explicitly teach that the first prompt information is used to instruct the target question answering model to simulate watching of the target video to execute the dialog question answering generation task of creating and answering a question.
In the same field of endeavor, Muhammad teaches the first prompt information is used to instruct the target question answering model to simulate watching of the target video to execute the dialog question answering generation task of creating and answering a question (prompts based on the following template: "USER: <instruction> <vid-tokens> Assistant:"; using the notations, this can be represented as "USER: <Q1><Qv> Assistant:"; Section 3.2, Video Instruction Training).
It would have been obvious for Wang to include the concept of Muhammad before the effective filing date, since the large language model takes instructions as a prompt; substituting Wang's image-language model with a large language model is a simple substitution, and it also would have been obvious to try for Wang, as large models have increasingly become well known.
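Examiner's note: the instruction-prompt template quoted above from Muhammad's Section 3.2 may be sketched as follows; the helper name and placeholder handling are assumptions for illustration only, not Muhammad's implementation.

def format_video_instruction(instruction, vid_tokens):
    """Compose a single prompt per the quoted Video-ChatGPT template, placing
    the textual instruction and the video tokens in the user turn so that
    generation begins after the 'Assistant:' marker."""
    return f"USER: {instruction} {vid_tokens} Assistant:"

# e.g., format_video_instruction("Describe the video.", "<vid-tokens>")
# -> "USER: Describe the video. <vid-tokens> Assistant:"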
Regarding claim 13, arguments analogous to those for claim 6 are applicable.
Regarding claim 20, arguments analogous to those for claim 6 are applicable.
Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Wang (Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners) in view of Xie (US 20220394344).
Regarding claim 7, Wang, as applied above in claim 1, does not explicitly teach wherein after the outputting, using the target question answering model, dialog question answering data associated with the target video, the method further comprises: determining, in response to a filter operation for the dialog question answering data associated with the target video, target dialog question answering data from the dialog question answering data associated with the target video; and adjusting or replacing, in response to an edit operation for the target dialog question answering data, an answer corresponding to a question in the target dialog question answering data, so that an answer obtained through the adjustment or replacement fits the target video description.
However, Xie teaches determining, in response to a filter operation for the dialog question answering data associated with the target video, target dialog question answering data from the dialog question answering data associated with the target video (editing operation, Para 0007); and adjusting or replacing, in response to an edit operation for the target dialog question answering data, an answer corresponding to a question in the target dialog question answering data, so that an answer obtained through the adjustment or replacement fits the target video description (Fig. 21, editing operation, and the answer related to the video).
It would have been obvious, having the teachings of Wang, to further include the concept of Xie before the effective filing date to improve the efficiency of question answering (Para 0005, Xie).
Regarding claim 14, arguments analogous to those for claim 7 are applicable.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Vogel (US 20230321548) teaches a method for generating video dialog question answering data, the method comprising: determining target description information (Para 0191); determining a target prompt used for a target question answering model, wherein the target question answering model is pre-configured based on a large language model (a personalized prompt for the LLM may therefore be derived from text describing one or more of these objectives or some other objective provided by the account holder, Para 0191), and the target prompt is capable of guiding the target question answering model to a desired dialog question answering effect based on the target video description information when the target question answering model executes a dialog question answering generation task (time ranges and fragments of the video, Fig. 16); and outputting, using the target question answering model and based on the target video description information and the target prompt, dialog question answering data associated with the target video (Fig. 18).
Hang (Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Richa Sonifrank whose telephone number is (571)272-5357. The examiner can normally be reached M-T 7AM - 5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Phan Hai, can be reached at (571)272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Richa Sonifrank/Primary Examiner, Art Unit 2654