DETAILED ACTION
Introduction
1. This Office action is in response to Applicant's submission filed on 04/29/2024. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-20 are currently pending and are examined below.
Drawings
2. The drawings filed on 04/29/2024 have been accepted and considered by the Examiner.
Information Disclosure Statement
3. The Information Disclosure Statements (IDSs) filed on 04/29/2024 and 07/08/2024 have been considered and are in compliance with the provisions of 37 CFR 1.97.
Priority
4. The Applicant's claim of priority to U.S. Patent Application No. 18/482,828, filed October 6, 2023, has been accepted and considered in this Office action.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
5. Claims 1-2, 6-8, 10-15 and 17-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Zhang (U.S. Patent Application Publication No. 2024/0037896 A1).
With regards to claim 1, Zhang teaches a computer-implemented method comprising, for each of a plurality of segments of a video, processing the segment to generate a plurality of embeddings, each embedding of the plurality of embeddings corresponding to at least a portion of a subject matter represented in the segment of the plurality of segments of the video (Paragraphs 9-10 teach a method for answering questions about a video story by generating an input embedding, including classifying the vision data, the subtitle data, and the question data into a plurality of categories; extracting feature vectors for the plurality of respective categories; generating a feature embedding, a segment embedding, and a position embedding using the extracted feature vectors; and generating the input embedding by summing the feature embedding, the segment embedding, and the position embedding, where the plurality of categories may include one or more categories related to features of the character);
and representing the plurality of embeddings as a feature embedding for the segment of the video (Paragraphs 9-10, teach the method for answering questions about a video story by extracting feature vectors related to each character of a video from video data including vision data and subtitle data and question data for video questions and answers, and generating an input embedding using the feature vectors related to the character and training a transformer model using the input embedding);
receiving a query about the subject matter of the video (Para 37 and figures 1-5, teach an input/output unit configured to receive video data and question data and to output video story question answering results);
processing the query to generate a query embedding (Paragraphs 9-10, teach extracting feature vectors for the plurality of respective categories; generating a feature embedding, a segment embedding, and a position embedding using the extracted feature vectors; and generating the input embedding by summing the feature embedding, the segment embedding, and the position embedding);
projecting the query embedding in a vector space with each of the plurality of feature embeddings to determine one or more feature embeddings that are responsive to the query (Paragraphs 44-45, teach a control unit that generates an input embedding using the feature vectors extracted for the plurality of respective categories. The input embedding is the sum of a feature embedding, a segment embedding, and a modality-wise position embedding. The control unit generates the feature embedding by concatenating all feature vectors extracted for the plurality of respective categories, may generate the segment embedding by performing embedding lookups using a learnable embedding matrix for the plurality of respective categories, and may generate the modality-wise position embedding by generating vectors including position information related to the feature vectors extracted for the plurality of respective categories. The control unit answers questions about a video story by training the transformer model using the input embedding generated as described above);
determining, for each of the one or more feature embeddings that are responsive to the query and based at least on the plurality of embeddings included in the feature embedding, a descriptive text that is descriptive of the feature embedding and the subject matter of the segment corresponding to the feature embedding (Para 74 and equation 4, teach that the control unit generates the input embedding by adding the segment embedding corresponding to the category including each feature and the modality-wise position embedding according to the position of each feature to the feature embedding);
and generating, based at least in part on the descriptive text determined for each of the one or more feature embeddings, a natural language response that is responsive to the query representative of at least a portion of the subject matter of the video (Para 77 and figures 1-5, teach that the control unit applies VA as an input embedding to the encoder of the transformer model, and the decoder of the transformer model outputs an answer to a given question).
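For clarity of the record, the input-embedding construction that Zhang teaches in Paragraphs 9-10, 44-45, and 74 (equation 4) may be sketched as follows. This sketch is an illustrative aid only: all identifiers, dimensions, and category counts below are hypothetical choices made to keep the sketch self-contained and are not taken from the reference.

```python
# Illustrative sketch of Zhang's input-embedding construction
# (Paragraphs 9-10, 44-45, and 74, equation 4). All names, dimensions,
# and category counts are hypothetical, not taken from the reference.
import torch
import torch.nn as nn

EMBED_DIM = 256      # hypothetical model width
NUM_CATEGORIES = 4   # e.g., vision, subtitle, question, character features
MAX_POSITIONS = 128  # hypothetical maximum sequence length

class InputEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable embedding matrix used for segment-embedding lookups,
        # one row per category (Paras 44-45).
        self.segment_table = nn.Embedding(NUM_CATEGORIES, EMBED_DIM)
        # Modality-wise position embeddings (Paras 44-45).
        self.position_table = nn.Embedding(MAX_POSITIONS, EMBED_DIM)

    def forward(self, feature_vectors, category_ids, position_ids):
        # feature_vectors: (seq_len, EMBED_DIM), the feature vectors
        # extracted per category and concatenated along the sequence axis.
        # category_ids / position_ids: (seq_len,) integer indices.
        segment_emb = self.segment_table(category_ids)
        position_emb = self.position_table(position_ids)
        # The input embedding is the SUM of the feature, segment, and
        # position embeddings (Para 74, equation 4). Per Para 77, this sum
        # is applied to the encoder of a transformer model, whose decoder
        # outputs the answer to a given question.
        return feature_vectors + segment_emb + position_emb

# Usage with hypothetical values:
emb = InputEmbedding()
seq_len = 10
x = emb(torch.randn(seq_len, EMBED_DIM),
        torch.randint(0, NUM_CATEGORIES, (seq_len,)),
        torch.arange(seq_len))
print(x.shape)  # torch.Size([10, 256])
```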
With regards to claim 2, Zhang teaches the computer-implemented method of claim 1, further comprising determining a context for the query, wherein the context is determined based at least in part on one or more of the query, a query and answer session in which the query is presented, the video, or a user (Paragraphs 9-10, teach the method for answering questions about a video story by generating the input embedding including classifying the vision data, the subtitle data, and the question data into a plurality of categories; extracting feature vectors for the plurality of respective categories; generating a feature embedding, a segment embedding, and a position embedding using the extracted feature vectors; and generating the input embedding by summing the feature embedding, the segment embedding, and the position embedding where the plurality of categories may include one or more categories related to features of the character);
and wherein generating the natural language response is further based at least in part on the context (Paragraphs 9-10, further teach the method for answering questions about a video story by extracting feature vectors related to each character of a video from video data including vision data and subtitle data and question data for video questions and answers).
With regards to claim 6, Zhang teaches the computer-implemented method of claim 1, further comprising receiving a second query about the subject matter of the video (Para 37 and Figures 1-5 teach an input/output unit configured to receive video data and question data and to output video story question answering results; under the broadest reasonable interpretation, the system of Zhang is not limited in the number of queries it can receive for a particular video);
processing the query to determine the query is a request for a report relating to at least a portion of the subject matter of the video (Paragraphs 9-10, teach extracting feature vectors for the plurality of respective categories; generating a feature embedding, a segment embedding, and a position embedding using the extracted feature vectors; and generating the input embedding by summing the feature embedding, the segment embedding, and the position embedding);
and generating, in a natural language format and based at least in part on at least a portion of the feature embeddings generated for each segment of the video, the report (Para 77 and figures 1-5, teach that the control unit applies VA as an input embedding to the encoder of the transformer model, and the decoder of the transformer model outputs an answer to a given question).
With regards to claims 7-8 and 12, these are system claims for the corresponding method claims 1-2 and 6. These two sets of claims are related as method and apparatus of using the same, with each claimed system element's function corresponding to the claimed method step. Accordingly, claims 7-8 and 12 are similarly rejected under the same rationale as applied above with respect to method claims 1-2 and 6.
With regards to claims 14-15 and 18-19, please see the rejections of claims 1-2 and 6 above.
With regards to claim 10, Zhang teaches the system of claim 7, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least determine an activity of a plurality of activities that is occurring in the subject matter of the video (Para 11, teaches that the categories related to the features of the character may include a bounding box including the character in an image frame included in the video, the behavior of the character, and the emotion of the character);
and wherein the program instructions that cause the one or more processors to generate the natural language response further include program instructions that, when executed by the one or more processors, further cause the one or more processors to at least, generate the natural language response based at least in part on the activity and the feature embeddings (Para 12, teaches generating the feature embedding, the segment embedding, and the position embedding using the extracted feature vectors including generating the feature embedding by concatenating all the feature vectors extracted for the plurality of respective categories; generating the segment embedding by performing embedding lookups using a learnable embedding matrix for the plurality of respective categories and generating the position embedding by generating vectors including position information related to the feature vectors extracted for the plurality of respective categories. Para 37, teaches an input/output unit configured to receive video data and question data and to output video story question answering results based on the above embeddings).
With regards to claim 11, Zhang teaches the system of claim 7, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least present the natural language response at least one of audibly, visually, or haptically (Para 105, teaches a display of a GUI as an output device).
With regards to claim 13, Zhang teaches the system of claim 7, wherein the program instructions that, when executed by the one or more processors, further cause the one or more processors to at least determine, based at least in part on the query embedding, a second plurality of segments of a second video that represent subject matter that is responsive to the query (Paragraphs 9-10 teach the method for answering questions about a video story by generating the input embedding, including classifying the vision data, the subtitle data, and the question data into a plurality of categories; extracting feature vectors for the plurality of respective categories; generating a feature embedding, a segment embedding, and a position embedding using the extracted feature vectors; and generating the input embedding by summing the feature embedding, the segment embedding, and the position embedding, where the plurality of categories may include one or more categories related to features of the character. Under the broadest reasonable interpretation, the system of Zhang is not limited in the number of video segments it can analyze);
and wherein the program instructions that cause the one or more processors to generate the natural language response further include program instructions that, when executed by the one or more processors, further cause the one or more processors to at least determine, based at least in part on the feature embeddings generated for each of the plurality of segments of the video and feature embeddings generated for each of the second plurality of segments, the natural language response (Para 74 and equation 4 teach that the control unit generates the input embedding by adding the segment embedding corresponding to the category including each feature, and the modality-wise position embedding according to the position of each feature, to the feature embedding. Again, under the broadest reasonable interpretation, the system of Zhang is not limited in the number of responses it can generate).
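For clarity of the record, the recited step of projecting a query embedding in a vector space against per-segment feature embeddings (addressed for claim 1 above) may be sketched as a conventional cosine-similarity search, which extends naturally to segments pooled from a second video as recited in claim 13. This sketch is illustrative only; none of the identifiers, dimensions, or values are taken from Zhang or from the instant application.

```python
# Illustrative sketch of a cosine-similarity search of a query embedding
# against per-segment feature embeddings. Hypothetical illustration only;
# not code from Zhang or from the instant application.
import torch
import torch.nn.functional as F

def responsive_segments(query_emb, segment_embs, top_k=3):
    """Return indices of the top_k segment feature embeddings with the
    highest cosine similarity to the query embedding."""
    # Normalize both sides so the dot product equals cosine similarity.
    q = F.normalize(query_emb, dim=-1)     # (dim,)
    s = F.normalize(segment_embs, dim=-1)  # (num_segments, dim)
    scores = s @ q                         # (num_segments,)
    return torch.topk(scores, k=min(top_k, s.shape[0])).indices

# Usage: feature embeddings from a first and a second video can be pooled
# and searched together, so the number of videos or segments is unbounded.
video_1 = torch.randn(20, 256)  # feature embeddings for 20 segments
video_2 = torch.randn(15, 256)  # feature embeddings of a second video
all_segments = torch.cat([video_1, video_2], dim=0)
query = torch.randn(256)
print(responsive_segments(query, all_segments))
```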
With regards to claims 17 and 20, these are method claims for the corresponding apparatus claims 10 and 13. These two sets of claims are related as method and apparatus of using the same, with each claimed system element's function corresponding to the claimed method step. Accordingly, claims 17 and 20 are similarly rejected under the same rationale as applied above with respect to apparatus claims 10 and 13.
Allowable Subject Matter
6. Claims 3-5, 9 and 16 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The prior art of record, alone or in combination, does not teach or suggest the invention as recited in these claims. A more detailed statement of reasons for allowance will be provided if and when the application proceeds to allowance.
Conclusion
7. The following prior art, made of record but not relied upon, is considered pertinent to Applicant's disclosure: Habibian (U.S. Patent Application Publication No. 2017/0083623 A1) and Ben-Ari (U.S. Patent Application Publication No. 2022/0318555 A1). These references are also listed on the PTO-892 form attached to this Office action.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. If you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). In case you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NEERAJ SHARMA, whose contact information is given below. The examiner can normally be reached Monday to Friday, 8 am to 5 pm. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre Louis-Desir, can be reached at 571-272-7799 (Direct Phone). The fax number for the organization where this application or proceeding is assigned is 571-273-8300.
/NEERAJ SHARMA/
Primary Examiner, Art Unit 2659
571-270-5487 (Direct Phone)
571-270-6487 (Direct Fax)
neeraj.sharma@uspto.gov (Direct Email)