Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-4, 7-11, and 14-18 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhang (CN 115359398 A), hereinafter Zhang.
Regarding claim 1, Zhang teaches A processing method for multimodal data, comprising: obtaining data to be processed of an original modality; (Abstract see "The invention provides a voice and video positioning model and a construction method, device and application thereof." Para. 10 see "Acquiring at least one audio-video file, marking the audio information and video information of the audio-video file." Para. 11 see "the voice and video positioning model includes… parallel video encoders and audio encoders." Para. 19 see "an electronic device, including a memory and a processor, the memory stores a computer program, and the processor is configured to run the computer program to execute a voice and video positioning model construction method or a voice and video positioning method."); and determining result data of a target modality corresponding to the data to be processed by processing the data to be processed with a target processing model; (Para. 12 see "The video feature vector and the audio feature vector are semantically aggregated in the semantic aggregation module to obtain a three-dimensional 2D time feature map, and the three-dimensional 2D time feature map is sent to an audio and video positioning predictor to obtain an audio and video positioning result."); wherein the target processing model comprises a multimodal submodel, and the pre-training task of the multimodal submodel includes a task of locating local data that matches second modal data from first modal data; (Para. 11 see "the voice and video positioning model includes… parallel video encoders and audio encoders." Para. 22 see "training on weakly supervised data sets through the MIL process to achieve voice and video positioning based on weak supervision."); wherein when the first modal data belongs to the original modality, the second modal data belongs to the target modality; (Para. 40 see "2D-TAN is used to locate video content segments. For example, video segment positioning with natural language description and temporal action positioning in video. The former needs to locate the start and end time points of the video clip described by the description sentence given by the user; the latter needs to detect the action clip category and locate the time points in a given long video where the action starts and ends."); and when the first modal data belongs to the target modality, the second modal data belongs to the original modality. (Para. 40 see "2D-TAN is used to locate video content segments. For example, video segment positioning with natural language description and temporal action positioning in video. The former needs to locate the start and end time points of the video clip described by the description sentence given by the user; the latter needs to detect the action clip category and locate the time points in a given long video where the action starts and ends.").
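Examiner note: for illustration of the cited mechanism only, and not as part of Zhang's disclosure, the "three-dimensional 2D time feature map" of Para. 67 can be sketched as a map indexed by candidate segment start and end times. All identifiers below are hypothetical.

```python
import numpy as np

def build_2d_temporal_map(clip_feats: np.ndarray) -> np.ndarray:
    """Build a (T, T, D) map whose cell (i, j) holds the mean-pooled
    feature of the candidate segment spanning clips i..j inclusive.

    The first two dimensions index the candidate segment's start and
    end times; the third dimension holds the segment feature, as the
    quoted Para. 67 describes.  Cells with i > j are invalid candidate
    segments and are left as zeros."""
    T, D = clip_feats.shape
    fmap = np.zeros((T, T, D))
    for i in range(T):
        for j in range(i, T):
            fmap[i, j] = clip_feats[i:j + 1].mean(axis=0)
    return fmap

feats = np.arange(6, dtype=float).reshape(3, 2)  # 3 clips, 2-dim features each
fmap = build_2d_temporal_map(feats)
print(fmap.shape)  # (3, 3, 2)
```

A positioning predictor would then score each valid (start, end) cell of this map against the query feature to select a segment.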
Regarding claim 2, Zhang teaches The method of claim 1, wherein the pre-training process of the multimodal submodel comprises: constructing a first fusion feature according to a first feature of each first modal segment data in the first modal data; (Para. 46 see "The video feature vector and the audio feature vector are semantically aggregated in the semantic aggregation module to obtain a semantic aggregation feature vector."); encoding the first fusion feature with a second feature of the second modal data to obtain an encoding result; (Para. 67 see "the multimodal feature vector is passed into the time feature map module to obtain a three-dimensional 2D time feature map F, and the three-dimensional 2D time feature map includes a plurality of multi-modal feature vectors, the first two dimensions of the three-dimensional 2D time feature map are the time indexes corresponding to the start time and end time of the segment, the third dimension is the dimension corresponding to the segment feature."); predicting target segment data that matches the second modal data from each of the first modal segment data according to the encoding result; (Para. 72 see "obtain the predicted score K of the positioning feature vector. According to the prediction score of the positioning feature vector, the audio and video positioning result is obtained."); and pre-training the multimodal submodel according to the target segment data and label data corresponding to the second modal data. (Paras. 76-78 see "Further, the training process is to train on a weakly supervised data set through the MIL process. Specifically, for each pair of matched audio (S)-video (V) files in the training samples, it is randomly reconstructed with another pair of audio (S')-video (V') files to obtain a pair of negative samples (S, V') and (S', V), the positive sample pair is (S, V), and the positioning score is calculated for two negative sample pairs and one positive sample pair. Further, the loss of the speech and video positioning model is calculated by linearly combining the binary cross-entropy loss function and the diversity loss function.").
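Examiner note: the MIL training procedure quoted from Paras. 76-78 (randomly re-pairing matched audio-video samples into mismatched negatives, then scoring positive and negative pairs with a binary cross-entropy loss) can be sketched as follows. This is for illustration only and not part of Zhang's disclosure; all names are hypothetical, and the diversity loss term is omitted.

```python
import math
import random

def make_mil_pairs(pairs):
    """From matched (audio, video) training pairs, build one positive
    and two mismatched negative pairs per sample: (S, V) labeled 1,
    and (S, V') and (S', V) labeled 0, per the quoted Paras. 76-78."""
    out = []
    for idx, (s, v) in enumerate(pairs):
        # Re-pair with a different training sample to form negatives.
        j = random.choice([k for k in range(len(pairs)) if k != idx])
        s2, v2 = pairs[j]
        out.append(((s, v), 1))   # positive: matched pair (S, V)
        out.append(((s, v2), 0))  # negative: own audio, other video (S, V')
        out.append(((s2, v), 0))  # negative: other audio, own video (S', V)
    return out

def bce(score, label):
    """Binary cross-entropy for a single positioning score in (0, 1)."""
    eps = 1e-7
    score = min(max(score, eps), 1 - eps)
    return -(label * math.log(score) + (1 - label) * math.log(1 - score))

samples = [("S0", "V0"), ("S1", "V1")]
triplets = make_mil_pairs(samples)
print(len(triplets))  # 6: one positive and two negatives per sample
```

In training, a positioning score would be computed for each constructed pair and the BCE terms combined (linearly, with a diversity loss, per the quotation) into the total loss.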
Regarding claim 3, Zhang teaches The method of claim 2, wherein when each of the first modal segment data comprises video segment data, the first fusion feature is constructed based on at least one of: adjusting the order of each of the video segment data, and concatenating the first feature of each of the video segment data whose order has been adjusted; or sampling each of the video segment data, and concatenating the first feature of each of the sampled video segment data. (Para. 50 see "the time-series mean sampling layer performs segment-level feature extraction on the visual features according to the time series information of the video information to obtain segment video features." Para. 54 see "the QA encoding module gathers the time sequence information of each segment video feature to generate a video feature vector containing contextual semantic information.").
Regarding claim 4, Zhang teaches The method of claim 3, wherein the label data corresponding to the second modal data comprises: start and end frame position information of video segment data corresponding to the second modal data in the first modal data. (Para. 40 see "2D-TAN is used to locate video content segments. For example, video segment positioning with natural language description and temporal action positioning in video. The former needs to locate the start and end time points of the video clip described by the description sentence given by the user; the latter needs to detect the action clip category and locate the time points in a given long video where the action starts and ends." Para. 67 see "the first two dimensions of the three-dimensional 2D time feature map are the time indexes corresponding to the start time and end time of the segment.").
Regarding claim 7, Zhang teaches The method of claim 1, wherein the target processing model is applied to at least one of: a video-based text locating task, a text-based video temporal locating task, a video-based text retrieval task, a text-based video retrieval task, a video-based text generation task, a text-based video generation task, a video question-answer task, or a video parsing task. (Para. 40 see "2D-TAN is used to locate video content segments. For example, video segment positioning with natural language description and temporal action positioning in video. The former needs to locate the start and end time points of the video clip described by the description sentence given by the user; the latter needs to detect the action clip category and locate the time points in a given long video where the action starts and ends." Para. 67 see "the first two dimensions of the three-dimensional 2D time feature map are the time indexes corresponding to the start time and end time of the segment.").
Claim 8 is rejected under the same analysis as claim 1 above.
Claim 9 is rejected under the same analysis as claim 2 above.
Claim 10 is rejected under the same analysis as claim 3 above.
Claim 11 is rejected under the same analysis as claim 4 above.
Claim 14 is rejected under the same analysis as claim 7 above.
Claim 15 is rejected under the same analysis as claim 1 above.
Claim 16 is rejected under the same analysis as claim 2 above.
Claim 17 is rejected under the same analysis as claim 3 above.
Claim 18 is rejected under the same analysis as claim 4 above.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 5-6, 12-13, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (CN 115359398 A), hereinafter Zhang, in view of Lewis et al., "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension," arXiv.org, Cornell University Library, submitted 29 Oct 2019 [retrieved on 1-31-2026], retrieved from the Internet <https://arxiv.org/abs/1910.13461>, hereinafter Lewis.
Regarding claim 5, Zhang teaches The method of claim 2, wherein when each of the first modal segment data comprises text segment data, (Para. 40 see "For example, video segment positioning with natural language description and temporal action positioning in video. The former needs to locate the start and end time points of the video clip described by the description sentence given by the user." (Examiner note: The reference matches text to video segments.)).
Zhang does not teach the first fusion feature is constructed based on at least one of: adjusting the order of each of the text segment data, and concatenating the first feature of each of the text segment data; or extracting a fragment token feature of each of the text segment data and aggregating each of the segment token features.
However, Lewis teaches the first fusion feature is constructed based on at least one of: adjusting the order of each of the text segment data, and concatenating the first feature of each of the text segment data; or extracting a fragment token feature of each of the text segment data and aggregating each of the segment token features. (Pg. 3, Col. 1, Para. 2 see "Sentence Permutation A document is divided into sentences based on full stops, and these sentences are shuffled in a random order.").
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang to incorporate the teachings of Lewis to adjust the order of the text segment data. Doing so would predictably improve the text encoder’s ability to encode related text regardless of text segment order or length by forcing it to relate semantics of data instead of inherent order.
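Examiner note: the sentence-permutation corruption quoted from Lewis (Pg. 3, Col. 1, Para. 2) can be sketched as follows. This is for illustration only and not part of Lewis's disclosure; the function name is hypothetical.

```python
import random

def permute_sentences(document: str, seed: int = 0) -> str:
    """BART-style corruption per the quoted passage: split a document
    into sentences at full stops, then shuffle the sentences into a
    random order.  The uncorrupted document serves as the
    reconstruction target during pre-training."""
    sentences = [s.strip() + "." for s in document.split(".") if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return " ".join(sentences)

doc = "First point. Second point. Third point."
shuffled = permute_sentences(doc)
print(shuffled)
```

The model is then trained to reconstruct the original sentence order from the shuffled input, which is the corruption-reconstruction scheme relied upon in the rejection of claim 6 below.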
Regarding claim 6, Zhang in view of Lewis teaches The method of claim 5.
Zhang does not teach wherein the label data corresponding to the second modal data comprises: start and end character position information or segment ordering information of text segment data corresponding to the second modal data in the first modal data.
However, Lewis teaches wherein the label data corresponding to the second modal data comprises: start and end character position information or segment ordering information of text segment data corresponding to the second modal data in the first modal data. (Pg. 3, Col. 1, Para. 2 see "Sentence Permutation A document is divided into sentences based on full stops, and these sentences are shuffled in a random order." Pg. 2, Section 2.2, Para. 1 see "BART is trained by corrupting documents and then optimizing a reconstruction loss—the cross-entropy between the decoder’s output and the original document." (Examiner note: The original document is the label data, the sentences are the text segments which are shuffled. The document acts as the ground truth for segment ordering.)).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang in view of Lewis to incorporate the teachings of Lewis such that the label data comprises segment ordering information of the text segment data. Doing so would predictably improve the model's ability to understand input text and locate positional information by training the model to predict the original segment order based on semantics instead of inherent order.
Claim 12 is rejected under the same analysis as claim 5 above.
Claim 13 is rejected under the same analysis as claim 6 above.
Claim 19 is rejected under the same analysis as claim 5 above.
Claim 20 is rejected under the same analysis as claim 6 above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Gao et al. (US 20200372116 A1) discloses systems and methods for weakly supervised natural language localization using video-sentence pairs.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER J VAUGHN whose telephone number is (571) 272-5253. The examiner can normally be reached M-F 8:30-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ANDREW MOYER can be reached on (571) 272-9523. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ALEXANDER JOSEPH VAUGHN/Examiner, Art Unit 2675
/EDWARD PARK/Primary Examiner, Art Unit 2675