Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-4, 7-11, and 14-18 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhang (CN 115359398 A), hereinafter Zhang.
Regarding claim 1, Zhang teaches A processing method for multimodal data, comprising: obtaining data to be processed of an original modality; (Abstract see "The invention provides a voice and video positioning model and a construction method, device and application thereof." Para. 10 see "Acquiring at least one audio-video file, marking the audio information and video information of the audio-video file." Para. 11 see "the voice and video positioning model includes… parallel video encoders and audio encoders." Para. 19 see "an electronic device, including a memory and a processor, the memory stores a computer program, and the processor is configured to run the computer program to execute a voice and video positioning model construction method or a voice and video positioning method."); and determining result data of a target modality corresponding to the data to be processed by processing the data to be processed with a target processing model; (Para. 12 see "The video feature vector and the audio feature vector are semantically aggregated in the semantic aggregation module to obtain a three-dimensional 2D time feature map, and the three-dimensional 2D time feature map is sent to an audio and video positioning predictor to obtain an audio and video positioning result."); wherein the target processing model comprises a multimodal submodel, and the pre-training task of the multimodal submodel includes a task of locating local data that matches second modal data from first modal data; (Para. 11 see "the voice and video positioning model includes… parallel video encoders and audio encoders." Para. 22 see "training on weakly supervised data sets through the MIL process to achieve voice and video positioning based on weak supervision."); wherein when the first modal data belongs to the original modality, the second modal data belongs to the target modality; (Para. 40 see "2D-TAN is used to locate video content segments. For example, video segment positioning with natural language description and temporal action positioning in video. The former needs to locate the start and end time points of the video clip described by the description sentence given by the user; the latter needs to detect the action clip category and locate the time points in a given long video where the action starts and ends."); and when the first modal data belongs to the target modality, the second modal data belongs to the original modality. (Para. 40 see "2D-TAN is used to locate video content segments. For example, video segment positioning with natural language description and temporal action positioning in video. The former needs to locate the start and end time points of the video clip described by the description sentence given by the user; the latter needs to detect the action clip category and locate the time points in a given long video where the action starts and ends.").
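Examiner note: for illustration of the cited mechanism only, and not as part of Zhang's disclosure, the "three-dimensional 2D time feature map" of Para. 67 can be sketched as a map indexed by candidate segment start and end times. All identifiers below are hypothetical.

```python
import numpy as np

def build_2d_temporal_map(clip_feats: np.ndarray) -> np.ndarray:
    """Build a (T, T, D) map whose cell (i, j) holds the mean-pooled
    feature of the candidate segment spanning clips i..j inclusive.

    The first two dimensions index the candidate segment's start and
    end times; the third dimension holds the segment feature, as the
    quoted Para. 67 describes.  Cells with i > j are invalid candidate
    segments and are left as zeros."""
    T, D = clip_feats.shape
    fmap = np.zeros((T, T, D))
    for i in range(T):
        for j in range(i, T):
            fmap[i, j] = clip_feats[i:j + 1].mean(axis=0)
    return fmap

feats = np.arange(6, dtype=float).reshape(3, 2)  # 3 clips, 2-dim features each
fmap = build_2d_temporal_map(feats)
print(fmap.shape)  # (3, 3, 2)
```

A positioning predictor would then score each valid (start, end) cell of this map against the query feature to select a segment.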
Regarding claim 2, Zhang teaches The method of claim 1, wherein the pre-training process of the multimodal submodel comprises: constructing a first fusion feature according to a first feature of each first modal segment data in the first modal data; (Para. 46 see "The video feature vector and the audio feature vector are semantically aggregated in the semantic aggregation module to obtain a semantic aggregation feature vector."); encoding the first fusion feature with a second feature of the second modal data to obtain an encoding result; (Para. 67 see "the multimodal feature vector is passed into the time feature map module to obtain a three-dimensional 2D time feature map F, and the three-dimensional 2D time feature map includes a plurality of multi-modal feature vectors, the first two dimensions of the three-dimensional 2D time feature map are the time indexes corresponding to the start time and end time of the segment, the third dimension is the dimension corresponding to the segment feature."); predicting target segment data that matches the second modal data from each of the first modal segment data according to the encoding result; (Para. 72 see "obtain the predicted score K of the positioning feature vector. According to the prediction score of the positioning feature vector, the audio and video positioning result is obtained."); and pre-training the multimodal submodel according to the target segment data and label data corresponding to the second modal data. (Paras. 76-78 see "Further, the training process is to train on a weakly supervised data set through the MIL process. Specifically, for each pair of matched audio (S)-video (V) files in the training samples, it is randomly reconstructed with another pair of audio (S')-video (V') files to obtain a pair of negative samples (S, V') and (S', V), the positive sample pair is (S, V), and the positioning score is calculated for two negative sample pairs and one positive sample pair. Further, the loss of the speech and video positioning model is calculated by linearly combining the binary cross-entropy loss function and the diversity loss function.").
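Examiner note: the MIL training procedure quoted from Paras. 76-78 (randomly re-pairing matched audio-video samples into mismatched negatives, then scoring positive and negative pairs with a binary cross-entropy loss) can be sketched as follows. This is for illustration only and not part of Zhang's disclosure; all names are hypothetical, and the diversity loss term is omitted.

```python
import math
import random

def make_mil_pairs(pairs):
    """From matched (audio, video) training pairs, build one positive
    and two mismatched negative pairs per sample: (S, V) labeled 1,
    and (S, V') and (S', V) labeled 0, per the quoted Paras. 76-78."""
    out = []
    for idx, (s, v) in enumerate(pairs):
        # Re-pair with a different training sample to form negatives.
        j = random.choice([k for k in range(len(pairs)) if k != idx])
        s2, v2 = pairs[j]
        out.append(((s, v), 1))   # positive: matched pair (S, V)
        out.append(((s, v2), 0))  # negative: own audio, other video (S, V')
        out.append(((s2, v), 0))  # negative: other audio, own video (S', V)
    return out

def bce(score, label):
    """Binary cross-entropy for a single positioning score in (0, 1)."""
    eps = 1e-7
    score = min(max(score, eps), 1 - eps)
    return -(label * math.log(score) + (1 - label) * math.log(1 - score))

samples = [("S0", "V0"), ("S1", "V1")]
triplets = make_mil_pairs(samples)
print(len(triplets))  # 6: one positive and two negatives per sample
```

In training, a positioning score would be computed for each constructed pair and the BCE terms combined (linearly, with a diversity loss, per the quotation) into the total loss.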
Regarding claim 3, Zhang teaches The method of claim 2, wherein when each of the first modal segment data comprises video segment data, the first fusion feature is constructed based on at least one of: adjusting the order of each of the video segment data, and concatenating the first feature of each of the video segment data whose order has been adjusted; or sampling each of the video segment data, and concatenating the first feature of each of the sampled video segment data. (Para. 50 see "the time-series mean sampling layer performs segment-level feature extraction on the visual features according to the time series information of the video information to obtain segment video features." Para. 54 see "the QA encoding module gathers the time sequence information of each segment video feature to generate a video feature vector containing contextual semantic information.").
Regarding claim 4, Zhang teaches The method of claim 3, wherein the label data corresponding to the second modal data comprises: start and end frame position information of video segment data corresponding to the second modal data in the first modal data. (Para. 40 see "2D-TAN is used to locate video content segments. For example, video segment positioning with natural language description and temporal action positioning in video. The former needs to locate the start and end time points of the video clip described by the description sentence given by the user; the latter needs to detect the action clip category and locate the time points in a given long video where the action starts and ends." Para. 67 see "the first two dimensions of the three-dimensional 2D time feature map are the time indexes corresponding to the start time and end time of the segment.").
Regarding claim 7, Zhang teaches The method of claim 1, wherein the target processing model is applied to at least one of: a video-based text locating task, a text-based video temporal locating task, a video-based text retrieval task, a text-based video retrieval task, a video-based text generation task, a text-based video generation task, a video question-answer task, or a video parsing task. (Para. 40 see "2D-TAN is used to locate video content segments. For example, video segment positioning with natural language description and temporal action positioning in video. The former needs to locate the start and end time points of the video clip described by the description sentence given by the user; the latter needs to detect the action clip category and locate the time points in a given long video where the action starts and ends." Para. 67 see "the first two dimensions of the three-dimensional 2D time feature map are the time indexes corresponding to the start time and end time of the segment.").
Claim 8 is rejected under the same analysis as claim 1 above.
Claim 9 is rejected under the same analysis as claim 2 above.
Claim 10 is rejected under the same analysis as claim 3 above.
Claim 11 is rejected under the same analysis as claim 4 above.
Claim 14 is rejected under the same analysis as claim 7 above.
Claim 15 is rejected under the same analysis as claim 1 above.
Claim 16 is rejected under the same analysis as claim 2 above.
Claim 17 is rejected under the same analysis as claim 3 above.
Claim 18 is rejected under the same analysis as claim 4 above.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 5-6, 12-13, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang (CN 115359398 A), hereinafter Zhang, in view of Lewis et al., "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension," arXiv.org, Cornell University Library, submitted 29 Oct 2019 [retrieved on 1-31-2026], retrieved from the Internet <https://arxiv.org/abs/1910.13461>, hereinafter Lewis.
Regarding claim 5, Zhang teaches The method of claim 2, wherein when each of the first modal segment data comprises text segment data, (Para. 40 see "For example, video segment positioning with natural language description and temporal action positioning in video. The former needs to locate the start and end time points of the video clip described by the description sentence given by the user." (Examiner note: The reference matches text to video segments.)).
Zhang does not teach the first fusion feature is constructed based on at least one of: adjusting the order of each of the text segment data, and concatenating the first feature of each of the text segment data; or extracting a fragment token feature of each of the text segment data and aggregating each of the segment token features.
However, Lewis teaches the first fusion feature is constructed based on at least one of: adjusting the order of each of the text segment data, and concatenating the first feature of each of the text segment data; or extracting a fragment token feature of each of the text segment data and aggregating each of the segment token features. (Pg. 3, Col. 1, Para. 2 see "Sentence Permutation A document is divided into sentences based on full stops, and these sentences are shuffled in a random order.").
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang to incorporate the teachings of Lewis to adjust the order of the text segment data. Doing so would predictably improve the text encoder’s ability to encode related text regardless of text segment order or length by forcing it to relate semantics of data instead of inherent order.
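Examiner note: the sentence-permutation corruption quoted from Lewis (Pg. 3, Col. 1, Para. 2) can be sketched as follows. This is for illustration only and not part of Lewis's disclosure; the function name is hypothetical.

```python
import random

def permute_sentences(document: str, seed: int = 0) -> str:
    """BART-style corruption per the quoted passage: split a document
    into sentences at full stops, then shuffle the sentences into a
    random order.  The uncorrupted document serves as the
    reconstruction target during pre-training."""
    sentences = [s.strip() + "." for s in document.split(".") if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return " ".join(sentences)

doc = "First point. Second point. Third point."
shuffled = permute_sentences(doc)
print(shuffled)
```

The model is then trained to reconstruct the original sentence order from the shuffled input, which is the corruption-reconstruction scheme relied upon in the rejection of claim 6 below.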
Regarding claim 6, Zhang in view of Lewis teaches The method of claim 5.
Zhang does not teach wherein the label data corresponding to the second modal data comprises: start and end character position information or segment ordering information of text segment data corresponding to the second modal data in the first modal data.
However, Lewis teaches wherein the label data corresponding to the second modal data comprises: start and end character position information or segment ordering information of text segment data corresponding to the second modal data in the first modal data. (Pg. 3, Col. 1, Para. 2 see "Sentence Permutation A document is divided into sentences based on full stops, and these sentences are shuffled in a random order." Pg. 2, Section 2.2, Para. 1 see "BART is trained by corrupting documents and then optimizing a reconstruction loss—the cross-entropy between the decoder’s output and the original document." (Examiner note: The original document is the label data, the sentences are the text segments which are shuffled. The document acts as the ground truth for segment ordering.)).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Zhang in view of Lewis to incorporate the teachings of Lewis such that the label data comprises segment ordering information of the text segment data. Doing so would predictably improve the model's ability to understand input text and locate positional information by training the model to predict the original segment order based on semantics instead of inherent order.
Claim 12 is rejected under the same analysis as claim 5 above.
Claim 13 is rejected under the same analysis as claim 6 above.
Claim 19 is rejected under the same analysis as claim 5 above.
Claim 20 is rejected under the same analysis as claim 6 above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Gao et al. (US 20200372116 A1) discloses systems and methods for weakly supervised natural language localization using video-sentence pairs.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER J VAUGHN whose telephone number is (571) 272-5253. The examiner can normally be reached M-F 8:30-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ANDREW MOYER can be reached on (571) 272-9523. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ALEXANDER JOSEPH VAUGHN/Examiner, Art Unit 2675
/EDWARD PARK/Primary Examiner, Art Unit 2675