Last updated: May 29, 2026

Application No. 18/692,000

MULTI-MODAL PRE-TRAINING METHOD AND MULTI-MODAL PRE-TRAINING APPARATUS

Non-Final OA §102§103

Filed

Mar 14, 2024

Priority

Sep 15, 2021 — CN 202111078728.2 +1 more

Examiner

MAIDEN, MICHAEL KIM

Art Unit

2665

Tech Center

2600 — Communications

Assignee

BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY CO., LTD.

OA Round

1 (Non-Final)

Interview Optional

— +10.0% interview lift. Interview lift (+10.0%) is below the 15.0% threshold. A written response is recommended.

Based on 77 resolved cases, 2023–2026

Examiner Intelligence

MAIDEN, MICHAEL KIM View full profile →

Grants 92% — above average

Career Allowance Rate

71 granted / 77 resolved

+30.2% vs TC avg

Moderate +10% lift

Without

With

+10.0%

Interview Lift

resolved cases with interview

Typical timeline

2y 8m

Avg Prosecution

10 currently pending

Career history

Total Applications

across all art units

Statute-Specific Performance

§101

5.6%

-34.4% vs TC avg

§103

86.0%

+46.0% vs TC avg

§102

5.6%

-34.4% vs TC avg

§112

0.9%

-39.1% vs TC avg

Black line = Tech Center average estimate • Based on career data from 77 resolved cases

Office Action

§102 §103

DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Priority
	Acknowledgement is made of the application’s status as a continuation of CN202111078728

Information Disclosure Statement
	The information disclosure statement (IDS) were submitted on 06/12/2024, 11/06/2024, and 02/25/2026. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner. 

Claim Status
	Claim(s) 1 is rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li (Li, L., Chen, Y. C., Cheng, Y., Gan, Z., Yu, L., & Liu, J. (2020, November). Hero: Hierarchical encoder for video+ language omni-representation pre-training. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 2046-2065)).
Claim(s) 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Li (Li, L., Chen, Y. C., Cheng, Y., Gan, Z., Yu, L., & Liu, J. (2020, November). Hero: Hierarchical encoder for video+ language omni-representation pre-training. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 2046-2065)) in view of Yu (US 20220019744 A1).
Claims 2-10 and 14-21 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


Claim(s) 1 is rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li (Li, L., Chen, Y. C., Cheng, Y., Gan, Z., Yu, L., & Liu, J. (2020, November). Hero: Hierarchical encoder for video+ language omni-representation pre-training. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 2046-2065)). 

Regarding claim 1, discloses A multi-modal pre-training method, comprising: (Section 1 “we present a new video-and-language large-scale pre-training frame work- HERO…HERO encodes multimodal inputs”)
sampling a video in a video-text pair to obtain a first video frame sequence; (Section 3.1 “takes the frames of a video clip and the textual tokens of subtitle sentences as inputs.”)
performing word segmentation processing on a text in the video-text pair to obtain a first word segmentation sequence; (Section 3.2.1 “The inputs for MLM include: (i) sub-word tokens from the i-th subtitle sentence w-si.” Linjie disclose processing text to function as textual tokens of subtitle sentences )
masking on the first video frame sequence to obtain a second video frame sequence; (Section 3.2.2 “Similar to MLM, we also sample frames and mask their visual  features with a probability of 15%”)
masking on the first word segmentation sequence to obtain a second word segmentation sequence; (Section 3.2.1 “In MLM, we randomly mask out input words with a probability of 15%, and replace the masked tokens”.  Figure 1 discloses an example of a word segmentation sequence where a word has been substituted with a discrete label)
encoding the first video frame sequence to obtain a first video feature, (Section 3.2.3 “The inputs to VSM are:… the whole video clip v” Section 1 discloses Hero encodes the multimodal inputs by using a Cross-modal Transformer. Figure discloses the frame sequence being passed to the Cross-modal Transformer) and encoding the first word segmentation sequence to obtain a first word segmentation feature; (Section 3.2.2 and Figure 1 disclose all the subtitle sentence are passed to the Cross-modal Transformer )
encoding the second video frame sequence to obtain a second video feature, (Section 3.2.2 and Figure 1 disclose passing the masked frame sequence to the Cross-modal Transformer) and encoding the second word segmentation sequence to obtain a second word segmentation feature; (Section 3.2.1 and Figure 1 disclose inputting the masked text into the Cross-modal Transformer)
determining a pre-trained target function by using the first video feature, the first word segmentation feature, the second video feature and the second word segmentation feature; and (Section 3.2.3 discloses the final loss Lvsm which makes use of negative and positive video-sentence pairs)
performing multi-modal pre-training by using the pre-trained target function. (Section 3.2.3 discloses the use of final loss Lvsm which makes use of negative and positive video-sentence pairs)

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Li (Li, L., Chen, Y. C., Cheng, Y., Gan, Z., Yu, L., & Liu, J. (2020, November). Hero: Hierarchical encoder for video+ language omni-representation pre-training. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 2046-2065)) in view of Yu (US 20220019744 A1).

Regarding claim 12, Li discloses A multi-modal pre-training apparatus, comprising: (Li: Section 1 “we present a new video-and-language large-scale pre-training frame work- HERO…HERO encodes multimodal inputs”)
which is configured to execute a multi-modal pre- training method comprising: (Li: Section 1 “we present a new video-and-language large-scale pre-training frame work- HERO…HERO encodes multimodal inputs”)
sampling a video in a video-text pair to obtain a first video frame sequence; (Li: Section 3.1 “takes the frames of a video clip and the textual tokens of subtitle sentences as inputs.”)
performing word segmentation processing on a text in the video-text pair to obtain a first word segmentation sequence; (Li: Section 3.2.1 “The inputs for MLM include: (i) sub-word tokens from the i-th subtitle sentence w-si.” Linjie disclose processing text to function as textual tokens of subtitle sentences )
masking on the first video frame sequence to obtain a second video frame sequence; (Li: Section 3.2.2 “Similar to MLM, we also sample frames and mask their visual  features with a probability of 15%”)
masking on the first word segmentation sequence to obtain a second word segmentation sequence; (Li: Section 3.2.1 “In MLM, we randomly mask out input words with a probability of 15%, and replace the masked tokens”.  Figure 1 discloses an example of a word segmentation sequence where a word has been substituted with a discrete label)
encoding the first video frame sequence to obtain a first video feature, (Li: Section 3.2.3 “The inputs to VSM are:… the whole video clip v” Section 1 discloses Hero encodes the multimodal inputs by using a Cross-modal Transformer. Figure discloses the frame sequence being passed to the Cross-modal Transformer) and encoding the first word segmentation sequence to obtain a first word segmentation feature; (Li: Section 3.2.2 and Figure 1 disclose all the subtitle sentence are passed to the Cross-modal Transformer )
encoding the second video frame sequence to obtain a second video feature, (Li: Section 3.2.2 and Figure 1 disclose passing the masked frame sequence to the Cross-modal Transformer) and encoding the second word segmentation sequence to obtain a second word segmentation feature; (Li: Section 3.2.1 and Figure 1 disclose inputting the masked text into the Cross-modal Transformer)
determining a pre-trained target function by using the first video feature, the first word segmentation feature, the second video feature and the second word segmentation feature; and (Li: Section 3.2.3 discloses the final loss Lvsm which makes use of negative and positive video-sentence pairs)
performing multi-modal pre-training by using the pre-trained target function. (Li: Section 3.2.3 discloses the use of final loss Lvsm which makes use of negative and positive video-sentence pairs)
Li fails to specifically disclose a memory; and 
a processor coupled to the memory,
In related art, Yu discloses a memory; and (Yu: ¶7 “an electronic device includes: at least one processor; and a memory in communication connection with the at least one processor;”)
a processor coupled to the memory, (Yu: ¶7 “an electronic device includes: at least one processor; and a memory in communication connection with the at least one processor;”)
Therefore, it would have been obvious to for one of ordinary skill in the art before the effective filing date to incorporate the memory and processor disclosed by Yu into the method of multi-modal pre-training disclosed by Li to store and manipulate the data housed within the computing system.

Regarding claim 13, Li discloses performs a multi-modal pre-training method comprising: (Li: Section 1 “we present a new video-and-language large-scale pre-training frame work- HERO…HERO encodes multimodal inputs”)
sampling a video in a video-text pair to obtain a first video frame sequence; (Li: Section 3.1 “takes the frames of a video clip and the textual tokens of subtitle sentences as inputs.”)
performing word segmentation processing on a text in the video-text pair to obtain a first word segmentation sequence; (Li: Section 3.2.1 “The inputs for MLM include: (i) sub-word tokens from the i-th subtitle sentence w-si.” Linjie disclose processing text to function as textual tokens of subtitle sentences )
masking on the first video frame sequence to obtain a second video frame sequence; (Li: Section 3.2.2 “Similar to MLM, we also sample frames and mask their visual  features with a probability of 15%”)
masking on the first word segmentation sequence to obtain a second word segmentation sequence; (Li: Section 3.2.1 “In MLM, we randomly mask out input words with a probability of 15%, and replace the masked tokens”.  Figure 1 discloses an example of a word segmentation sequence where a word has been substituted with a discrete label)
encoding the first video frame sequence to obtain a first video feature, (Li: Section 3.2.3 “The inputs to VSM are:… the whole video clip v” Section 1 discloses Hero encodes the multimodal inputs by using a Cross-modal Transformer. Figure discloses the frame sequence being passed to the Cross-modal Transformer) and encoding the first word segmentation sequence to obtain a first word segmentation feature; (Li: Section 3.2.2 and Figure 1 disclose all the subtitle sentence are passed to the Cross-modal Transformer )
encoding the second video frame sequence to obtain a second video feature, (Li: Section 3.2.2 and Figure 1 disclose passing the masked frame sequence to the Cross-modal Transformer) and encoding the second word segmentation sequence to obtain a second word segmentation feature; (Li: Section 3.2.1 and Figure 1 disclose inputting the masked text into the Cross-modal Transformer)
determining a pre-trained target function by using the first video feature, the first word segmentation feature, the second video feature and the second word segmentation feature; and (Li: Section 3.2.3 discloses the final loss Lvsm which makes use of negative and positive video-sentence pairs)
performing multi-modal pre-training by using the pre-trained target function. (Li: Section 3.2.3 discloses the use of final loss Lvsm which makes use of negative and positive video-sentence pairs)
Li fails to specifically disclose A non-transitory computer-readable storage medium, which stores a computer program that, when executed by a processor,
In related art, Yu discloses A non-transitory computer-readable storage medium, which stores a computer program that, when executed by a processor, (Yu: ¶8 “a non-transitory computer-readable storage medium includes instructions, which, when executed by a computer, cause the computer to carry out the method as described above.”)
Therefore, it would have been obvious to for one of ordinary skill in the art before the effective filing date to incorporate the non-transitory computer readable medium for executing multi-modal pre-training disclosed by Yu into the method of multi-modal pre-training disclosed by Li to store and manipulate the data housed within the computing system.

Allowable Subject Matter
Claims 2-10 and 14-21 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: 
	Li ‘767 (US 20240054767 A1) discloses a multi-modal model training method, apparatus and device, and a storage medium. The method includes the following steps: obtaining a training sample set, and training a multi-modal model for a plurality of rounds by successively using each of training sample pair in the training sample set: during use of any one of the training sample pairs for training, obtaining an image feature of a target visual sample firstly, and then determining whether back translation needs to be performed on a target original text; when back translation needs to be performed on the target original text, performing corresponding back translation to obtain a target back-translated text, and obtaining a text feature of the target back-translated text; and training the multi-modal model based on the image feature and the text feature.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL KIM MAIDEN whose telephone number is (703)756-1264. The examiner can normally be reached Monday - Friday 7:30 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen Koziol can be reached at 4089187630. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/MICHAEL KIM MAIDEN/Examiner, Art Unit 2665                                                                                                                                                                                                        
/Stephen R Koziol/Supervisory Patent Examiner, Art Unit 2665

Read full office action

Prosecution Timeline

Mar 14, 2024

Application Filed

Mar 10, 2026

Non-Final Rejection mailed — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/363,433

Patent 12632973

SYSTEM AND METHOD FOR TARGETING FROM 3D DIGITAL SURFACE MODELS AND DIGITAL POINT POSITIONING DATABASE CONTROLLED STEREO IMAGERY

2y 9m to grant Granted May 19, 2026

17/801,573

Patent 12616184

REAL-TIME MONITORING AND EARLY DETECTION SYSTEM FOR INSECT ACTIVITY IN GRAINS DURING STORAGE

3y 8m to grant Granted May 05, 2026

18/572,118

Patent 12620171

REPRESENTATIONS OF FOOT FEATURES

2y 4m to grant Granted May 05, 2026

18/116,686

Patent 12614369

Machine-Learning Models Trained to Modify Image Illumination Without Ground-Truth Images

3y 1m to grant Granted Apr 28, 2026

18/190,510

Patent 12608798

IMAGE PROCESSING DEVICE, IMAGE PROCESSING SYSTEM, IMAGE DISPLAY METHOD, AND IMAGE PROCESSING PROGRAM

3y 0m to grant Granted Apr 21, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.

Typically takes 5-10 seconds — AI-generated, attorney review required before filing

Prosecution Projections

1-2

Expected OA Rounds

92%

Grant Probability

99%

With Interview (+10.0%)

2y 8m (~6m remaining)

Median Time to Grant

Low

PTA Risk

Based on 77 resolved cases by this examiner. Grant probability derived from career allowance rate.