Last updated: May 29, 2026

Application No. 18/327,192

MULTIMODAL DEEP LEARNING WITH BOOSTED TREES

Non-Final OA §103

Filed

Jun 01, 2023

Examiner

LEY, SALLY THI

Art Unit

2147

Tech Center

2100 — Computer Architecture & Software

Assignee

International Business Machines Corporation

OA Round

1 (Non-Final)

This examiner grants 19% of cases after interview

— +33.3% interview lift. A telephonic interview to clarify the technical implementation could significantly improve the outcome.

Based on 36 resolved cases, 2023–2026

Examiner Intelligence

LEY, SALLY THI View full profile →

Grants only 19% of cases

Career Allowance Rate

7 granted / 36 resolved

-35.6% vs TC avg

Strong +33% interview lift

Without

With

+33.3%

Interview Lift

resolved cases with interview

Typical timeline

4y 8m

Avg Prosecution

17 currently pending

Career history

Total Applications

across all art units

Statute-Specific Performance

§101

10.3%

-29.7% vs TC avg

§103

83.2%

+43.2% vs TC avg

§102

3.8%

-36.2% vs TC avg

§112

2.7%

-37.3% vs TC avg

Black line = Tech Center average estimate • Based on career data from 36 resolved cases

Office Action

§103

DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Status of Claims
	This Office Action is in response to the communication filed on 01 Jun 2023. 
	Claims 1-20 are being considered on the merits.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01 Jun 2023 has been considered. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, initialed and dated copies of Applicant's IDS form 1499 is attached to the instant Office action. 
Drawings
	The drawings filed on 01 Jun 2023 are accepted.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang, et. al. (arXiv:2204.10496v2 [cs.CV]; hereinafter, “Wang”) in view of Wu, et. al. (US11605228; hereinafter, “Wu”) and further in view of Ke, et al. (DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '19). Association for Computing Machinery, New York, NY, USA, 384–394. https://doi.org/10.1145/3292500.3330858; hereinafter, “Ke”)

Regarding claims 1, 8, and 15, Wang, as modified teaches: 
A computer-implemented method for a multimodal neural network comprising: (Wu, col. 11:45-47: “Some of the above embodiments, as applicable, may be implemented using a variety of different information processing systems.”)
A system having a memory, (Wu, col. 12:28-31: “All or some of the software described herein may be received elements of system 600, for example, from computer readable media such as memory 670 or other media on other computer systems.”) computer readable instructions, and one or more processors for executing the computer readable instructions, the computer readable instructions controlling the one or more processors to perform operations comprising: (Wu, col. 12:46-48: “A computer system processes information according to a program and produces resultant output information via I/O devices. A program is a list of instructions such as a particular application program and/or an operating system.”)
A computer program product comprising a computer readable storage medium (Wu, 12:28-31: “All or some of the software described herein may be received elements of system 600, for example, from computer readable media such as memory 670 or other media on other computer systems.”) having program instructions embodied therewith, the program instructions executable by one or more processors to cause the one or more processors to perform operations comprising: (Wu, 12:46-48: “A computer system processes information according to a program and produces resultant output information via I/O devices. A program is a list of instructions such as a particular application program and/or an operating system.”)
training a plurality of unimodal models, (Wu, col 1:24-28, col. 7:52-56 and Fig 5D: “In such an approach, a single or multiple neural network is trained to process sensory data for various tasks such as detecting and classifying objects and segmenting pixels, voxels, or points reported by sensors into individual groups with respect to classification types or identities.” “FIG. 5D illustrates a hybrid fusion architecture with a subset of sensors 550 having the same modality (e.g., camera 1 and camera 2) having overlapping fields of view (FOV1 and FOV2), and a sensor having a different modality (e.g., LiDAR) and field of view. “)
wherein each unimodal teacher model of the plurality of unimodal teacher models is trained using training data from a unique modality of a plurality of modalities; (Wang pg. 3: “In these frameworks, vision, and text streams are all modality-specific encoders and cross-modal fusion layers do not exist but shallow connections via cosine distance.”) 
for each of the unimodal teacher models, training a respective student encoder of a plurality of student encoders using a knowledge distillation whereby one or more features for each respective student encoder are forced to a same feature of the respective unimodal teacher model; (Wang, pg. 5: “In conventional knowledge distillation [32,60,55,33], a larger model serves as the teacher to a smaller student model…A wide range of works have explored to supervise the student model by mimicking different components of the teacher, e.g. the distribution of self-attention, intermediate representations of transformer blocks, last layer features, etc. to increase performances [59,50,8]” Examiner notes Wang teaches unimodal models and encoders at page 3, as set forth above). 
feeding a concatenation of outputs from the plurality of student encoders to train a fusion neural network of the multimodal neural network; (Wang, pgs. 6 and 8: “Fig.2. Structure diagram of Multimodal Distillation (MD) and Multimodal Adaptive Distillation (MAD). MAD is further improved on top of MD and MD is equivalent to MAD in terms of structure when both Confidence Weighting and Token Selection are removed.”  “Thus during the Adaptive Finetuning, multimodal distilled knowledge can help align the vision and language modality features beforehand.” Examiner notes that Wang teaches using a concatenation of outputs from student model and later finetuning a multimodal model). 
receiving data from the plurality of modalities; and (Wang, pg. 4: “Currently, there is a lack of a more general framework for fusing large-scale pretrained vision and text models into a pretrained VL structure for improving downstream VL tasks while controlling computational complexity. Existing fusion methods, such as shallow contrastive frameworks, that avoid impacting computation complexity mostly focus on traditional VL tasks such as object classification, image captioning, etc. To our best knowledge, no prior VLP works from this tier have studied the impact on generalization capability by assessing performance on highly-semantic tasks, such as VCR, especially under both low shot and domain shifted scenarios, which are more reflective of the challenges encountered in practice. In this work, we address several of these mentioned gaps.”) 
generating a prediction from an output layer of the trained fusion neural network. (Wang, pg. 20: “We submitted our single model test prediction results to the VCR public leaderboard, which is currently listed as the 12th entry overall (including ensemble models and models without any reference or publi cation) and achieves State-Of-The-Art (SOTA) performance on VCR compared to other public single models 6 that are pretrained with image-text data.”) 
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Wu into Wang. Wang teaches leveraging unimodal vision and text encoders for Vision-Language tasks that augment existing Vision-Language approaches while conserving computational complexity. Wu teaches an early fusion network is provided that reduces network load. One of ordinary skill would have been motivated to combine the teachings of Wu into Wang in order to enable easier design of specialized ASIC edge processors through performing a portion of convolutional neural network layers at distributed edge and data-network processors (Wu, abstract). 
It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Ke into Wang, as modified. Ke teaches DeepGBM, which integrates the advantages of the both NN and GBDT by using two corresponding NN components. One of ordinary skill would have been motivated to combine the teachings of Ke into Wang, as modified in order to leverage both categorical and numerical features while retaining the ability of efficient online update (Ke, abstract). 

Regarding claims 2, 9, and 16, Wang, as modified teaches: 
wherein each unimodal teacher model is uniquely coupled, directly or indirectly, to a student encoder. (Wang pg. 5: “A wide range of works have explored to supervise the student model by mimicking different components of the teacher, e.g. the distribution of self-attention, intermediate representations of transformer blocks, last layer features, etc. to increase performances [59,50,8]” Examiner notes Wang teaches a particular student model, “the student model”, mimicking a particular teacher, “the teacher” and teaches use of encoders on page 3).

Regarding claims 3, 10, and 17, Wang, as modified teaches: 
wherein the plurality of unimodal teacher models comprises a convolutional neural network (CNN) (Wu, col 2:20-25: “Embodiments of the present invention provide an early fusion network that reduces network load through performing a portion of convolutional neural network layers at distributed edge and data-network processors prior to transmitting data to a centralized processor for fully-connected/deconvolutional neural networking processing.”) teacher and a gradient boosted decision tree (GBDT) teacher. (Ke, pg. 387: “In this section, we will elaborate on how the new proposed learning framework, DeepGBM, integrates NN and GBDT together to obtain a more effective model for generic online prediction tasks” Examiner notes Ke teaches integrated use of a generic NN and GBDT together, where Wu teaches specifically a convolutional neural network) .
	It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Ke into Wang, as modified. Ke teaches DeepGBM, which integrates the advantages of the both NN and GBDT by using two corresponding NN components. One of ordinary skill would have been motivated to combine the teachings of Ke into Wang, as modified in order to leverage both categorical and numerical features while retaining the ability of efficient online update (Ke, abstract). 
Regarding claims 4, 11, and 18, Wang, as modified teaches: 
wherein the CNN teacher is coupled to an image student encoder (Wu, col 4:34-37: “As an example of the distributed processing of embodiments, a preprocessor can project the lidar data to different views with each view processed by a different convolutional neural network (CNN)”) and the GBDT teacher is coupled to a sensor student encoder. (Wu, col. 7:19-24: “ FIG. 5B illustrates a sensor fusion system architecture having uni-modal sensors (e.g., cameras) each having a separate, but overlapping, field of view. Edge sensors 520 perform preprocessing and feature extraction and the extracted features from each sensor module are transported over network 525 to central compute server 530.” Examiner note Wu teaches sensor data being processed at the edge itself while Wang at pg 3 teaches use of encoders, generally).
Regarding claims 5, 12, and 19, Wang, as modified teaches:
further comprising generating, from each unimodal teacher model, a soft-label output. (Wang, pg. 23 and table S5: “VQA is different from VCR and SNLI-VE in terms of number of answer labels for each sample data. For every image-question pair, VCR provides four answer choices and for every premise-hypothesis pair, only three answer labels are provided” Examiner notes Wang teaches a soft-label output including a top label candidate).  

Regarding claims 6, 13, and 20, Wang, as modified teaches:
further comprising enforcing knowledge distillation on the output layer by forcing the output layer of the multimodal neural network to approximate an aggregated probability output of the plurality of unimodal teacher models via a loss term of an objective function of the multimodal neural network. (Wang, pg. 6: “Both the teacher and student model’s visual img5 tokens, in addition to the teacher’s text eos token and the student’s text cls token, are compared via L1 measure. The final loss is the weighted distillation loss Ld summed with the original task loss Lt for any specific downstream task”)
	It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Ke into Wang, as modified. Ke teaches DeepGBM, which integrates the advantages of the both NN and GBDT by using two corresponding NN components. One of ordinary skill would have been motivated to combine the teachings of Ke into Wang, as modified in order to leverage both categorical and numerical features while retaining the ability of efficient online update (Ke, abstract). 
Regarding claims 7 and 14, Wang, as modified teaches: 
wherein knowledge distillation comprises training a neural network with one or more tree features to approximate a tree group structure of a gradient boosted decision tree (GBDT) teacher. (Ke, pg. 387: “Fortunately, as NN has been proven powerful enough to approximate any functions [19], we can use an NN model to approximate the function of the tree structure and achieve the structure knowledge distillation. Therefore, as illustrated in Fig.2, we can use NN to fit the cluster results pro duced by the tree, to let NN approximate the structure function of decision tree”)
	It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Ke into Wang, as modified. Ke teaches DeepGBM, which integrates the advantages of the both NN and GBDT by using two corresponding NN components. One of ordinary skill would have been motivated to combine the teachings of Ke into Wang, as modified in order to leverage both categorical and numerical features while retaining the ability of efficient online update (Ke, abstract). 
Prior Art
Yu, et. al. (“Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis.” Proceedings of the AAAI Conference on Artificial Intelligence, 35(12), 10790-10797.) teaches a label generation module based on the self-supervised learning strategy to acquire independent unimodal supervisions followed by joint training the multimodal and uni-modal tasks to learn the consistency and difference, respectively.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sally T. Ley whose telephone number is (571)272-3406. The examiner can normally be reached Monday - Thursday, 10:00am - 6:00pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Viker Lamardo can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/STL/Examiner, Art Unit 2147                                                                                                                                                                                                        
/VIKER A LAMARDO/Supervisory Patent Examiner, Art Unit 2147

Read full office action

Prosecution Timeline

Jun 01, 2023

Application Filed

Mar 18, 2026

Non-Final Rejection mailed — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

17/981,796

Patent 12632746

A METHOD AND APPARATUS FOR DISPLAYING CATEGORIZED CARBON EMISSIONS

3y 6m to grant Granted May 19, 2026

16/733,393

Patent 12443830

COMPRESSED WEIGHT DISTRIBUTION IN NETWORKS OF NEURAL PROCESSORS

5y 9m to grant Granted Oct 14, 2025

16/835,892

Patent 12135927

EXPERT-IN-THE-LOOP AI FOR MATERIALS DISCOVERY

4y 7m to grant Granted Nov 05, 2024

17/992,958

Patent 11880776

GRAPH NEURAL NETWORK (GNN)-BASED PREDICTION SYSTEM FOR TOTAL ORGANIC CARBON (TOC) IN SHALE

1y 2m to grant Granted Jan 23, 2024

Study what changed to get past this examiner. Based on 4 most recent grants.

Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.

Typically takes 5-10 seconds — AI-generated, attorney review required before filing

Prosecution Projections

1-2

Expected OA Rounds

19%

Grant Probability

53%

With Interview (+33.3%)

4y 8m (~1y 8m remaining)

Median Time to Grant

Low

PTA Risk

Based on 36 resolved cases by this examiner. Grant probability derived from career allowance rate.