DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 08/06/2025 has been entered.
Status of Claims
The amendments filed on 08/06/2025 have been entered, and examination proceeds on the basis of the claims filed on 08/06/2025. The status of the claims is as follows:
Claims 1-20 remain pending in the application.
Claims 1, 11, and 20 have been amended.
Response to Arguments
In reference to the rejections under 35 U.S.C. 101:
Applicant asserts in Remarks, pp. 9-12, that the claims recite significantly more than an abstract idea because they incorporate a concrete technological improvement to machine learning training. Specifically, the claims require extracting common and unique features from different data modalities (e.g., text and image/video), generating merged features via early fusion, and training a model using both merged and unique features. This approach reduces redundant multimedia data, lowers storage and labeling burdens, and improves system memory usage and performance, directly addressing technical problems associated with conventional training methods that rely on large amounts of labeled multimedia data. Applicant contends that these elements are not merely "apply it" instructions, but instead reflect an inventive concept grounded in the specific feature extraction, fusion, and training techniques disclosed in the specification, rendering the claims patent-eligible under 35 U.S.C. 101.
In response to Applicant's arguments, the rejection under 35 U.S.C. 101 is maintained. As set forth in the Office Action, the claimed steps of identifying features, generating merged features, training a model using merged and unique features, and classifying content are directed to abstract mental processes that can be performed in the human mind or by a computer used as a tool, and therefore fall within a judicial exception under Step 2A. The recitation of "processing circuits," feature extraction, feature merging, and model training does not meaningfully limit the claims to a specific technological implementation, but instead amounts to merely applying the abstract idea using generic computer components. Further, under Step 2B, the claims do not recite additional elements that amount to significantly more than the abstract idea, because the alleged improvements relate to the use of the abstract idea itself (i.e., organizing and combining information for training) rather than to an improvement in the functioning of a computer or another technology. Accordingly, the claims fail to integrate the judicial exception into a practical application and remain ineligible under 35 U.S.C. 101.
Applicant’s arguments filed on 08/06/2025 have been fully considered but they are not persuasive.
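For purposes of illustration only, and not as part of the claims or the cited art, the following minimal Python sketch shows the kind of early-fusion training flow at issue. All function names, the feature layout, and the maximum-based fusion are hypothetical assumptions, not limitations drawn from the record.

import numpy as np

def extract_text_features(texts):
    # Hypothetical extractor: one 8-dimensional feature vector per text element.
    return np.random.rand(len(texts), 8)

def extract_image_features(images):
    # Hypothetical extractor: one 8-dimensional feature vector per image/video element.
    return np.random.rand(len(images), 8)

# Assumed layout: the first four dimensions are "common" across modalities,
# the last four are "unique" to each modality.
COMMON, UNIQUE = slice(0, 4), slice(4, 8)

def build_training_inputs(texts, images):
    first = extract_text_features(texts)      # first features (text)
    second = extract_image_features(images)   # second features (image/video)
    merged = np.maximum(first[:, COMMON], second[:, COMMON])  # early fusion of common features
    unique_first = first[:, UNIQUE]     # subset unique to the first data type
    unique_second = second[:, UNIQUE]   # subset unique to the second data type
    # A model would then be trained on the merged features together with the
    # modality-unique subsets.
    return np.concatenate([merged, unique_first, unique_second], axis=1)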
In reference to the rejections under 35 U.S.C. 103:
Argument:
Applicant asserts in Remarks, pp. 12-14, that the 35 U.S.C. 103 rejection is improper because the cited references, alone or in combination, fail to teach or suggest the amended claim requirement of training a model using both merged features representing common features across modalities and separate subsets of features that are unique to each respective data type. In particular, while Kennedy discloses extracting embeddings from text and images, it does not disclose distinguishing and separately using features unique to each modality in conjunction with merged features for model training, as now expressly required by the claims. Applicant further contends that Zhang does not disclose the claimed feature partitioning, and that Leonardo is not cited for this limitation. Accordingly, because none of Zhang, Leonardo, or Kennedy teaches the claimed use of unique features separately from common (merged) features, and no articulated combination remedies this deficiency, Applicant asserts that the references do not render the amended claims obvious under 35 U.S.C. 103.
Response:
Applicant's argument is not persuasive. As mapped in the Office Action, Kennedy expressly discloses training a model using combined caption-image features together with modality-specific information. Kennedy teaches generating caption-image pairs, extracting features from both captions and images to produce feature embeddings, and concatenating the embeddings for each image/caption pair, which corresponds to the claimed "merged features." Kennedy further discloses extracting content features from images and content representations and clustering or dividing images or content representations into subsets based on similarities, which evidences features specific to a given data type. As shown in Col. 6, Kennedy discloses passing captions through a text modeling network to generate low-dimensional text embeddings and separately passing images through an image modeling subnetwork (i.e., a CNN trained on ImageNet) to generate image feature representations. These text embeddings and image embeddings are derived from their respective data types and therefore constitute features unique to the corresponding modality. Kennedy further teaches that, for each image-caption pair, these representations may be concatenated together and provided as an input to downstream scoring and ranking models that are trained using labeled training pairs, thereby forming a combined (merged) representation used for model training. The fact that the merged representation is created by concatenation of modality-specific embeddings indicates the presence and use of both common cross-modal information (via the paired image-caption input) and features originating uniquely from each modality (see Kennedy, Col. 6, Lines 38-63). Thus, Kennedy's embeddings necessarily include both common (cross-modal) information and modality-specific feature information derived from the underlying text and image data. Applicant's attempt to distinguish Kennedy on the basis that it allegedly does not "separately use" unique features is not supported by the claim language, which does not require distinct training stages or the exclusion of modality-specific features from the merged representation. Applicant's distinction between "merged features" and "unique features" is therefore not supported by the claim language, which broadly recites training based on merged features and portions of the first and second features. Accordingly, Kennedy discloses the claimed training using merged features in combination with features derived from each respective data type, and the 35 U.S.C. 103 rejection is maintained.
Applicant’s arguments filed on 08/06/2025 have been fully considered but they are not persuasive.
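As a purely illustrative aid to the claim-interpretation point above (the vector contents and sizes are hypothetical and are not drawn from Kennedy), concatenating modality-specific embeddings yields a single merged vector in which each modality's segment remains identifiable:

import numpy as np

text_embedding = np.array([0.2, 0.7, 0.1])    # output of a text modeling network
image_embedding = np.array([0.9, 0.4, 0.3])   # output of an image modeling subnetwork

# One merged representation per image-caption pair.
pair_vector = np.concatenate([text_embedding, image_embedding])

# The modality-specific segments remain present within the merged vector.
assert np.array_equal(pair_vector[:3], text_embedding)
assert np.array_equal(pair_vector[3:], image_embedding)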
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1:
Step 1 – Is the claim to a process, machine, manufacture or composition of matter?
Yes, the claim is a process.
Step 2A – Prong 1 – Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes, the claim recites an abstract idea.
identifying, [by the one or more processing circuits], first features of each of the plurality of first data elements; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
identifying, [by the one or more processing circuits], second features of each of the plurality of second data elements; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
generating, [by the one or more processing circuits], merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, wherein the first feature and the second feature each represent a common feature; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
classify a content item based on the model, wherein the content item includes content text and at least one of a content image or a content video. – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
Step 2A – Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, there are no additional elements that integrate the judicial exception into a practical application. The additional elements are:
by one or more processing circuits – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
receiving, [by one or more processing circuits], a plurality of first data elements of a first data type and a plurality of second data elements of a second data type, wherein the first data type is text data and the second data type is at least one of image data or video data; – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
training, [by the one or more processing circuits], a model based on the merged features and at least a first portion of the first features and a second portion of the second features, – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements; and – This limitation likewise amounts to adding the words "apply it" (or an equivalent) to the judicial exception (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
Step 2B – Does the claim recite additional elements that amount to significantly more than the judicial exception?
No, there are no additional elements that amount to significantly more than the judicial exception. The additional elements are:
by one or more processing circuits – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
receiving, [by one or more processing circuits], a plurality of first data elements of a first data type and a plurality of second data elements of a second data type, wherein the first data type is text data and the second data type is at least one of image data or video data; – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
training, [by the one or more processing circuits], a model based on the merged features and at least a first portion of the first features and a second portion of the second features, – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore does not amount to significantly more than the judicial exception.
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements; and – This limitation likewise amounts to adding the words "apply it" (or an equivalent) to the judicial exception (see MPEP 2106.05(f)), and therefore does not amount to significantly more than the judicial exception.
Regarding claim 2,
Claim 2 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 1, which recites an abstract idea (see the rejection of claim 1). The additional limitations are:
wherein each of the plurality of first data elements is associated with one of the plurality of second data elements; – This limitation merely recites a further limitation on the "receiving, [by one or more processing circuits], a plurality of first data elements of a first data type and a plurality of second data elements of a second data type, wherein the first data type is text data and the second data type is at least one of image data or video data" step of claim 1, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
wherein generating, [by the one or more processing circuits], the merged features comprises combining the first feature of the first features of each of the plurality of first data elements with the second feature of the second features of the one of the plurality of second data elements that each of the plurality of first data elements is associated with. – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
by the one or more processing circuits – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
Regarding claim 3,
Claim 3 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 1, which recites an abstract idea (see the rejection of claim 1). The additional limitations are:
wherein identifying, [by the one or more processing circuits], the first features and the second features comprise applying one or more models to the plurality of first data elements and the plurality of second data elements – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
wherein the one or more models extract the first features from the plurality of first data elements and extract the second features from the plurality of second data elements. – This limitation is directed to insignificant extra-solution activity – mere data outputting (see MPEP 2106.05(g)).
by the one or more processing circuits – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
Regarding claim 4,
Claim 4 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 3, which recites an abstract idea (see the rejection of claim 3). The additional limitations are:
wherein the one or more models include at least one of an image embedding model, a video embedding model, an object recognition model, an audio translation model, and an optical character recognition model. – This limitation merely recites a further limitation on the "training, [by the one or more processing circuits], a model based on the merged features and at least a first portion of the first features and a second portion of the second features" step of claim 1, which amounted to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
Regarding claim 5,
Claim 5 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 1, which recites an abstract idea (see the rejection of claim 1). The additional limitations are:
wherein combining the first feature with the second feature comprises performing an operation on a first value of the first feature representing a first confidence of the first feature with a second value of the second feature representing a second confidence of the second feature. – This limitation is directed to a mathematical calculation (see MPEP 2106.04(a)(2), subsection I.C), as it combines the first feature with the second feature using a mathematical operation, such as summation, subtraction, multiplication, averaging, or determining a median (see Instant Specification ¶[0039]).
Regarding claim 6,
Claim 6 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 5, which recites an abstract idea (see the rejection of claim 5). The additional limitations recite that the operation is at least one of:
a maximum operation that selects a maximum of the first value and the second value;
a summation operation that sums the first value and the second value;
a median operation that determines a median of the first value and the second value; and
a minimum operation that selects a minimum of the first value and the second value.
Each alternative merely recites a further limitation on the combining operation of claim 5, which was directed to a mathematical calculation (see MPEP 2106.04(a)(2), subsection I.C), as it combines the first feature with the second feature using a mathematical operation, such as summation, subtraction, multiplication, averaging, or determining a median (see Instant Specification ¶[0039]).
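For illustration only (scalar confidence values and the function name are assumptions, not drawn from the claims or the specification), the four recited operations can be sketched in Python as follows:

import statistics

def merge_confidence_values(first_value, second_value, operation):
    # Combine two per-feature confidence values using one of the recited operations.
    operations = {
        "maximum": max,
        "summation": lambda a, b: a + b,
        "median": lambda a, b: statistics.median([a, b]),
        "minimum": min,
    }
    return operations[operation](first_value, second_value)

# Example: merge_confidence_values(0.8, 0.6, "median") returns 0.7.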
Regarding claim 7,
Claim 7 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 1, which recites an abstract idea (see the rejection of claim 1). The additional limitations are:
by the one or more processing circuits – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
receiving, [by the one or more processing circuits], a data element comprising a first data element of the first data type and a second data element of the second data type; – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
extracting, [by the one or more processing circuits], first inference features of the first data element and second inference features of the second data element; – This limitation is directed to insignificant extra-solution activity – mere data outputting (see MPEP 2106.05(g)).
generating, [by the one or more processing circuits], one or more merged features by combining one or more of the first inference features with one or more of the second inference features, wherein each of the one or more of the first inference features is a particular common feature to one of the one or more of the second inference features; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
identifying, [by the one or more processing circuits], unique first classification features of the first subset of features unique to the first data type; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
identifying, [by the one or more processing circuits], unique second classification features of the second subset of features unique to the second data type; and – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
generating, [by the one or more processing circuits], a model output of the model by applying the one or more merged features, the unique first classification features, the unique second classification features as inputs to the model. – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
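For illustration only, the inference flow recited in claim 7 can be sketched as follows (the feature layout, the maximum-based merge, and the scikit-learn-style predict interface are assumptions, not the claimed implementation):

import numpy as np

# Assumed layout: first four dimensions common across modalities, last four unique.
COMMON, UNIQUE = slice(0, 4), slice(4, 8)

def generate_model_output(model, first_inference_features, second_inference_features):
    # Merge the common inference features across the two modalities.
    merged = np.maximum(first_inference_features[COMMON],
                        second_inference_features[COMMON])
    unique_first = first_inference_features[UNIQUE]    # unique first classification features
    unique_second = second_inference_features[UNIQUE]  # unique second classification features
    model_input = np.concatenate([merged, unique_first, unique_second])
    # Apply the merged and unique features as inputs to the trained model.
    return model.predict(model_input.reshape(1, -1))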
Regarding claim 8,
Claim 8 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 1, which recites an abstract idea (see the rejection of claim 1). The additional limitations are:
wherein the first data type is a text based data type and the second data type is at least one of an image data type or a video data type. – This limitation merely recites a further limitation on the "receiving, [by one or more processing circuits], a plurality of first data elements of a first data type and a plurality of second data elements of a second data type, wherein the first data type is text data and the second data type is at least one of image data or video data" step of claim 1, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
Regarding claim 9,
Claim 9 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 8, which recites an abstract idea (see the rejection of claim 8). The additional limitations are:
wherein at least a first portion of the plurality of first data elements are associated with first data element labels and at least a second portion of the plurality of second data elements are associated with second data element labels, – This limitation merely recites a further limitation on the "receiving, [by one or more processing circuits], a plurality of first data elements of a first data type and a plurality of second data elements of a second data type, wherein the first data type is text data and the second data type is at least one of image data or video data" step of claim 1, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
wherein a first number of the first data element labels is greater than a second number of the second data element labels; – This limitation likewise merely recites a further limitation on the receiving step of claim 1, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
wherein training, [by the one or more processing circuits], the model is further based on the first data element labels and the second data element labels – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
by the one or more processing circuits – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
Regarding claim 10,
Claim 10 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 8, which recites an abstract idea (see the rejection of claim 8). The additional limitations are:
wherein at least a first portion of the plurality of first data elements are associated with first data element labels and none of the plurality of second data elements are associated with second data element labels; – This limitation merely recites a further limitation on the "receiving, [by one or more processing circuits], a plurality of first data elements of a first data type and a plurality of second data elements of a second data type, wherein the first data type is text data and the second data type is at least one of image data or video data" step of claim 1, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
wherein training, [by the one or more processing circuits], the model is further based on the first data element labels. – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
by the one or more processing circuits – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
Regarding claim 11:
Step 1 – Is the claim to a process, machine, manufacture or composition of matter?
Yes, the claim is directed to a machine (a system).
Step 2A – Prong 1 – Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes, the claim recites an abstract idea.
identify first features of each of the plurality of first data elements; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
identify second features of each of the plurality of second data elements; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
generate merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, wherein the first feature and the second feature each represent a common feature; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
Step 2A – Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, there are no additional elements that integrate the judicial exception into a practical application. The additional elements are:
A system including one or more memory devices configured to store instructions thereon, that, when executed by one or more processors, cause the one or more processors to – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type; – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
train a model based on the merged features and at least a first portion of the first features and a second portion of the second features, – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements – This limitation likewise amounts to adding the words "apply it" (or an equivalent) to the judicial exception (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
Step 2B – Does the claim recite additional elements that amount to significantly more than the judicial exception?
No, there are no additional elements that amount to significantly more than the judicial exception. The additional elements are:
A system including one or more memory devices configured to store instructions thereon, that, when executed by one or more processors, cause the one or more processors to – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type; – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
train a model based on the merged features and at least a first portion of the first features and a second portion of the second features, – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore does not amount to significantly more than the judicial exception.
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements – This limitation likewise amounts to adding the words "apply it" (or an equivalent) to the judicial exception (see MPEP 2106.05(f)), and therefore does not amount to significantly more than the judicial exception.
Regarding claim 12,
Claim 12 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 11, which recites an abstract idea (see the rejection of claim 11). The additional limitations are:
wherein each of the plurality of first data elements is associated with one of the plurality of second data elements; – This limitation merely recites a further limitation on the "receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type" step of claim 11, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
wherein the instructions cause the one or more processors to – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
generate the merged features comprises combining the first feature of the first features of each of the plurality of first data elements with the second feature of the second features of the one of the plurality of second data elements that each of the plurality of first data elements is associated with. – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
Regarding claim 13,
Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 11, which recites an abstract idea (see the rejection of claim 11). The additional limitations are:
wherein the instructions cause the one or more processors to – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
identify the first features and the second features comprise applying one or more models to the plurality of first data elements and the plurality of second data elements, – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
wherein the one or more models extract the first features from the plurality of first data elements and extract the second features from the plurality of second data elements. – This limitation is directed to insignificant extra-solution activity – mere data outputting (see MPEP 2106.05(g)).
Regarding claim 14,
Claim 14 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 11, which recites an abstract idea (see the rejection of claim 11). The additional limitations are:
wherein combining the first feature with the second feature comprises performing an operation on a first value of the first feature representing a first confidence of the first feature with a second value of the second feature representing a second confidence of the second feature. – This limitation is directed to a mathematical calculation (see MPEP 2106.04(a)(2), subsection I.C), as it combines the first feature with the second feature using a mathematical operation, such as summation, subtraction, multiplication, averaging, or determining a median (see Instant Specification ¶[0039]).
Regarding claim 15,
Claim 15 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 11, which recites an abstract idea (see the rejection of claim 11). The additional limitations are:
wherein the instructions cause the one or more processors to: – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
receive a data element comprising a first data element of the first data type and a second data element of the second data type; – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
extract first inference features of the first data element and second inference features of the second data element; – This limitation is directed to insignificant extra-solution activity – mere data outputting (see MPEP 2106.05(g)).
generate one or more merged features by combining one or more of the first inference features with one or more of the second inference features, wherein each of the one or more of the first inference features is a particular common feature to one of the one or more of the second inference features; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
identify unique first classification features of the first subset of features unique to the first data type; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
identify unique second classification features of the second subset of features unique to the second data type; and – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
generate a model output of the model by applying the one or more merged features, the unique first classification features, the unique second classification features as inputs to the model. – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
Regarding claim 16,
Claim 16 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 15, which recites an abstract idea (see the rejection of claim 15). The additional limitations are:
wherein the data element is a content item comprising multiple content types, wherein the first data element is text data while the second data element is at least one of image data or video data. – This limitation merely recites a further limitation on the "receive a data element comprising a first data element of the first data type and a second data element of the second data type" step of claim 15, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
Regarding claim 17,
Claim 17 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 11, which recites an abstract idea (see the rejection of claim 11). The additional limitations are:
wherein the first data type is a text based data type and the second data type is at least one of an image data type or a video data type. – This limitation merely recites a further limitation on the "receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type" step of claim 11, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
Regarding claim 18,
Claim 18 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 17, which recites an abstract idea (see the rejection of claim 17). The additional limitations are:
wherein at least a first portion of the plurality of first data elements are associated with first data element labels and at least a second portion of the plurality of second data elements are associated with second data element labels, – This limitation merely recites a further limitation on the "receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type" step of claim 11, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
wherein a first number of the first data element labels is greater than a second number of the second data element labels; – This limitation likewise merely recites a further limitation on the receiving step of claim 11, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
wherein the instructions cause the one or more processors to – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
train the model further based on the first data element labels and the second data element labels. – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
Regarding claim 19,
Claim 19 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 17, which recites an abstract idea (see the rejection of claim 17). The additional limitations are:
wherein at least a first portion of the plurality of first data elements are associated with first data element labels and none of the plurality of second data elements are associated with second data element labels; – This limitation merely recites a further limitation on the "receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type" step of claim 11, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
Regarding claim 20:
Step 1 – Is the claim to a process, machine, manufacture or composition of matter?
Yes, the claim is directed to an article of manufacture (one or more computer readable storage media).
Step 2A – Prong 1 – Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes, the claim recites an abstract idea.
identify first features of each of the plurality of first data elements; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
identify second features of each of the plurality of second data elements; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
generate merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, wherein the first feature and the second feature each represent a common feature; – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, or opinion) which can be performed in the human mind or by a human using pen and paper (see MPEP 2106.04(a)(2), subsection III.C).
Step 2A – Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, there are no additional elements that integrate the judicial exception into a practical application. The additional elements are:
One or more computer readable storage media configured to store instructions thereon that, when executed by one or more processors, cause the one or more processors to: – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type; – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
train a model based on the merged features and at least a first portion of the first features and a second portion of the second features, – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements – This limitation likewise amounts to adding the words "apply it" (or an equivalent) to the judicial exception (see MPEP 2106.05(f)), and therefore fails to integrate the exception into a practical application.
Step 2B – Does the claim recite additional elements that amount to significantly more than the judicial exception?
No, there are no additional elements that amount to significantly more than the judicial exception. The additional elements are:
One or more computer readable storage media configured to store instructions thereon that, when executed by one or more processors, cause the one or more processors to: – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type; – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
train a model based on the merged features and at least a first portion of the first features and a second portion of the second features, – This limitation amounts to adding the words "apply it" (or an equivalent) to the judicial exception, mere instructions to implement an abstract idea on a computer, or mere use of a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)), and therefore does not amount to significantly more than the judicial exception.
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements – This limitation likewise amounts to adding the words "apply it" (or an equivalent) to the judicial exception (see MPEP 2106.05(f)), and therefore does not amount to significantly more than the judicial exception.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-10, 13-16, and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. ("Cross-Modal and Hierarchical Modeling of Video and Text") (hereafter "Zhang") in view of Leonardo et al. ("Fusing Visual and Textual Information to Determine Content Safety") (hereafter "Leonardo"), and further in view of Kennedy et al. (US 10,789,284 B2) (hereafter "Kennedy").
As per claim 1, Zhang explicitly discloses:
A method comprising: receiving, by one or more processing circuits, a plurality of first data elements of a first data type and a plurality of second data elements of a second data type, wherein the first data type is text data and the second data type is at least one of image data or video data; (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.
[image of Zhang, Figure 1, omitted]
”) [Examiner’s note: the first data elements are interpreted as the multiple sentences, i.e., text data, and the second data elements as the multiple clips, i.e., video data]
identifying, by the one or more processing circuits, first features of each of the plurality of first data elements; (Zhang, Page 4, Section 3.1: “Likewise, we assume there is a paragraph of texts describing the video. The paragraph p contains n sentences, one for each video clip. Let s_i denote the i-th sentence and w_ij the feature for the j-th word out of n'_i words.”) [Examiner’s note: the first features are interpreted as the features w_ij for the j-th word of the multiple sentences (i.e., the first data elements, as shown above) for text data]
identifying, by the one or more processing circuits, second features of each of the plurality of second data elements; (Zhang, Page 4, Section 3.1: “A video v has n clips (or subshots), where each clip c_i contains n_i frames. Each frame is represented by a visual feature vector x_ij.”) [Examiner’s note: the second features are interpreted as the visual feature vectors x_ij of the multiple clips (i.e., the second data elements, as shown above) for video data]
generating, by the one or more processing circuits, merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, (Zhang, Page 12, Section 4.3: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator.”) [Examiner’s note: the merged features are interpreted as the learned embeddings for video captioning]
wherein the first feature and the second feature each represent a common feature; (Zhang, Page 8, Section 4.1: “Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips.”, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips) are mapped into a local embedding space where the corresponding pairs of clips and sentences are placed close to each other. As a whole, the videos and the paragraphs are mapped into a global semantic space where their embeddings are close.
[image of Zhang, Figure 1, omitted]
”) [Examiner’s note: the pairs of clips and sentences shown in red, blue, and green share a common feature, i.e., a corresponding paragraph with sentences aligned to the clips]
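Purely as an illustration of the cross-modal correspondence relied on above (the embedding dimensionality and the use of cosine similarity are assumptions; this is not Zhang's code), aligned sentence/clip pairs can be modeled as nearby points in a shared embedding space:

import numpy as np

rng = np.random.default_rng(0)

# Assume three sentences and their three aligned clips, embedded in a shared
# 16-dimensional local space so that corresponding pairs lie close together.
sentence_embeddings = rng.random((3, 16))
clip_embeddings = sentence_embeddings + 0.01 * rng.random((3, 16))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each aligned sentence/clip pair is nearly identical in the shared space.
for s, c in zip(sentence_embeddings, clip_embeddings):
    assert cosine_similarity(s, c) > 0.99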
Zhang fails to disclose:
by one or more processing circuits
training, by the one or more processing circuits, a model based on the merged features and at least a first portion of the first features and a second portion of the second features,
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements
classify a content item based on the model, wherein the content item includes content text and at least one of a content image or a content video
However, Kennedy explicitly discloses:
by one or more processing circuits (Kennedy, Col. 16, Lines 59-67: “Processor(s) 810 can execute under any operating system (OS) (not shown), in a native or virtual environment. One or more applications can be deployed that include logic unit 855, application programming interface (API) unit 860, input unit 865, output unit 870, pairing unit 875, feature extraction unit 880, ranking unit 885, pair association unit 890, and inter-unit communication mechanism 895 for the different units to communicate with each other, with the OS, and with other applications (not shown).”)
training, by the one or more processing circuits, a model based on the merged features and at least a first portion of the first features and a second portion of the second features, (Kennedy, Col. 5, Lines 30-47: “As illustrated in FIG. 1, user input is received by the system at 105. This user input may include a user's set of images and user-provided captions that the user would like illustrated. After the user input has been received, the system may generate caption-image pairs at 110 by pairing each provided caption with each possible image in the user provided image set. At 115, features may be extracted for both the captions and images to produce featuring embeddings associated with both the captions and the images. The feature extraction of 115 may be performed using a neural network that has been trained at 135 using training data. The training data may include user manually-generated image collection or photo books that have been captioned by users using photo album editing or photo book generation platforms, which may be web-based, mobile application based, or installed on the local machine. For each image/caption pair, the features (embeddings) may be concatenated together.”) [Examiner’s note: “the merged features” is being interpreted as the “image-caption pair data”, “a first portion of the first features and a second portion of the second features” is being interpreted as the “featuring embeddings associated with both the captions and the images”]
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements
(Kennedy, Col. 11, Lines 56-63: “In the process 500, a plurality of images or content representations may be clustered or divided into subsets of images or content representations based on content similarities between different images or content representations at 505. This may include extracting content features from the content representations or images to produce featuring embeddings. The feature extraction may be performed using a neural network that has been trained using training data.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Kennedy. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Kennedy teaches a system and method for associating textual summaries with content media. One of ordinary skill would have been motivated to combine Zhang and Kennedy to enable the model to learn joint representations that capture correlations across different data types (e.g., text and image), while the retention of unique feature subsets preserves modality-specific information that may otherwise be diluted or lost during fusion. This approach enhances learning capacity and flexibility, particularly in multi-modal systems where different data sources contribute complementary insights.
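The rationale above (joint representations plus retained modality-unique subsets) can be illustrated with the following minimal sketch. It assumes scikit-learn is available; the feature blocks and labels are synthetic placeholders, not data from the cited references.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
merged = rng.normal(size=(n, 32))       # early-fused common features
text_unique = rng.normal(size=(n, 16))  # subset unique to the text modality
img_unique = rng.normal(size=(n, 16))   # subset unique to the image modality
y = rng.integers(0, 2, size=n)

# The training input keeps the merged block and both unique subsets side by
# side, so modality-specific information survives the fusion step.
X = np.hstack([merged, text_unique, img_unique])
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.score(X, y))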
However, Leonardo explicitly discloses:
classify a content item based on the model, wherein the content item includes content text and at least one of a content image or a content video. (Leonardo, Page 2027, Col. 1, Section II, ¶2]: “Since web pages are composed of images and text, we use multimodal machine learning for content safety classification. Multimodal machine learning is an approach used to fuse various modalities, or ways in which data is perceived and experienced, to improve the classification of data”, Page 2027, Col. 1, Section II, ¶[3]: “Most of the research so far in content safety classification is done using textual information. However, when classifying the content safety of web pages, it is important to consider both visual and textual data. Our work uses a multimodal approach, combining visual and textual features from web pages, for content safety classification in advertising.”) [Examiner’s note: classify a content item i.e., classify web pages which contain visual and textual features for content safety in advertising, model i.e., the multimodal machine learning model]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve predictions of content safety. One of ordinary skill would have been motivated to combine Zhang and Leonardo to improve the classification of data by using multimodal machine learning to fuse various modalities, which helps to enhance the accuracy of the classification model (Leonardo, Page 2027, Col. 1, Section II).
As per claim 2, the combination of Zhang, Kennedy and Leonardo discloses all the limitations of claim 1 (as shown in the rejections above).
Zhang in view of Kennedy and Leonardo further discloses:
wherein each of the plurality of first data elements is associated with one of the plurality of second data elements; (Zhang, Page 8, Section 4.1, ¶[1]: “Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips.”, Page 2, Figure 1:
[image reproduced in original: media_image1.png, greyscale]
”) [Examiner’s note: first data elements i.e., multiple sentences of text data, second data elements i.e., multiple clips of video data]
wherein generating, by the one or more processing circuits, the merged features comprises combining the first feature of the first features of each of the plurality of first data elements with the second feature of the second features of the one of the plurality of second data elements that each of the plurality of first data elements is associated with. (Zhang, Page 12, Section 4.3: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator.”) [Examiner’s note: merged features is being interpreted as the learned embeddings for video captioning]
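For illustration, the associated-pair merging addressed above (one sentence aligned to each clip, with features combined pairwise) can be sketched as follows. The feature dimensions are assumptions, not values from Zhang.

import numpy as np

rng = np.random.default_rng(2)
sentence_feats = rng.normal(size=(3, 300))  # one feature vector per sentence
clip_feats = rng.normal(size=(3, 512))      # one vector per associated clip

# Each first data element is associated with exactly one second data element,
# so merging reduces to row-wise concatenation over the aligned index.
merged = np.concatenate([sentence_feats, clip_feats], axis=1)
print(merged.shape)  # (3, 812): one merged feature per associated pair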
As per claim 3, the combination of Zhang, Kennedy and Leonardo discloses all the limitations of claim 1 (as shown in the rejections above).
Zhang in view of Kennedy and Leonardo further discloses:
wherein identifying, by the one or more processing circuits, the first features and the second features comprise (Zhang, Page 4, Section 3.1: “Likewise, we assume there is a paragraph of texts describing the video. The paragraph p contains n sentences, one for each video clip. Let si denote the ith sentence and wij the feature for the jth word out of n’i words.”, Zhang, Page 4, Section 3.1: “A video v has n clips (or subshots), where each clip ci contains ni frames. Each frame is represented by a visual feature vector xij .”) (Examiner’s note: features of first data elements is being interpreted as feature wij for the jth word of multiple sentences (i.e., first data elements as shown above) for text data, features of second data elements is being interpreted as a visual feature vector xij of multiple clips (i.e., second data elements as shown above) for video data)
applying one or more models to the plurality of first data elements and the plurality of second data elements, (Leonardo, Page 2026, Col. 2, ¶[2]: “In early fusion, intermediate features from separate modalities are extracted and jointly represented, then learned with one single model. Given separate, pre-trained computer vision (CV) and natural language processing (NLP) models, our fully automated framework uses late fusion to classify web pages as either safe or threat, as well as into 10 possible threat categories.”, Leonardo, Page 2027, Col. 1, Section III.A, ¶[1-2]: “We are analyzing two modalities present in a web page: visual signals and natural language. One classifier for each modality was trained to determine content safety. For the CV model, a Squeeze-and-Excitation (SE) Network [9] was trained using initial weights from ImageNet’s image classification task [10] and fine-tuned on web page images. For the NLP model, a Universal Language Model Finetuning Framework (ULMFiT) [11] was trained and fine-tuned.”) [Examiner’s note: “one or more models” is being interpreted as the pre-trained computer vision (CV) and natural language processing (NLP) models, first data elements i.e., natural language of web pages, second data elements i.e., visual signals or images of web pages]
wherein the one or more models extract the first features from the plurality of first data elements and extract the second features from the plurality of second data elements. (Leonardo, Page 2029, Col. 2, Section 2: “CNNs are used extensively for difficult classification tasks due to their ability to find patterns in the input data with high accuracy [22]. They are especially useful on spatial input data in which the output depends on the position of each individual feature. For this work, CNNs are used on the intermediate features extracted from the CV model and the NLP model to perform binary classification.”) [Examiner’s note: “one or more models” is being interpreted as the pre-trained computer vision (CV) and natural language processing (NLP) models, first and second features of first and second data elements i.e., the intermediate features of text data and image data]
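The two-stage arrangement quoted above (separate per-modality models produce intermediate features, and a further classifier consumes them) can be sketched as below. The stand-in extractors are hypothetical and replace the SE-Net and ULMFiT models named in Leonardo; scikit-learn is assumed.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cv_features(images):
    # Stand-in for a pre-trained computer-vision model's intermediate layer.
    return images.reshape(len(images), -1)[:, :10]

def nlp_features(texts):
    # Stand-in for a pre-trained language model's pooled representation.
    return texts[:, :10]

rng = np.random.default_rng(3)
images = rng.normal(size=(50, 8, 8))
texts = rng.normal(size=(50, 20))
y = rng.integers(0, 2, size=50)

# A downstream classifier is trained on the extracted intermediate features.
X = np.hstack([nlp_features(texts), cv_features(images)])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))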
As per claim 4, the combination of Zhang, Kennedy and Leonardo discloses all the limitations of claim 3 (as shown in the rejections above).
Zhang in view of Kennedy and Leonardo further discloses:
wherein the one or more models include at least one of an image embedding model, a video embedding model, an object recognition model, an audio translation model, and an optical character recognition model. (Leonardo, Page 2026, Col. 2, ¶[2]: “Given separate, pre-trained computer vision (CV) and natural language processing (NLP) models, our fully automated framework uses late fusion to classify web pages as either safe or threat, as well as into 10 possible threat categories.”, Page 2027, Col. 1, Section III.A: “For the CV model, a Squeeze-and-Excitation (SE) Network [9] was trained using initial weights from ImageNet’s image classification task [10] and fine-tuned on web page images (Figure 1).”) [Examiner’s note: computer vision model i.e., an object recognition model]
As per claim 5, the combination of Zhang, Kennedy and Leonardo discloses all the limitations of claim 1 (as shown in the rejections above).
Zhang in view of Kennedy and Leonardo further discloses:
wherein combining the first feature with the second feature comprises performing an operation on a first value of the first feature representing a first confidence of the first feature with a second value of the second feature representing a second confidence of the second feature. (Leonardo, Page 2028, Col. 1, Section IV.A.1: “Since unique web pages in the dataset are often associated with multiple images, the visual features of these images are combined first by choosing the minimum value of x0 (confidence score for TC0) and maximum value for xi (confidence score for threat TCi, i = 1, 2, . . . , 9), componentwise across all the images corresponding to the same web page. The method used for merging different x’s using the aforementioned criteria is named minmax.… Now, a single 10-dimensional vector represents the features from all the images in a web page and another 10-dimensional vector represents the textual features. In order to merge both and train them altogether, we can average each component, concatenate them both or use the minmax method. The results from any of the three methods arrive at xwp, a vector that’s comprised of visual and textual features from a web page, while y represents xwp’s ground truth.”) [Examiner’s note: an operation i.e., the minmax method, first value representing a first confidence i.e., the confidence score for visual features, the second value representing a second confidence i.e., the confidence score for textual features]
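The quoted “minmax” merge is concrete enough to sketch directly: across all images of one web page, take the minimum of the safe-confidence component x0 and the maximum of each threat-confidence component x1 through x9, componentwise. The data below is synthetic.

import numpy as np

rng = np.random.default_rng(4)
image_vecs = rng.random(size=(5, 10))  # 5 images, 10 confidence scores each

merged = np.empty(10)
merged[0] = image_vecs[:, 0].min()           # x0: least-confident "safe" wins
merged[1:] = image_vecs[:, 1:].max(axis=0)   # x1..x9: most-confident threat wins
print(merged)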
As per claim 6, the combination of Zhang, Kennedy and Leonardo discloses all the limitations of claim 5 (as shown in the rejections above).
Zhang in view of Kennedy and Leonardo further discloses:
wherein the operation is at least one of: a maximum operation that selects a maximum of the first value and the second value; a summation operation that sums the first value and the second value; a median operation that determines a median of the first value and the second value; and a minimum operation that selects a minimum of the first value and the second value. (Leonardo, Page 2028, Col. 1, Section IV.A.1: “Since unique web pages in the dataset are often associated with multiple images, the visual features of these images are combined first by choosing the minimum value of x0 (confidence score for TC0) and maximum value for xi (confidence score for threat TCi, i = 1, 2, . . . , 9), componentwise across all the images corresponding to the same web page. The method used for merging different x’s using the aforementioned criteria is named minmax.… Now, a single 10-dimensional vector represents the features from all the images in a web page and another 10-dimensional vector represents the textual features. In order to merge both and train them altogether, we can average each component, concatenate them both or use the minmax method. The results from any of the three methods arrive at xwp, a vector that’s comprised of visual and textual features from a web page, while y represents xwp’s ground truth.”) [Examiner’s note: first and second value i.e., the confidence score corresponding to visual features and textual features, the minmax operation i.e., selecting maximum value and selecting minimum value, concatenation operation i.e., summation operation]
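For illustration, the four claimed operations applied to a pair of confidence values can be sketched as follows; the two values are hypothetical placeholders.

import statistics

OPS = {
    "max": max,
    "min": min,
    "sum": lambda a, b: a + b,
    "median": lambda a, b: statistics.median([a, b]),
}

first_conf, second_conf = 0.8, 0.3  # hypothetical first and second values
for name, op in OPS.items():
    print(name, op(first_conf, second_conf))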
As per claim 7, the combination of Zhang, Kennedy and Leonardo discloses all the limitations of claim 1 (as shown in the rejections above).
Zhang in view of Kennedy and Leonardo further discloses:
receiving, by the one or more processing circuits, a data element comprising a first data element of the first data type and a second data element of the second data type; (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.
[image reproduced in original: media_image1.png, greyscale]
”) [Examiner’s note: a first data element i.e., the multiple sentences of text data, a second data element i.e., the multiple clips of video data]
by one or more processing circuits (Kennedy, Col. 16, Lines 59-67: “Processor(s) 810 can execute under any operating system (OS) (not shown), in a native or virtual environment. One or more applications can be deployed that include logic unit 855, application programming interface (API) unit 860, input unit 865, output unit 870, pairing unit 875, feature extraction unit 880, ranking unit 885, pair association unit 890, and inter-unit communication mechanism 895 for the different units to communicate with each other, with the OS, and with other applications (not shown).”)
extracting, by the one or more processing circuits, first inference features of the first data element and second inference features of the second data element; (Zhang, Page 18, Section A.1, ¶[1]: “In all our experiments under this setting, we extract frame-wise video feature using C3D model pre-trained on Sports-1M dataset, with the temporal stride of 16.”, Page 18, Section A.1, ¶[3]: “Word Features. In the retrieval related experiments, we always use GloVE features [30] for the initialization of the word embedding and fine-tune. Specifically, we use the GloVE vectors pre-trained on 840B common web-crawled data, with its dimensionality equals to 300.”) [Examiner’s note: first inference features i.e., the word features of the text data, second inference features i.e., the video features]
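The word-feature initialization quoted above (pre-trained 300-dimensional GloVe vectors, one per word) can be sketched as a simple lookup. The tiny vocabulary and vectors here are synthetic stand-ins, not the actual GloVe data.

import numpy as np

rng = np.random.default_rng(7)
vocab = ["a", "dog", "runs"]
glove_like = {w: rng.normal(size=300) for w in vocab}  # 300-d, as in the quote

sentence = "a dog runs".split()
word_feats = np.stack([glove_like[w] for w in sentence])
print(word_feats.shape)  # (3, 300): one feature vector per word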
generating, by the one or more processing circuits, one or more merged features by combining one or more of the first inference features with one or more of the second inference features, (Zhang, Page 12, Section 4.3: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator.”) [Examiner’s note: merged features is being interpreted as the learned embeddings for video captioning]
wherein each of the one or more of the first inference features is a particular common feature to one of the one or more of the second inference features; (Zhang, Page 8, Section 4.1: “Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips.”, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips) are mapped into a local embedding space where the corresponding pairs of clips and sentences are placed close to each other. As a whole, the videos and the paragraphs are mapped into a global semantic space where their embeddings are close.
[image reproduced in original: media_image1.png, greyscale]
”) [Examiner’s note: pairs of clips and sentences in red, blue, green share a common feature i.e., a corresponding paragraph with sentences aligned to the video clips]
identifying, by the one or more processing circuits, unique first classification features of the first subset of features unique to the first data type; (Zhang, Page 4, Section 3.1: “Likewise, we assume there is a paragraph of texts describing the video. The paragraph p contains n sentences, one for each video clip. Let si denote the ith sentence and wij the feature for the jth word out of n’i words.”) (Examiner’s note: first classification features is being interpreted as feature wij for the jth word of multiple sentences (i.e., first data elements as shown above) for text data)
identifying, by the one or more processing circuits, unique second classification features of the second subset of features unique to the second data type; and (Zhang, Page 4, Section 3.1: “A video v has n clips (or subshots), where each clip ci contains ni frames. Each frame is represented by a visual feature vector xij .”) (Examiner’s note: second classification features is being interpreted as a visual feature vector xij of multiple clips (i.e., second data elements as shown above) for video data)
generating, by the one or more processing circuits, a model output of the model by applying the one or more merged features, the unique first classification features, the unique second classification features as inputs to the model. (Leonardo, Page 2029, Col. 1, Section V.A.1, ¶[2]: “Once the visual and textual intermediate features are extracted and normalized to [0, 1], they are passed into an image and text autoencoder, respectively.”, Page 2029, Col. 2, ¶[2]: “for each web page with one textual feature t and N visual features vi (where N varies for each web page), new data is created by concatenating t and vi for 1 ≤ i ≤ N. These concatenated features are passed into a random forest classifier, and hyperparameters are selected as described in Algorithms 1 and 2.”, Page 2029, Col. 2, ¶[4]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text. The objective now is to make the CNN learn to predict the safety of image-text pair. The CNN is trained on these random pairings and is tested on the page-level by combining predictions with OR Logic.”) [Examiner’s note: a model output i.e., the image-text pair safety prediction, merged feature i.e., the concatenated textual feature t and visual features vi, first classification feature i.e., the textual feature, second classification feature i.e., the visual feature, the model i.e., the CNN model]
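The page-level aggregation quoted above (per-pair predictions combined with OR Logic) can be sketched as follows. The pair-level classifier is a hypothetical stand-in for the trained CNN, and all data is synthetic.

import numpy as np

rng = np.random.default_rng(5)

def pair_unsafe(text_feat, image_feat):
    # Stand-in for the trained CNN's unsafe/safe decision on one pair.
    return float(np.concatenate([text_feat, image_feat]).mean()) > 0.1

text_feat = rng.normal(size=10)         # one textual feature for the page
image_feats = rng.normal(size=(4, 10))  # N visual features for the page

# OR logic: the page is a threat if any (text, image) pair is predicted unsafe.
page_unsafe = any(pair_unsafe(text_feat, v) for v in image_feats)
print("threat" if page_unsafe else "safe")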
As per claim 8, the combination of Zhang, Kennedy and Leonardo discloses all the limitations of claim 1 (as shown in the rejections above).
Zhang in view of Kennedy and Leonardo further discloses:
wherein the first data type is a text based data type and the second data type is at least one of an image data type or a video data type. (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.
[image reproduced in original: media_image1.png, greyscale]
”)
As per claim 9, the combination of Zhang, Kennedy and Leonardo discloses all the limitations of claim 8 (as shown in the rejections above).
Zhang in view of Kennedy and Leonardo further discloses:
wherein at least a first portion of the plurality of first data elements are associated with first data element labels and at least a second portion of the plurality of second data elements are associated with second data element labels, (Leonardo, Page 2029, Col. 1, Section B: “Each of these classifiers is trained to identify if xwp should be labeled as TCi or not, using yi, i = 0, 1, . . . , 9 as the corresponding ground truth for the respective TC.”, Page 2028, Col. 1, Section IV.A.1: “The results from any of the three methods arrive at xwp, a vector that’s comprised of visual and textual features from a web page, while y represents xwp’s ground truth.”, Page 2027, Col. 2, Section B: “Our binary classification algorithms classify every web page as either safe or threat. For multilabel classification, there are 10 target labels that can be assigned to every web page. Some web pages can be classified in multiple threat categories as well. Thus, a binary classifier is trained to identify each threat category (TC), shown in Table I.
[image reproduced in original: media_image2.png, greyscale]
”) [Examiner’s note: xwp, a vector that’s comprised of visual and textual features from a web page (i.e., first and second portion of the plurality of first data elements) is classified with labels TCi with i = 0, 1, … 9]
wherein a first number of the first data element labels is greater than a second number of the second data element labels; (Leonardo, Page 2027, Col. 2, Section B, ¶[2]: “The training and testing datasets consist of 2643 and 620 web pages respectively, which were annotated by an independent company. Out of the web pages in the training data, 1653 are labeled as safe and 990 as threat. If a web page is a threat, it can belong to multiple TCs.”) [Examiner’s note: the first data element label i.e., safe, the second data element label i.e., threat]
wherein training, by the one or more processing circuits, the model is further based on the first data element labels and the second data element labels. (Leonardo, Page 2029, Col. 2, Section 2, ¶[2]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text. The objective now is to make the CNN learn to predict the safety of image-text pair. The CNN is trained on these random pairings and is tested on the page-level by combining predictions with OR Logic.”) [Examiner’s note: the first data element label i.e., safe, the second data element label i.e., threat or unsafe ]
As per claim 10, the combination of Zhang, Kennedy and Leonardo discloses all the limitations of claim 8 (as shown in the rejections above).
Zhang in view of Kennedy and Leonardo further discloses:
wherein at least a first portion of the plurality of first data elements are associated with first data element labels and none of the plurality of second data elements are associated with second data element labels; (Leonardo, Page 2027, Col. 2, Section B, ¶[2]: “The training and testing datasets consist of 2643 and 620 web pages respectively, which were annotated by an independent company. Out of the web pages in the training data, 1653 are labeled as safe and 990 as threat. If a web page is a threat, it can belong to multiple TCs.”, Page 2029, Col. 2, Section 2, ¶[2]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text.) [Examiner’s note: the first data element label i.e., safe, the second data element label i.e., threat or unsafe, first portion of first data elements i.e., text, second portion of second data elements i.e., images. “safe images with safe text” is being interpreted as image data is not associated with threat or unsafe label element.]
wherein training, by the one or more processing circuits, the model is further based on the first data element labels. (Leonardo, Page 2029, Col. 2, Section 2, ¶[2]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text. The objective now is to make the CNN learn to predict the safety of image-text pair. The CNN is trained on these random pairings and is tested on the page-level by combining predictions with OR Logic.”) [Examiner’s note: first data element label i.e., the safe label in “safe images with safe text” data is used to train the CNN model]
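The claim-10 scenario mapped above, in which only the first data elements carry labels, can be sketched as follows; each merged pair inherits the text-side label for training. All data is synthetic, the label-inheritance scheme is an assumption for illustration only, and scikit-learn is assumed.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
text_feats = rng.normal(size=(100, 10))
image_feats = rng.normal(size=(100, 10))
text_labels = rng.integers(0, 2, size=100)  # labels exist for text only

# Train on the fused features using the text-side labels alone.
X = np.hstack([text_feats, image_feats])
model = LogisticRegression(max_iter=1000).fit(X, text_labels)
print(model.score(X, text_labels))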
As per claim 13, the combination of Zhang and Kennedy discloses all the limitations of claim 11 (as shown in the rejections below).
Zhang in view of Kennedy further discloses:
wherein the instructions cause the one or more processors to identify the first features and the second features comprise (Zhang, Page 4, Section 3.1: “Likewise, we assume there is a paragraph of texts describing the video. The paragraph p contains n sentences, one for each video clip. Let si denote the ith sentence and wij the feature for the jth word out of n’i words.”, Zhang, Page 4, Section 3.1: “A video v has n clips (or subshots), where each clip ci contains ni frames. Each frame is represented by a visual feature vector xij .”) (Examiner’s note: features of first data elements is being interpreted as feature wij for the jth word of multiple sentences (i.e., first data elements as shown above) for text data, features of second data elements is being interpreted as a visual feature vector xij of multiple clips (i.e., second data elements as shown above) for video data)
Zhang in view of Kennedy fails to disclose:
applying one or more models to the plurality of first data elements and the plurality of second data elements,
wherein the one or more models extract the first features from the plurality of first data elements and extract the second features from the plurality of second data elements.
However, Leonardo explicitly discloses:
applying one or more models to the plurality of first data elements and the plurality of second data elements, (Leonardo, Page 2026, Col. 2, ¶[2]: “In early fusion, intermediate features from separate modalities are extracted and jointly represented, then learned with one single model. Given separate, pre-trained computer vision (CV) and natural language processing (NLP) models, our fully automated framework uses late fusion to classify web pages as either safe or threat, as well as into 10 possible threat categories.”, Leonardo, Page 2027, Col. 1, Section III.A, ¶[1-2]: “We are analyzing two modalities present in a web page: visual signals and natural language. One classifier for each modality was trained to determine content safety. For the CV model, a Squeeze-and-Excitation (SE) Network [9] was trained using initial weights from ImageNet’s image classification task [10] and fine-tuned on web page images. For the NLP model, a Universal Language Model Finetuning Framework (ULMFiT) [11] was trained and fine-tuned.”) [Examiner’s note: “one or more models” is being interpreted as the pre-trained computer vision (CV) and natural language processing (NLP) models, first data elements i.e., natural language of web pages, second data elements i.e., visual signals or images of web pages]
wherein the one or more models extract the first features from the plurality of first data elements and extract the second features from the plurality of second data elements. (Leonardo, Page 2029, Col. 2, Section 2: “CNNs are used extensively for difficult classification tasks due to their ability to find patterns in the input data with high accuracy [22]. They are especially useful on spatial input data in which the output depends on the position of each individual feature. For this work, CNNs are used on the intermediate features extracted from the CV model and the NLP model to perform binary classification.”) [Examiner’s note: “one or more models” is being interpreted as the pre-trained computer vision (CV) and natural language processing (NLP) models, first and second features of first and second data elements i.e., the intermediate features of text data and image data]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Kennedy and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Kennedy teaches a system and method for associating textual summaries with content media. Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve predictions of content safety. One of ordinary skill would have been motivated to combine Zhang, Kennedy and Leonardo to improve the classification of data by using multimodal machine learning to fuse various modalities, which helps to enhance the accuracy of the classification model (Leonardo, Page 2027, Col. 1, Section II).
As per claim 14, the combination of Zhang and Kennedy discloses all the limitations of claim 11 (as shown in the rejections below).
Zhang in view of Kennedy fails to disclose:
wherein combining the first feature with the second feature comprises performing an operation on a first value of the first feature representing a first confidence of the first feature with a second value of the second feature representing a second confidence of the second feature.
However, Leonardo explicitly discloses:
wherein combining the first feature with the second feature comprises performing an operation on a first value of the first feature representing a first confidence of the first feature with a second value of the second feature representing a second confidence of the second feature. (Leonardo, Page 2028, Col. 1, Section IV.A.1: “Since unique web pages in the dataset are often associated with multiple images, the visual features of these images are combined first by choosing the minimum value of x0 (confidence score for TC0) and maximum value for xi (confidence score for threat TCi, i = 1, 2, . . . , 9), componentwise across all the images corresponding to the same web page. The method used for merging different x’s using the aforementioned criteria is named minmax.… Now, a single 10-dimensional vector represents the features from all the images in a web page and another 10-dimensional vector represents the textual features. In order to merge both and train them altogether, we can average each component, concatenate them both or use the minmax method. The results from any of the three methods arrive at xwp, a vector that’s comprised of visual and textual features from a web page, while y represents xwp’s ground truth.”) [Examiner’s note: an operation i.e., the minmax method, first value representing a first confidence i.e., the confidence score for visual features, the second value representing a second confidence i.e., the confidence score for textual features]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Kennedy and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Kennedy teaches a system and method for associating textual summaries with content media. Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve predictions of content safety. One of ordinary skill would have been motivated to combine Zhang, Kennedy and Leonardo to improve the classification of data by using multimodal machine learning to fuse various modalities, which helps to enhance the accuracy of the classification model (Leonardo, Page 2027, Col. 1, Section II).
As per claim 15, the combination of Zhang and Kennedy discloses all the limitations of Claim 11 (as shown in the rejections below).
Zhang in view of Kennedy further discloses:
wherein the instructions cause the one or more processors to: receive a data element comprising a first data element of the first data type and a second data element of the second data type; (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.
[image reproduced in original: media_image1.png, greyscale]
”) [Examiner’s note: a first data element i.e., the multiple sentences of text data, a second data element i.e., the multiple clips of video data]
extract first inference features of the first data element and second inference features of the second data element; (Zhang, Page 18, Section A.1, ¶[1]: “In all our experiments under this setting, we extract frame-wise video feature using C3D model pre-trained on Sports-1M dataset, with the temporal stride of 16.”, Page 18, Section A.1, ¶[3]: “Word Features. In the retrieval related experiments, we always use GloVE features [30] for the initialization of the word embedding and fine-tune. Specifically, we use the GloVE vectors pre-trained on 840B common web-crawled data, with its dimensionality equals to 300.”) [Examiner’s note: first inference features i.e., the word features of the text data, second inference features i.e., the video features]
generate one or more merged features by combining one or more of the first inference features with one or more of the second inference features, (Zhang, Page 12, Section 4.3: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator.”) [Examiner’s note: merged features is being interpreted as the learned embeddings for video captioning]
wherein each of the one or more of the first inference features is a particular common feature to one of the one or more of the second inference features; (Zhang, Page 8, Section 4.1: “Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips.”, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips) are mapped into a local embedding space where the corresponding pairs of clips and sentences are placed close to each other. As a whole, the videos and the paragraphs are mapped into a global semantic space where their embeddings are close.
[image reproduced in original: media_image1.png, greyscale]
”) [Examiner’s note: pairs of clips and sentences in red, blue, green share a common feature i.e., a corresponding paragraph with sentences aligned to the video clips]
identify unique first classification features of the first subset of features unique to the first data type; (Zhang, Page 4, Section 3.1: “Likewise, we assume there is a paragraph of texts describing the video. The paragraph p contains n sentences, one for each video clip. Let si denote the ith sentence and wij the feature for the jth word out of n’i words.”) (Examiner’s note: first classification features is being interpreted as feature wij for the jth word of multiple sentences (i.e., first data elements as shown above) for text data)
identify unique second classification features of the second subset of features unique to the second data type; and (Zhang, Page 4, Section 3.1: “A video v has n clips (or subshots), where each clip ci contains ni frames. Each frame is represented by a visual feature vector xij .”) (Examiner’s note: second classification features is being interpreted as a visual feature vector xij of multiple clips (i.e., second data elements as shown above) for video data)
Zhang in view of Kennedy fails to disclose:
generate a model output of the model by applying the one or more merged features, the unique first classification features, the unique second classification features as inputs to the model.
However, Leonardo explicitly discloses:
generate a model output of the model by applying the one or more merged features, the unique first classification features, the unique second classification features as inputs to the model. (Leonardo, Page 2029, Col. 1, Section V.A.1, ¶[2]: “Once the visual and textual intermediate features are extracted and normalized to [0, 1], they are passed into an image and text autoencoder, respectively.”, Page 2029, Col. 2, ¶[2]: “for each web page with one textual feature t and N visual features vi (where N varies for each web page), new data is created by concatenating t and vi for 1 ≤ i ≤ N. These concatenated features are passed into a random forest classifier, and hyperparameters are selected as described in Algorithms 1 and 2.”, Page 2029, Col. 2, ¶[4]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text. The objective now is to make the CNN learn to predict the safety of image-text pair. The CNN is trained on these random pairings and is tested on the page-level by combining predictions with OR Logic.”) [Examiner’s note: a model output i.e., the image-text pair safety prediction, merged feature i.e., the concatenated textual feature t and visual features vi, first classification feature i.e., the textual feature, second classification feature i.e., the visual feature, the model i.e., the CNN model]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Kennedy and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Kennedy teaches a system and method for associating textual summaries with content media. Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve predictions of content safety. One of ordinary skill would have been motivated to combine Zhang, Kennedy and Leonardo to improve the classification of data by using multimodal machine learning to fuse various modalities, which helps to enhance the accuracy of the classification model (Leonardo, Page 2027, Col. 1, Section II).
As per claim 16, the combination of Zhang, Kennedy and Leonardo discloses all the limitations of claim 15 (as shown in the rejections above).
Zhang in view of Kennedy and Leonardo further discloses:
wherein the data element is a content item comprising multiple content types, wherein the first data element is text data while the second data element is at least one of image data or video data. (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.
[image reproduced in original: media_image1.png, greyscale]
”) [Examiner’s note: a first data element i.e., the multiple sentences of text data, a second data element i.e., the multiple clips of video data]
As per claim 18, the combination of Zhang and Kennedy discloses all the limitations of claim 17 (as shown in the rejections below).
Zhang in view of Kennedy fails to disclose:
wherein at least a first portion of the plurality of first data elements are associated with first data element labels and at least a second portion of the plurality of second data elements are associated with second data element labels,
wherein a first number of the first data element labels is greater than a second number of the second data element labels;
wherein the instructions cause the one or more processors to train the model further based on the first data element labels and the second data element labels
However, Leonardo explicitly discloses:
wherein at least a first portion of the plurality of first data elements are associated with first data element labels and at least a second portion of the plurality of second data elements are associated with second data element labels, (Leonardo, Page 2029, Col. 1, Section B: “Each of these classifiers is trained to identify if xwp should be labeled as TCi or not, using yi, i = 0, 1, . . . , 9 as the corresponding ground truth for the respective TC.”, Page 2028, Col. 1, Section IV.A.1: “The results from any of the three methods arrive at xwp, a vector that’s comprised of visual and textual features from a web page, while y represents xwp’s ground truth.”, Page 2027, Col. 2, Section B: “Our binary classification algorithms classify every web page as either safe or threat. For multilabel classification, there are 10 target labels that can be assigned to every web page. Some web pages can be classified in multiple threat categories as well. Thus, a binary classifier is trained to identify each threat category (TC), shown in Table I.
[image reproduced in original: media_image2.png, greyscale]
”) [Examiner’s note: xwp, a vector that’s comprised of visual and textual features from a web page (i.e., first and second portion of the plurality of first data elements) is classified with labels TCi with i = 0, 1, … 9]
wherein a first number of the first data element labels is greater than a second number of the second data element labels; (Leonardo, Page 2027, Col. 2, Section B, ¶[2]: “The training and testing datasets consist of 2643 and 620 web pages respectively, which were annotated by an independent company. Out of the web pages in the training data, 1653 are labeled as safe and 990 as threat. If a web page is a threat, it can belong to multiple TCs.”) [Examiner’s note: the first data element label i.e., safe, the second data element label i.e., threat]
wherein the instructions cause the one or more processors to train the model further based on the first data element labels and the second data element labels. (Leonardo, Page 2029, Col. 2, Section 2, ¶[2]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text. The objective now is to make the CNN learn to predict the safety of image-text pair. The CNN is trained on these random pairings and is tested on the page-level by combining predictions with OR Logic.”) [Examiner’s note: the first data element label i.e., safe, the second data element label i.e., threat or unsafe ]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Kennedy and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Kennedy teaches a system and method for associating textual summaries with content media. Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve predictions of content safety. One of ordinary skill would have been motivated to combine Zhang, Kennedy and Leonardo to improve the classification of data by using multimodal machine learning to fuse various modalities, which helps to enhance the accuracy of the classification model (Leonardo, Page 2027, Col. 1, Section II).
As per claim 19, the combination of Zhang and Kennedy discloses all the limitations of Claim 17 (as shown in the rejections below).
Zhang in view of Kennedy fails to disclose:
wherein at least a first portion of the plurality of first data elements are associated with first data element labels and none of the plurality of second data elements are associated with second data element labels;
wherein the instructions cause the one or more processors to train the model further based on the first data element labels.
However, Leonardo explicitly discloses:
wherein at least a first portion of the plurality of first data elements are associated with first data element labels and none of the plurality of second data elements are associated with second data element labels; (Leonardo, Page 2027, Col. 2, Section B, ¶[2]: “The training and testing datasets consist of 2643 and 620 web pages respectively, which were annotated by an independent company. Out of the web pages in the training data, 1653 are labeled as safe and 990 as threat. If a web page is a threat, it can belong to multiple TCs.”, Page 2029, Col. 2, Section 2, ¶[2]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text.) [Examiner’s note: the first data element label i.e., safe, the second data element label i.e., threat or unsafe, first portion of first data elements i.e., text, second portion of second data elements i.e., images. “safe images with safe text” is being interpreted as image data is not associated with threat or unsafe label element.]
wherein the instructions cause the one or more processors to train the model further based on the first data element labels. (Leonardo, Page 2029, Col. 2, Section 2, ¶[2]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text. The objective now is to make the CNN learn to predict the safety of image-text pair. The CNN is trained on these random pairings and is tested on the page-level by combining predictions with OR Logic.”) [Examiner’s note: first data element label i.e., the safe label in “safe images with safe text” data is used to train the CNN model]
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Kennedy and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Kennedy teaches a system and method for associating textual summaries with content media. Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve predictions of content safety. One of ordinary skill would have been motivated to combine Zhang, Kennedy and Leonardo to improve the classification of data by using multimodal machine learning to fuse various modalities, which helps to enhance the accuracy of the classification model (Leonardo, Page 2027, Col. 1, Section II).
Claim(s) 11, 12, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (“Cross-Modal and Hierarchical Modeling of Video and Text”) (hereafter referred to as “Zhang”) in view of Kennedy et al. (US 10,789,284 B2) (hereafter referred to as “Kennedy”).
As per claim 11, Zhang explicitly discloses:
receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type; (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.
[image reproduced in original: media_image1.png, greyscale]
”) [Examiner’s note: first data elements i.e., multiple sentences of text data, second data elements i.e., multiple clips of video data]
identify first features of each of the plurality of first data elements; (Zhang, Page 4, Section 3.1: “Likewise, we assume there is a paragraph of texts describing the video. The paragraph p contains n sentences, one for each video clip. Let si denote the ith sentence and wij the feature for the jth word out of n’i words.”) (Examiner’s note: features of first data elements is being interpreted as feature wij for the jth word of multiple sentences (i.e., first data elements as shown above) for text data)
identify second features of each of the plurality of second data elements; (Zhang, Page 4, Section 3.1: “A video v has n clips (or subshots), where each clip ci contains ni frames. Each frame is represented by a visual feature vector xij .”) (Examiner’s note: features of second data elements is being interpreted as a visual feature vector xij of multiple clips (i.e., second data elements as shown above) for video data)
generate merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, (Zhang, Page 12, Section 4.3: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator.”) [Examiner’s note: merged features is being interpreted as the learned embeddings for video captioning]
wherein the first feature and the second feature each represent a common feature; and (Zhang, Page 8, Section 4.1: “Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips.”, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips) are mapped into a local embedding space where the corresponding pairs of clips and sentences are placed close to each other. As a whole, the videos and the paragraphs are mapped into a global semantic space where their embeddings are close.
[image reproduced in original: media_image1.png, greyscale]
”) [Examiner’s note: pairs of clips and sentences in red, blue, green share a common feature i.e., a corresponding paragraph with sentences aligned to the video clips]
Zhang fails to disclose:
A system including one or more memory devices configured to store instructions thereon, that, when executed by one or more processors, cause the one or more processors to
training, by the one or more processing circuits, a model based on the merged features and at least a first portion of the first features and a second portion of the second features,
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements
classify a content item based on the model, wherein the content item includes content text and at least one of a content image or a content video
However, Kennedy explicitly discloses
A system including one or more memory devices configured to store instructions thereon, that, when executed by one or more processors, cause the one or more processors to (Kennedy, Col. 15, Lines 48-55: “Computing device 805 in computing environment 800 can include one or more processing units, cores, or processors 810, memory 815 (e.g., RAM, ROM, and/or the like), internal storage 820 (e.g., magnetic, optical, solid state storage, and/or organic), and/or I/O interface 825, any of which can be coupled on a communication mechanism or bus 830 for communicating information or embedded in the computing device 805.”)
training, by the one or more processing circuits, a model based on the merged features and at least a first portion of the first features and a second portion of the second features, (Kennedy, Col. 5, Lines 30-47: “As illustrated in FIG. 1, user input is received by the system at 105. This user input may include a user's set of images and user-provided captions that the user would like illustrated. After the user input has been received, the system may generate caption-image pairs at 110 by pairing each provided caption with each possible image in the user provided image set. At 115, features may be extracted for both the captions and images to produce featuring embeddings associated with both the captions and the images. The feature extraction of 115 may be performed using a neural network that has been trained at 135 using training data. The training data may include user manually-generated image collection or photo books that have been captioned by users using photo album editing or photo book generation platforms, which may be web-based, mobile application based, or installed on the local machine. For each image/caption pair, the features (embeddings) may be concatenated together.”) [Examiner’s note: “the merged features” is being interpreted as the “image-caption pair data”, “a first portion of the first features and a second portion of the second features” is being interpreted as the “featuring embeddings associated with both the captions and the images”]
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements
(Kennedy, Col. 11, Lines 56-63: “In the process 500, a plurality of images or content representations may be clustered or divided into subsets of images or content representations based on content similarities between different images or content representations at 505. This may include extracting content features from the content representations or images to produce featuring embeddings. The feature extraction may be performed using a neural network that has been trained using training data.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Kennedy. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Kennedy teaches a system and method for associating textual summaries with content media. One of ordinary skill would have been motivated to combine Zhang and Kennedy to enable the model to learn joint representations that capture correlations across different data types (e.g., text and image), while the retention of unique feature subsets preserves modality-specific information that may otherwise be diluted or lost during fusion. This approach enhances learning capacity and flexibility, particularly in multi-modal systems where different data sources contribute complementary insights.
As per claim 12, the combination of Zhang and Kennedy discloses all the limitations of claim 11 (as shown in the rejections above).
Zhang in view of Kennedy further discloses:
wherein each of the plurality of first data elements is associated with one of the plurality of second data elements; (Zhang, Page 8, Section 4.1, ¶[1]: “Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips.”, Page 2, Figure 1:
[image reproduced in original: media_image1.png, greyscale]
”) [Examiner’s note: first data elements i.e., multiple sentences of text data, second data elements i.e., multiple clips of video data]
wherein the instructions cause the one or more processors to generate the merged features comprises combining the first feature of the first features of each of the plurality of first data elements with the second feature of the second features of the one of the plurality of second data elements that each of the plurality of first data elements is associated with. (Zhang, Page 12, Section 4.3: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator.”) [Examiner’s note: merged features is being interpreted as the learned embeddings for video captioning]
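For illustration only, the feature combination cited above from Zhang, Section 4.3 (concatenating a clip-level feature with a contextual video-level feature before a caption generator) can be sketched in Python as follows; the dimensions and variable names are assumptions, not code from Zhang.

import numpy as np

rng = np.random.default_rng(1)
clip_feature = rng.standard_normal(512)     # stand-in for a clip-level embedding
video_feature = rng.standard_normal(1024)   # stand-in for a contextual video-level embedding

# Concatenate the two levels into a single merged feature vector, which
# Zhang then feeds to a two-layer LSTM caption generator.
caption_input = np.concatenate([clip_feature, video_feature])
print(caption_input.shape)  # (1536,)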
As per claim 17, the combination of Zhang and Kennedy discloses all the limitations of claim 11 (as shown in the rejections above).
Zhang in view of Kennedy further discloses:
wherein the first data type is a text based data type and the second data type is at least one of an image data type or a video data type. (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.” [image: media_image1.png, greyscale])
As per claim 20, Zhang explicitly discloses:
receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type; (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.” [image: media_image1.png, greyscale]) [Examiner’s note: first data elements, i.e., multiple sentences of text data; second data elements, i.e., multiple clips of video data]
identify first features of each of the plurality of first data elements; (Zhang, Page 4, Section 3.1: “Likewise, we assume there is a paragraph of texts describing the video. The paragraph p contains n sentences, one for each video clip. Let si denote the ith sentence and wij the feature for the jth word out of n’i words.”) [Examiner’s note: the features of the first data elements are being interpreted as the feature wij for the jth word of the multiple sentences (i.e., the first data elements, as shown above) of text data]
identify second features of each of the plurality of second data elements; (Zhang, Page 4, Section 3.1: “A video v has n clips (or subshots), where each clip ci contains ni frames. Each frame is represented by a visual feature vector xij.”) [Examiner’s note: the features of the second data elements are being interpreted as the visual feature vector xij of the multiple clips (i.e., the second data elements, as shown above) of video data]
generate merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, (Zhang, Page 12, Section 4.3: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator.”) [Examiner’s note: merged features is being interpreted as the learned embeddings for video captioning]
wherein the first feature and the second feature each represent a common feature; and (Zhang, Page 8, Section 4.1: “Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips.”, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips) are mapped into a local embedding space where the corresponding pairs of clips and sentences are placed close to each other. As a whole, the videos and the paragraphs are mapped into a global semantic space where their embeddings are close.” [image: media_image1.png, greyscale]) [Examiner’s note: pairs of clips and sentences in red, blue, and green share a common feature, i.e., a corresponding paragraph with sentences aligned to the video clips]
Zhang fails to disclose:
One or more computer readable storage media configured to store instructions thereon that, when executed by one or more processors, cause the one or more processors to:
training, by the one or more processing circuits, a model based on the merged features and at least a first portion of the first features and a second portion of the second features,
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements
However, Kennedy explicitly discloses:
by one or more processing circuits (Kennedy, Col. 16, Lines 59-67: “Processor(s) 810 can execute under any operating system (OS) (not shown), in a native or virtual environment. One or more applications can be deployed that include logic unit 855, application programming interface (API) unit 860, input unit 865, output unit 870, pairing unit 875, feature extraction unit 880, ranking unit 885, pair association unit 890, and inter-unit communication mechanism 895 for the different units to communicate with each other, with the OS, and with other applications (not shown).”)
training, by the one or more processing circuits, a model based on the merged features and at least a first portion of the first features and a second portion of the second features, (Kennedy, Col. 5, Lines 30-47: “As illustrated in FIG. 1, user input is received by the system at 105. This user input may include a user's set of images and user-provided captions that the user would like illustrated. After the user input has been received, the system may generate caption-image pairs at 110 by pairing each provided caption with each possible image in the user provided image set. At 115, features may be extracted for both the captions and images to produce featuring embeddings associated with both the captions and the images. The feature extraction of 115 may be performed using a neural network that has been trained at 135 using training data. The training data may include user manually-generated image collection or photo books that have been captioned by users using photo album editing or photo book generation platforms, which may be web-based, mobile application based, or installed on the local machine. For each image/caption pair, the features (embeddings) may be concatenated together.”) [Examiner’s note: “the merged features” is being interpreted as the “image-caption pair data”, “a first portion of the first features and a second portion of the second features” is being interpreted as the “featuring embeddings associated with both the captions and the images”]
wherein the first portion of the first features is a first subset of features unique to the first data type of the plurality of first data elements and the second portion of the second features is a second subset of features unique to the second data type of the plurality of second data elements (Kennedy, Col. 11, Lines 56-63: “In the process 500, a plurality of images or content representations may be clustered or divided into subsets of images or content representations based on content similarities between different images or content representations at 505. This may include extracting content features from the content representations or images to produce featuring embeddings. The feature extraction may be performed using a neural network that has been trained using training data.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Kennedy. Zhang teaches a novel cross-modal learning approach that models videos and texts jointly. Kennedy teaches a system and method for associating textual summaries with content media. One of ordinary skill would have been motivated to combine Zhang and Kennedy to enable the model to learn joint representations that capture correlations across different data types (e.g., text and image), while the retention of unique feature subsets preserves modality-specific information that might otherwise be diluted or lost during fusion. This approach enhances learning capacity and flexibility, particularly in multi-modal systems where different data sources contribute complementary insights.
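For illustration only, the claimed training arrangement as mapped above (a model trained on merged features together with modality-unique feature subsets) can be sketched in Python as follows. This is a minimal, hypothetical sketch; the data, dimensions, and the least-squares classifier used as a stand-in for model training are assumptions, not code from Zhang, Kennedy, or the present application.

import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(2)
n = 100
merged = rng.standard_normal((n, 64))        # stand-in for fused text+image features
text_unique = rng.standard_normal((n, 16))   # subset unique to the text modality
image_unique = rng.standard_normal((n, 16))  # subset unique to the image modality
labels = rng.integers(0, 2, size=n).astype(float)

# Train on the concatenation of the merged features with both
# modality-unique feature subsets.
X = np.hstack([merged, text_unique, image_unique])
w, *_ = lstsq(X, labels, rcond=None)         # least-squares stand-in for training
preds = (X @ w) > 0.5
print("training accuracy:", (preds == labels.astype(bool)).mean())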
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMY TRAN whose telephone number is (571)270-0693. The examiner can normally be reached Monday - Friday 7:30 am - 5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AMY TRAN/Examiner, Art Unit 2126
/DAVID YI/Supervisory Patent Examiner, Art Unit 2126