Prosecution Insights
Last updated: April 19, 2026
Application No. 18/394,615

Systems and Methods for Model Training Based on Feature Fusion of Multiple Data Types

Non-Final OA: §101, §103, Double Patenting
Filed
Dec 22, 2023
Examiner
TRAN, AMY NMN
Art Unit
2126
Tech Center
2100 — Computer Architecture & Software
Assignee
Google LLC
OA Round
1 (Non-Final)
Grant Probability: 36% (At Risk)
OA Rounds: 1-2
To Grant: 5y 2m
With Interview: 84%

Examiner Intelligence

Career Allow Rate: 36% (10 granted / 28 resolved; -19.3% vs TC avg)
Interview Lift: +47.9% in resolved cases with interview
Typical Timeline: 5y 2m avg prosecution; 24 currently pending
Career History: 52 total applications across all art units
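As a sanity check, the headline numbers above are internally consistent: 10 grants out of 28 resolved cases rounds to the 36% career allow rate, and adding the +47.9% interview lift (reported in percentage points) to that baseline reproduces the 84% with-interview figure. A minimal sketch; the variable names are mine, not from the dashboard:

```python
# Reproduce the dashboard's examiner statistics from the underlying counts.
granted, resolved = 10, 28

# Career allow rate: 10 / 28 ≈ 35.7%, displayed as 36%.
allow_rate = 100 * granted / resolved

# Interview lift is read here as percentage points added to the baseline:
# 35.7% + 47.9% ≈ 83.6%, displayed as 84% "With Interview".
interview_lift = 47.9
with_interview = allow_rate + interview_lift

print(round(allow_rate))      # 36
print(round(with_interview))  # 84
```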

Statute-Specific Performance

§101: 32.5% (-7.5% vs TC avg)
§103: 44.2% (+4.2% vs TC avg)
§102: 6.0% (-34.0% vs TC avg)
§112: 15.6% (-24.4% vs TC avg)

Tech Center averages are estimates based on career data from 28 resolved cases.
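The per-statute deltas imply a single Tech Center baseline: each reported rate minus its delta recovers the same 40.0% average, which is presumably the Tech Center average estimate the chart plots. A quick check (the dictionary structure is mine, built from the figures above):

```python
# (examiner rate %, delta vs Tech Center average %) per statute, as reported.
stats = {
    "101": (32.5, -7.5),
    "103": (44.2, +4.2),
    "102": (6.0, -34.0),
    "112": (15.6, -24.4),
}

# Recovering the implied TC average: rate - delta agrees across all statutes.
for statute, (rate, delta) in stats.items():
    tc_avg = rate - delta
    print(f"Section {statute}: implied TC avg = {tc_avg:.1f}%")  # 40.0% each
```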

Office Action

Grounds of rejection: §101, §103, nonstatutory double patenting
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements (IDSs) submitted on 12/22/2023 and 09/20/2024 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Double Patenting

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the "right to exclude" granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159.
See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c). A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13.

The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.

Claims 1-20 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-15 and 17-19 of copending Application No. 17/297,839 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because the copending '839 Application discloses all of the limitations of the instant claims as shown below. This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Regarding claim 1, claim 1 of the '839 Application discloses all of the limitations of claim 1, as shown in the following limitation-by-limitation comparison (limitations that are word-for-word identical in both claims are marked as such):

Instant Application Claim 1 vs. '839 Application Claim 1

1. "A method comprising: receiving, by one or more processing circuits, a plurality of first data elements of a first data type and a plurality of second data elements of a second data type, wherein the first data type is text data and the second data type is at least one of image data or video data" (identical in both claims)
2. "identifying, by the one or more processing circuits, first features of each of the plurality of first data elements;" (identical in both claims)
3. "identifying, by the one or more processing circuits, second features of each of the plurality of second data elements;" (identical in both claims)
4. "generating, by the one or more processing circuits, merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, wherein the first feature and the second feature each represent a common feature;" (identical in both claims)
5. Instant Application: "and generating, by the one or more processing circuits, a model based on the common features and at least a portion of the first features and the second features."
   '839 Application: "training, by the one or more processing circuits, a model based on the merged features and at least a portion of the first features and the second features;"

Dependent claims on claim 1 (Instant Application claim → '839 Application claim): 2→2, 3→3, 4→4, 5→5, 6→6, 7→7, 8→8, 9→9, 10→10.

Regarding claim 11, claim 11 of the '839 Application discloses all of the limitations of claim 11, as shown in the following comparison:

Instant Application Claim 11 vs. '839 Application Claim 11

1. Instant Application: "A system comprising one or more memory devices storing instructions that, when executed by one or more processors, cause the one or more processors to:"
   '839 Application: "A system including one or more memory devices configured to store instructions thereon, that, when executed by one or more processors, cause the one or more processors to:"
2. "receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type;" (identical in both claims)
3. "identify first features of each of the plurality of first data elements;" (identical in both claims)
4. Instant Application: "identify second features of each of the plurality of second data elements;"
   '839 Application: "identify features of each of the plurality of second data elements;"
5. "generate merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, wherein the first feature and the second feature each represent a common feature;" (identical in both claims)
6. Instant Application: "and generate a model based on the common features and at least a portion of the first features and the second features."
   '839 Application: "and train a model based on the merged features and at least a portion of the first features and the second features."

Dependent claims on claim 11 (Instant Application claim → '839 Application claim): 12→12, 13→13, 14→4, 15→14, 16→6, 17→15, 18→17, 19→18, 20→19.

Claim Rejections - 35 U.S.C. § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. § 101 as being directed to an abstract idea without significantly more.

Regarding claim 1:

Step 1 – Is the claim to a process, machine, manufacture, or composition of matter? Yes, the claim is a process.

Step 2A, Prong 1 – Does the claim recite an abstract idea, law of nature, or natural phenomenon? Yes, the claim recites an abstract idea:

"identifying, [by the one or more processing circuits], first features of each of the plurality of first data elements;" – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III.C).

"identifying, [by the one or more processing circuits], second features of each of the plurality of second data elements;" – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III.C).
"generating, [by the one or more processing circuits], merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, wherein the first feature and the second feature each represent a common feature;" – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III.C).

Step 2A, Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application? No, there are no additional elements that integrate the judicial exception into a practical application. The additional elements:

"by one or more processing circuits" – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).

"receiving, [by one or more processing circuits], a plurality of first data elements of a first data type and a plurality of second data elements of a second data type, wherein the first data type is text data and the second data type is at least one of image data or video data;" – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

"generating, [by the one or more processing circuits], a model based on the common features and at least a portion of the first features and the second features" – Adding the words "apply it" (or an equivalent) to the judicial exception, or mere instructions to implement an abstract idea on a computer, merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)) and therefore fails to integrate the exception into a practical application.

Step 2B – Does the claim recite additional elements that amount to significantly more than the judicial exception? No, there are no additional elements that amount to significantly more than the judicial exception. The additional elements are the same elements identified under Step 2A, Prong 2, and for the same reasons they do not amount to significantly more than the judicial exception.

Regarding claim 2: Claim 2 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 1, which includes an abstract idea (see the rejection of claim 1). The additional limitations:

"wherein each of the plurality of first data elements is associated with one of the plurality of second data elements;" – This merely recites a further limitation on the receiving limitation of claim 1, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
"generating the merged features comprises combining the first feature of the first features of each of the plurality of first data elements with the second feature of the second features of the one of the plurality of second data elements that each of the plurality of first data elements is associated with." – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III.C).

Regarding claim 3: Claim 3 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 1, which includes an abstract idea (see the rejection of claim 1). The additional limitations:

"wherein identifying the first features and identifying the second features includes applying one or more models to the plurality of first data elements and the plurality of second data elements" – Adding the words "apply it" (or an equivalent) to the judicial exception, or mere instructions to implement an abstract idea on a computer, merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)) and therefore fails to integrate the exception into a practical application.

"the one or more models extract the first features from the plurality of first data elements and extract the second features from the plurality of second data elements." – This limitation is directed to insignificant extra-solution activity – mere data outputting (see MPEP 2106.05(g)).

Regarding claim 4: Claim 4 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 3, which includes an abstract idea (see the rejection of claim 3). The additional limitations:

"wherein the one or more models include at least one of an image embedding model, a video embedding model, an object recognition model, an audio translation model, and an optical character recognition model." – This merely recites a further limitation on the model-generating limitation of claim 1, which was directed to mere instructions to "apply" the abstract idea on a computer (see MPEP 2106.05(f)) and therefore fails to integrate the exception into a practical application.

Regarding claim 5: Claim 5 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 1, which includes an abstract idea (see the rejection of claim 1). The additional limitations:

"wherein combining the first feature with the second feature comprises performing an operation on (i) a first value of the first feature representing a first confidence of the first feature and (ii) a second value of the second feature representing a second confidence of the second feature." – This limitation is directed to a mathematical calculation (see MPEP 2106.04(a)(2) I.C), as it combines the first feature with the second feature using a mathematical operation, such as summation, subtraction, multiplication, averaging, or determining a median (see Instant Specification ¶[0039]).

Regarding claim 6: Claim 6 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 5, which includes an abstract idea (see the rejection of claim 5).
The additional limitations recite that the operation is at least one of:

- a maximum operation that selects a maximum of the first value and the second value;
- a summation operation that sums the first value and the second value;
- a median operation that determines a median of the first value and the second value; and
- a minimum operation that selects a minimum of the first value and the second value.

Each of these merely recites a further limitation on the combining operation of claim 5, which was directed to a mathematical calculation (see MPEP 2106.04(a)(2) I.C), as it combines the first feature with the second feature using a mathematical operation, such as summation, subtraction, multiplication, averaging, or determining a median (see Instant Specification ¶[0039]).

Regarding claim 7: Claim 7 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 1, which includes an abstract idea (see the rejection of claim 1). The additional limitations:

"by the one or more processing circuits" – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
"receiving, [by the one or more processing circuits], a data element comprising a first data element of the first data type and a second data element of the second data type;" – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

"extracting, [by the one or more processing circuits], first inference features of the first data element and second inference features of the second data element;" – This limitation is directed to insignificant extra-solution activity – mere data outputting (see MPEP 2106.05(g)).

"generating, [by the one or more processing circuits], one or more merged features by combining one or more of the first inference features with one or more of the second inference features, wherein each of the one or more of the first inference features is a particular common feature to one of the one or more of the second inference features;" – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III.C).

"identifying, [by the one or more processing circuits], unique first classification features of the first classification features unique to the first data type;" – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III.C).

"identifying, [by the one or more processing circuits], unique second classification features of the second classification features unique to the second data type;" – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III.C).

"generating, [by the one or more processing circuits], a model output of the model by applying the one or more merged features, the unique first classification features, and the unique second classification features as inputs to the model." – Adding the words "apply it" (or an equivalent) to the judicial exception, or mere instructions to implement an abstract idea on a computer, merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)) and therefore fails to integrate the exception into a practical application.

Regarding claim 8: Claim 8 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 1, which includes an abstract idea (see the rejection of claim 1). The additional limitations:

"wherein the first data type is a text based data type and the second data type is at least one of an image data type or a video data type." – This merely recites a further limitation on the receiving limitation of claim 1, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

Regarding claim 9: Claim 9 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 8, which includes an abstract idea (see the rejection of claim 8).
The additional limitations:

"wherein at least a first portion of the plurality of first data elements are associated with first data element labels and at least a second portion of the plurality of second data elements are associated with second data element labels," – This merely recites a further limitation on the receiving limitation of claim 1, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

"a first number of the first data element labels is greater than a second number of the second data element labels;" – This merely recites a further limitation on the receiving limitation of claim 1, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

"the method further comprises training, [by the one or more processing circuits], the model based on the first data element labels and the second data element labels" – Adding the words "apply it" (or an equivalent) to the judicial exception, or mere instructions to implement an abstract idea on a computer, merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)) and therefore fails to integrate the exception into a practical application.

"by the one or more processing circuits" – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).

Regarding claim 10: Claim 10 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 8, which includes an abstract idea (see the rejection of claim 8). The additional limitations:

"wherein at least a first portion of the plurality of first data elements are associated with first data element labels and none of the plurality of second data elements are associated with second data element labels;" – This merely recites a further limitation on the receiving limitation of claim 1, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

"the method further comprises training, [by the one or more processing circuits], the model based on the first data element labels." – Adding the words "apply it" (or an equivalent) to the judicial exception, or mere instructions to implement an abstract idea on a computer, merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)) and therefore fails to integrate the exception into a practical application.

"by the one or more processing circuits" – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).

Regarding claim 11:

Step 1 – Is the claim to a process, machine, manufacture, or composition of matter? Yes, the claim is a process.

Step 2A, Prong 1 – Does the claim recite an abstract idea, law of nature, or natural phenomenon? Yes, the claim recites an abstract idea:
"identify first features of each of the plurality of first data elements;" – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III.C).

"identify features of each of the plurality of second data elements;" – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III.C).

"generate merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, wherein the first feature and the second feature each represent a common feature;" – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III.C).

Step 2A, Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application? No, there are no additional elements that integrate the judicial exception into a practical application. The additional elements:

"A system including one or more memory devices configured to store instructions thereon, that, when executed by one or more processors, cause the one or more processors to" – This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).

"receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type;" – This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

"generate a model based on the common features and at least a portion of the first features and the second features." – Adding the words "apply it" (or an equivalent) to the judicial exception, or mere instructions to implement an abstract idea on a computer, merely uses a computer as a tool to perform an abstract idea (see MPEP 2106.05(f)) and therefore fails to integrate the exception into a practical application.

Step 2B – Does the claim recite additional elements that amount to significantly more than the judicial exception? No, there are no additional elements that amount to significantly more than the judicial exception. The additional elements are the same elements identified under Step 2A, Prong 2, and for the same reasons they do not amount to significantly more than the judicial exception.

Regarding claim 12: Claim 12 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. The claim depends from claim 11, which includes an abstract idea (see the rejection of claim 11).
The additional limitations:

wherein each of the plurality of first data elements is associated with one of the plurality of second data elements; - This claim merely recites a further limitation on the "receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type" limitation of claim 11, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

generating the merged features comprises combining the first feature of the first features of each of the plurality of first data elements with the second feature of the second features of the one of the plurality of second data elements that each of the plurality of first data elements is associated with. - This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).

Regarding claim 13: Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 11, which includes an abstract idea (see the rejection of claim 11). The additional limitations:

identify the first features and the second features comprise applying one or more models to the plurality of first data elements and the plurality of second data elements, - This limitation merely adds the words "apply it" (or an equivalent) to the judicial exception, or amounts to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea [see MPEP 2106.05(f)], and therefore fails to integrate the exception into a practical application.

the one or more models extract the first features from the plurality of first data elements and extract the second features from the plurality of second data elements.
This limitation is directed to insignificant extra-solution activity – mere data outputting (see MPEP 2106.05(g)).

Regarding claim 14: Claim 14 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 11, which includes an abstract idea (see the rejection of claim 11). The additional limitations:

wherein the one or more models include at least one of an image embedding model, a video embedding model, an object recognition model, an audio translation model, and an optical character recognition model. - This claim merely recites a further limitation on the "generate a model based on the common features and at least a portion of the first features and the second features" limitation of claim 11, which merely adds the words "apply it" (or an equivalent) to the judicial exception, or amounts to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea [see MPEP 2106.05(f)], and therefore fails to integrate the exception into a practical application.

Regarding claim 15: Claim 15 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 11, which includes an abstract idea (see the rejection of claim 11). The additional limitations:

wherein combining the first feature with the second feature comprises performing an operation on (i) a first value of the first feature representing a first confidence of the first feature and (ii) a second value of the second feature representing a second confidence of the second feature. - This limitation is directed to a mathematical calculation (see MPEP 2106.04(a)(2) I. C.), as it combines the first feature with the second feature using a mathematical operation, such as summation, subtraction, multiplication, averaging, determining a median, etc.
(see Instant Specification ¶[0039]).

Regarding claim 16: Claim 16 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 15, which includes an abstract idea (see the rejection of claim 15). The additional limitations:

wherein the operation is at least one of: a maximum operation that selects a maximum of the first value and the second value; - This claim merely recites a further limitation on the "performing an operation on a first value of the first feature representing a first confidence of the first feature with a second value of the second feature representing a second confidence of the second feature" limitation of claim 15, which was directed to a mathematical calculation (see MPEP 2106.04(a)(2) I. C.), as it combines the first feature with the second feature using a mathematical operation, such as summation, subtraction, multiplication, averaging, determining a median, etc. (see Instant Specification ¶[0039]).

a summation operation that sums the first value and the second value; - This claim merely recites a further limitation on the same claim 15 limitation, which was directed to a mathematical calculation (see MPEP 2106.04(a)(2) I. C.), as it combines the first feature with the second feature using a mathematical operation, such as summation, subtraction, multiplication, averaging, determining a median, etc.
(see Instant Specification ¶[0039]).

a median operation that determines a median of the first value and the second value; and - This claim merely recites a further limitation on the same claim 15 limitation, which was directed to a mathematical calculation (see MPEP 2106.04(a)(2) I. C.) (see Instant Specification ¶[0039]).

a minimum operation that selects a minimum of the first value and the second value. - This claim merely recites a further limitation on the same claim 15 limitation, which was directed to a mathematical calculation (see MPEP 2106.04(a)(2) I. C.) (see Instant Specification ¶[0039]).

Regarding claim 17: Claim 17 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 11, which includes an abstract idea (see the rejection of claim 11).
The additional limitations:

receive a data element comprising a first data element of the first data type and a second data element of the second data type; - This limitation is directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

extract first inference features of the first data element and second inference features of the second data element; - This limitation is directed to insignificant extra-solution activity – mere data outputting (see MPEP 2106.05(g)).

generate one or more merged features by combining one or more of the first inference features with one or more of the second inference features, wherein each of the one or more of the first inference features is a particular common feature to one of the one or more of the second inference features; - This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).

identify unique first classification features of the first classification features unique to the first data type; - This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).

identify unique second classification features of the second classification features unique to the second data type; and - This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).

generate a model output of the model by applying the one or more merged features, the unique first classification features, and the unique second classification features as inputs to the model.
- This limitation merely adds the words "apply it" (or an equivalent) to the judicial exception, or amounts to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea [see MPEP 2106.05(f)], and therefore fails to integrate the exception into a practical application.

Regarding claim 18: Claim 18 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 11, which includes an abstract idea (see the rejection of claim 11). The additional limitations:

This claim merely recites a further limitation (wherein the first data type is text data and the second data type is at least one of image data or video data) on the "receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type" limitation of claim 11, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

Regarding claim 19: Claim 19 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 18, which includes an abstract idea (see the rejection of claim 18). The additional limitations:

wherein at least a first portion of the plurality of first data elements are associated with first data element labels and at least a second portion of the plurality of second data elements are associated with second data element labels, - This claim merely recites a further limitation on the "receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type" limitation of claim 11, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).
a first number of the first data element labels is greater than a second number of the second data element labels; - This claim merely recites a further limitation on the "receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type" limitation of claim 11, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

the instructions cause the one or more processors to - This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).

train the model based on the first data element labels and the second data element labels. - This limitation merely adds the words "apply it" (or an equivalent) to the judicial exception, or amounts to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea [see MPEP 2106.05(f)], and therefore fails to integrate the exception into a practical application.

Regarding claim 20: Claim 20 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 18, which includes an abstract idea (see the rejection of claim 18). The additional limitations:

wherein at least a first portion of the plurality of first data elements are associated with first data element labels and none of the plurality of second data elements are associated with second data element labels; - This claim merely recites a further limitation on the "receive a plurality of first data elements of a first data type and a plurality of second data elements of a second data type" limitation of claim 11, which was directed to insignificant extra-solution activity – mere data gathering (see MPEP 2106.05(g)).

the instructions cause the one or more processors to - This limitation is directed to a computer merely used as a tool to perform an existing process (see MPEP 2106.05(f)(2)).
train the model based on the first data element labels. - This limitation merely adds the words "apply it" (or an equivalent) to the judicial exception, or amounts to mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea [see MPEP 2106.05(f)], and therefore fails to integrate the exception into a practical application.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim(s) 1, 2, 8, 11, 12 and 18 are rejected under 35 U.S.C.
103 as being unpatentable over Zhang et al. (“Cross-Modal and Hierarchical Modeling of Video and Text”) (hereafter referred to as “Zhang”) in view of Leibovitz et al. (US 10,223,586 B1) (hereafter referred to as “Leibovitz”).

As per claim 1, Zhang explicitly discloses:

A method comprising: receiving, by one or more processing circuits, a plurality of first data elements of a first data type and a plurality of second data elements of a second data type, wherein the first data type is text data and the second data type is at least one of image data or video data; (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.”) [Examiner’s note: first data elements, i.e., multiple sentences of text data; second data elements, i.e., multiple clips of video data]

identifying, by the one or more processing circuits, first features of each of the plurality of first data elements; (Zhang, Page 4, Section 3.1: “Likewise, we assume there is a paragraph of texts describing the video. The paragraph p contains n sentences, one for each video clip. Let si denote the ith sentence and wij the feature for the jth word out of n’i words.”) [Examiner’s note: features of first data elements is being interpreted as the feature wij for the jth word of multiple sentences (i.e., first data elements as shown above) for text data]

identifying, by the one or more processing circuits, second features of each of the plurality of second data elements; (Zhang, Page 4, Section 3.1: “A video v has n clips (or subshots), where each clip ci contains ni frames.
Each frame is represented by a visual feature vector xij.”) [Examiner’s note: features of second data elements is being interpreted as the visual feature vector xij of multiple clips (i.e., second data elements as shown above) for video data]

generating, by the one or more processing circuits, merged features by combining a first feature of the first features of each of the plurality of first data elements with a second feature of the second features of one of the plurality of second data elements, (Zhang, Page 12, Section 4.3: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator.”) [Examiner’s note: merged features is being interpreted as the learned embeddings for video captioning]

wherein the first feature and the second feature each represent a common feature; and (Zhang, Page 8, Section 4.1: “Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips.”; Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips) are mapped into a local embedding space where the corresponding pairs of clips and sentences are placed close to each other. As a whole, the videos and the paragraphs are mapped into a global semantic space where their embeddings are close.
”) [Examiner’s note: the pairs of clips and sentences in red, blue, and green share a common feature, i.e., a corresponding paragraph with sentences aligned to the video clips]

generating, by the one or more processing circuits, a model based on the common features and at least a portion of the first features and the second features; and (Zhang, Page 12, Section 4.3, ¶[1]: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator. We randomly initialized the word embedding as well as LSTM and trained the model for 25 epochs with learning rate of 0.001.”; Page 2, ¶[2]: “Due to its narrowly focused semantic content, each clip is then describable with a sentence. The description for the whole video is then a paragraph of texts with sentences linearly arranged in order. Arguably, a corresponding pair of video and its descriptive paragraph can be embedded into a semantic space where their embeddings are close to each other”) [Examiner’s note: a model, i.e., a caption model; common features, i.e., the concatenated clip-level feature and contextual video-level feature; a portion of the first and second features, i.e., the pre-trained video embeddings]

Zhang fails to disclose: by one or more processing circuits.

However, Leibovitz explicitly discloses: by one or more processing circuits (Leibovitz, Col.
10, Lines 30-38: “In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention”)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Leibovitz. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Leibovitz teaches a method, system, and computer program product for training a machine learning classifier to classify electronic documents based on a multi-modal training model. One of ordinary skill would have been motivated to combine Zhang and Leibovitz because MPEP 2143 sets forth the Supreme Court rationales for obviousness, including: (D) applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) “obvious to try”: choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; and (F) known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.

As per claim 2, the combination of Zhang and Leibovitz discloses all the limitations of claim 1 (as shown in the rejections above).
Zhang in view of Leibovitz further discloses:

wherein each of the plurality of first data elements is associated with one of the plurality of second data elements; (Zhang, Page 8, Section 4.1, ¶[1]: “Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips.”; Page 2, Figure 1) [Examiner’s note: first data elements, i.e., multiple sentences of text data; second data elements, i.e., multiple clips of video data]

generating the merged features comprises combining the first feature of the first features of each of the plurality of first data elements with the second feature of the second features of the one of the plurality of second data elements that each of the plurality of first data elements is associated with. (Zhang, Page 12, Section 4.3: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator.”) [Examiner’s note: merged features is being interpreted as the learned embeddings for video captioning]

As per claim 8, the combination of Zhang and Leibovitz discloses all the limitations of claim 1 (as shown in the rejections above). Zhang in view of Leibovitz further discloses:

wherein the first data type is a text based data type and the second data type is at least one of an image data type or a video data type. (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.
”)

Referring to independent claim 11, it is rejected on the same basis as independent claim 1 since they are analogous claims. However, claim 11 recites the additional limitation:

A system comprising one or more memory devices storing instructions that, when executed by one or more processors, cause the one or more processors to: (Leibovitz, Col. 9, Lines 32-36: “The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention”; Col. 9, Lines 44-54: “A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device having instructions recorded thereon, and any suitable combination of the foregoing.”)

Referring to dependent claim 12, it is rejected on the same basis as dependent claim 2 since they are analogous claims. Referring to dependent claim 18, it is rejected on the same basis as dependent claim 8 since they are analogous claims.

Claim(s) 3-7, 9-10, 13-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (“Cross-Modal and Hierarchical Modeling of Video and Text”) (hereafter referred to as “Zhang”) in view of Leibovitz et al. (US 10,223,586 B1) (hereafter referred to as “Leibovitz”) and further in view of Leonardo et al.
(“Fusing Visual and Textual Information to Determine Content Safety”) (hereafter referred to as “Leonardo”).

As per claim 3, the combination of Zhang and Leibovitz discloses all the limitations of claim 1 (as shown in the rejections above). Zhang in view of Leibovitz further discloses:

identifying the first features and the second features includes (Zhang, Page 4, Section 3.1: “Likewise, we assume there is a paragraph of texts describing the video. The paragraph p contains n sentences, one for each video clip. Let si denote the ith sentence and wij the feature for the jth word out of n’i words.”; “A video v has n clips (or subshots), where each clip ci contains ni frames. Each frame is represented by a visual feature vector xij.”) [Examiner’s note: features of first data elements is being interpreted as the feature wij for the jth word of multiple sentences for text data; features of second data elements is being interpreted as the visual feature vector xij of multiple clips for video data]

Zhang in view of Leibovitz fails to disclose: applying one or more models to the plurality of first data elements and the plurality of second data elements,

However, Leonardo explicitly discloses: applying one or more models to the plurality of first data elements and the plurality of second data elements, (Leonardo, Page 2026, Col. 2, ¶[2]: “In early fusion, intermediate features from separate modalities are extracted and jointly represented, then learned with one single model. Given separate, pre-trained computer vision (CV) and natural language processing (NLP) models, our fully automated framework uses late fusion to classify web pages as either safe or threat, as well as into 10 possible threat categories.”; Page 2027, Col. 1, Section III.A, ¶[1-2]: “We are analyzing two modalities present in a web page: visual signals and natural language.
One classifier for each modality was trained to determine content safety. For the CV model, a Squeeze-and-Excitation (SE) Network [9] was trained using initial weights from ImageNet’s image classification task [10] and fine-tuned on web page images. For the NLP model, a Universal Language Model Finetuning Framework (ULMFiT) [11] was trained and fine-tuned.”) [Examiner’s note: “one or more models” is being interpreted as the pre-trained computer vision (CV) and natural language processing (NLP) models; first data elements, i.e., the natural language of web pages; second data elements, i.e., the visual signals or images of web pages]

the one or more models extract the first features from the plurality of first data elements and extract the second features from the plurality of second data elements. (Leonardo, Page 2029, Col. 2, Section 2: “CNNs are used extensively for difficult classification tasks due to their ability to find patterns in the input data with high accuracy [22]. They are especially useful on spatial input data in which the output depends on the position of each individual feature. For this work, CNNs are used on the intermediate features extracted from the CV model and the NLP model to perform binary classification.”) [Examiner’s note: “one or more models” is being interpreted as the pre-trained computer vision (CV) and natural language processing (NLP) models; the first and second features of the first and second data elements, i.e., the intermediate features of text data and image data]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Leibovitz and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Leibovitz teaches a method, system, and computer program product for training a machine learning classifier to classify electronic documents based on a multi-modal training model.
Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve current predictions of content safety. One of ordinary skill would have been motivated to combine Zhang, Leibovitz and Leonardo to incorporate Leonardo’s multimodal approach into a safety-classification system, improving the system’s ability to classify web content more accurately and robustly.

As per claim 4, the combination of Zhang, Leibovitz and Leonardo discloses all the limitations of claim 3 (as shown in the rejections above). Zhang in view of Leibovitz and Leonardo further discloses:

wherein the one or more models include at least one of an image embedding model, a video embedding model, an object recognition model, an audio translation model, and an optical character recognition model. (Leonardo, Page 2026, Col. 2, ¶[2]: “Given separate, pre-trained computer vision (CV) and natural language processing (NLP) models, our fully automated framework uses late fusion to classify web pages as either safe or threat, as well as into 10 possible threat categories.”; Page 2027, Col. 1, Section III.A: “For the CV model, a Squeeze-and-Excitation (SE) Network [9] was trained using initial weights from ImageNet’s image classification task [10] and fine-tuned on web page images (Figure 1).”) [Examiner’s note: the computer vision model, i.e., an object recognition model]

As per claim 5, the combination of Zhang and Leibovitz discloses all the limitations of claim 1 (as shown in the rejections above). Zhang in view of Leibovitz fails to disclose:

wherein combining the first feature with the second feature comprises performing an operation on (i) a first value of the first feature representing a first confidence of the first feature and (ii) a second value of the second feature representing a second confidence of the second feature.
However, Leonardo explicitly discloses:

wherein combining the first feature with the second feature comprises performing an operation on (i) a first value of the first feature representing a first confidence of the first feature and (ii) a second value of the second feature representing a second confidence of the second feature. (Leonardo, Page 2028, Col. 1, Section IV.A.1: “Since unique web pages in the dataset are often associated with multiple images, the visual features of these images are combined first by choosing the minimum value of x0 (confidence score for TC0) and maximum value for xi (confidence score for threat TCi, i = 1, 2, . . . , 9), componentwise across all the images corresponding to the same web page. The method used for merging different x’s using the aforementioned criteria is named minmax.… Now, a single 10-dimensional vector represents the features from all the images in a web page and another 10-dimensional vector represents the textual features. In order to merge both and train them altogether, we can average each component, concatenate them both or use the minmax method. The results from any of the three methods arrive at xwp, a vector that’s comprised of visual and textual features from a web page, while y represents xwp’s ground truth.”) [Examiner’s note: an operation, i.e., the minmax method; the first value representing a first confidence, i.e., the confidence score for visual features; the second value representing a second confidence, i.e., the confidence score for textual features]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Leibovitz and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Leibovitz teaches a method, system, and computer program product for training a machine learning classifier to classify electronic documents based on a multi-modal training model.
Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve predictions of content safety. One of ordinary skill in the art would have been motivated to combine Zhang, Leibovitz and Leonardo to incorporate Leonardo’s multimodal approach into a safety-classification system to improve the system’s ability to classify web content more accurately and robustly. As per claim 6, the combination of Zhang, Leibovitz and Leonardo discloses all the limitations of claim 5 (as shown in the rejections above). Zhang in view of Leibovitz and Leonardo further discloses: wherein the operation is at least one of: a maximum operation that selects a maximum of the first value and the second value; a summation operation that sums the first value and the second value; a median operation that determines a median of the first value and the second value; and a minimum operation that selects a minimum of the first value and the second value. (Leonardo, Page 2028, Col. 1, Section IV.A.1: “Since unique web pages in the dataset are often associated with multiple images, the visual features of these images are combined first by choosing the minimum value of x0 (confidence score for TC0) and maximum value for xi (confidence score for threat TCi, i = 1, 2, . . . , 9), componentwise across all the images corresponding to the same web page. The method used for merging different x’s using the aforementioned criteria is named minmax.… Now, a single 10-dimensional vector represents the features from all the images in a web page and another 10-dimensional vector represents the textual features. In order to merge both and train them altogether, we can average each component, concatenate them both or use the minmax method.
The results from any of the three methods arrive at xwp, a vector that’s comprised of visual and textual features from a web page, while y represents xwp’s ground truth.”) [Examiner’s note: first and second value i.e., the confidence score corresponding to visual features and textual features, the minmax operation i.e., selecting maximum value and selecting minimum value, concatenation operation i.e., summation operation] As per claim 7, the combination of Zhang and Leibovitz discloses all the limitations of claim 1 (as shown in the rejections above). Zhang in view of Leibovitz further discloses: receiving, by the one or more processing circuits, a data element comprising a first data element of the first data type and a second data element of the second data type; (Zhang, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently.”) [Examiner’s note: a first data element i.e., multiple sentences for text data, a second data element i.e., multiple clips for video data] by one or more processing circuits (Leibovitz, Col.
10, Lines 30-38: “In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention”) extracting, by the one or more processing circuits, first inference features of the first data element and second inference features of the second data element; (Zhang, Page 18, Section A.1, ¶[1]: “In all our experiments under this setting, we extract frame-wise video feature using C3D model pre-trained on Sports-1M dataset, with the temporal stride of 16.”, Page 18, Section A.1, ¶[3]: “Word Features. In the retrieval related experiments, we always use GloVE features [30] for the initialization of the word embedding and fine-tune. Specifically, we use the GloVE vectors pre-trained on 840B common web-crawled data, with its dimensionality equals to 300.”) [Examiner’s note: first inference features i.e., the video features, second inference features i.e., word features ] generating, by the one or more processing circuits, one or more merged features by combining one or more of the first inference features with one or more of the second inference features, (Zhang, Page 12, Section 4.3: “In addition to the video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. 
Similar to [20], we concatenate the clip-level feature with contextual video-level feature, and build a two-layer LSTM as a caption generator.”) [Examiner’s note: merged features are being interpreted as the learned embeddings for video captioning] wherein each of the one or more of the first inference features is a particular common feature to one of the one or more of the second inference features; (Zhang, Page 8, Section 4.1: “Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips.”, Page 2, Figure 1: “Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips) are mapped into a local embedding space where the corresponding pairs of clips and sentences are placed close to each other. As a whole, the videos and the paragraphs are mapped into a global semantic space where their embeddings are close.”) [Examiner’s note: pairs of clips and sentences in red, blue, green share a common feature i.e., a corresponding paragraph with sentences aligned to the video clips] identifying, by the one or more processing circuits, unique first classification features of the first classification features unique to the first data type; (Zhang, Page 4, Section 3.1: “Likewise, we assume there is a paragraph of texts describing the video. The paragraph p contains n sentences, one for each video clip.
Let si denote the ith sentence and wij the feature for the jth word out of n’i words.”) (Examiner’s note: first classification features is being interpreted as feature wij for the jth word of multiple sentences (i.e., first data elements as shown above) for text data) identifying, by the one or more processing circuits, unique second classification features of the second classification features unique to the second data type; and (Zhang, Page 4, Section 3.1: “A video v has n clips (or subshots), where each clip ci contains ni frames. Each frame is represented by a visual feature vector xij .”) (Examiner’s note: second classification features is being interpreted as a visual feature vector xij of multiple clips (i.e., second data elements as shown above) for video data) Zhang in view of Leibovitz fails to disclose: generating, by the one or more processing circuits, a model output of the model by applying the one or more merged features, the unique first classification features, the unique second classification features as inputs to the model. However, Leonardo explicitly discloses: generating, by the one or more processing circuits, a model output of the model by applying the one or more merged features, the unique first classification features, the unique second classification features as inputs to the model. (Leonardo, Page 2029, Col. 1, Section V.A.1, ¶[2]: “Once the visual and textual intermediate features are extracted and normalized to [0, 1], they are passed into an image and text autoencoder, respectively.”, Page 2029, Col. 2, ¶[2]: “for each web page with one textual feature t and N visual features vi (where N varies for each web page), new data is created by concatenating t and vi for 1 ≤ i ≤ N. These concatenated features are passed into a random forest classifier, and hyperparameters are selected as described in Algorithms 1 and 2.”, Page 2029, Col. 
2, ¶[4]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text. The objective now is to make the CNN learn to predict the safety of image-text pair. The CNN is trained on these random pairings and is tested on the page-level by combining predictions with OR Logic.”) [Examiner’s note: a model output i.e., the image-text pair safety prediction, merged feature i.e., the concatenated textual feature t and visual features vi, first classification feature i.e., the textual feature, second classification feature i.e., the visual feature, the model i.e., the CNN model] It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Leibovitz and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Leibovitz teaches a method, system, and computer program product for training a machine learning classifier to classify electronic documents based on a multi-modal training model. Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve predictions of content safety. One of ordinary skill in the art would have been motivated to combine Zhang, Leibovitz and Leonardo to incorporate Leonardo’s multimodal approach into a safety-classification system to improve the system’s ability to classify web content more accurately and robustly. As per claim 9, the combination of Zhang and Leibovitz discloses all the limitations of claim 8 (as shown in the rejections above).
Zhang in view of Leibovitz fails to disclose: wherein at least a first portion of the plurality of first data elements are associated with first data element labels and at least a second portion of the plurality of second data elements are associated with second data element labels, a first number of the first data element labels is greater than a second number of the second data element labels; the method further comprises training, by the one or more processing circuits, the model based on the first data element labels and the second data element labels. However, Leonardo discloses: wherein at least a first portion of the plurality of first data elements are associated with first data element labels and at least a second portion of the plurality of second data elements are associated with second data element labels, (Leonardo, Page 2029, Col. 1, Section B: “Each of these classifiers is trained to identify if xwp should be labeled as TCi or not, using yi, i = 0, 1, . . . , 9 as the corresponding ground truth for the respective TC.”, Page 2028, Col. 1, Section IV.A.1: “The results from any of the three methods arrive at xwp, a vector that’s comprised of visual and textual features from a web page, while y represents xwp’s ground truth.”, Page 2027, Col. 2, Section B: “Our binary classification algorithms classify every web page as either safe or threat. For multilabel classification, there are 10 target labels that can be assigned to every web page. Some web pages can be classified in multiple threat categories as well. Thus, a binary classifier is trained to identify each threat category (TC), shown in Table I.
”) [Examiner’s note: xwp, a vector that’s comprised of visual and textual features from a web page (i.e., first and second portion of the plurality of first data elements) is classified with labels TCi with i = 0, 1, … 9] a first number of the first data element labels is greater than a second number of the second data element labels; (Leonardo, Page 2027, Col. 2, Section B, ¶[2]: “The training and testing datasets consist of 2643 and 620 web pages respectively, which were annotated by an independent company. Out of the web pages in the training data, 1653 are labeled as safe and 990 as threat. If a web page is a threat, it can belong to multiple TCs.”) [Examiner’s note: the first data element label i.e., safe, the second data element label i.e., threat] the method further comprises training, by the one or more processing circuits, the model based on the first data element labels and the second data element labels. (Leonardo, Page 2029, Col. 2, Section 2, ¶[2]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text. The objective now is to make the CNN learn to predict the safety of image-text pair. The CNN is trained on these random pairings and is tested on the page-level by combining predictions with OR Logic.”) [Examiner’s note: the first data element label i.e., safe, the second data element label i.e., threat or unsafe] It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Leibovitz and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly.
Leibovitz teaches a method, system, and computer program product for training a machine learning classifier to classify electronic documents based on a multi-modal training model. Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve predictions of content safety. One of ordinary skill in the art would have been motivated to combine Zhang, Leibovitz and Leonardo to incorporate Leonardo’s multimodal approach into a safety-classification system to improve the system’s ability to classify web content more accurately and robustly. As per claim 10, the combination of Zhang and Leibovitz discloses all the limitations of claim 8 (as shown in the rejections above). Zhang in view of Leibovitz fails to disclose: wherein at least a first portion of the plurality of first data elements are associated with first data element labels and none of the plurality of second data elements are associated with second data element labels; the method comprises training, by the one or more processing circuits, the model based on the first data element labels. However, Leonardo explicitly discloses: wherein at least a first portion of the plurality of first data elements are associated with first data element labels and none of the plurality of second data elements are associated with second data element labels; (Leonardo, Page 2027, Col. 2, Section B, ¶[2]: “The training and testing datasets consist of 2643 and 620 web pages respectively, which were annotated by an independent company. Out of the web pages in the training data, 1653 are labeled as safe and 990 as threat. If a web page is a threat, it can belong to multiple TCs.”, Page 2029, Col. 2, Section 2, ¶[2]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings.
These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text.”) [Examiner’s note: the first data element label i.e., safe, the second data element label i.e., threat or unsafe, first portion of first data elements i.e., text, second portion of second data elements i.e., images. “safe images with safe text” is being interpreted as image data that is not associated with a threat or unsafe label element.] the method comprises training, by the one or more processing circuits, the model based on the first data element labels. (Leonardo, Page 2029, Col. 2, Section 2, ¶[2]: “Thus, a much larger dataset was created by generating random pairs of one visual feature and one textual feature to produce 40,000 total pairings. These pairings are constructed so that 10,000 image-text combinations are produced for each of the four combinations: unsafe images with unsafe text, unsafe images with safe text, safe images with unsafe text, and safe images with safe text. The objective now is to make the CNN learn to predict the safety of image-text pair. The CNN is trained on these random pairings and is tested on the page-level by combining predictions with OR Logic.”) [Examiner’s note: first data element label i.e., the safe label in “safe images with safe text” data is used to train the CNN model] It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Leibovitz and Leonardo. Zhang teaches a novel cross-modal learning approach to model both videos and texts jointly. Leibovitz teaches a method, system, and computer program product for training a machine learning classifier to classify electronic documents based on a multi-modal training model.
Leonardo teaches a multimodal machine learning framework that fuses visual and textual information from web pages to improve predictions of content safety. One of ordinary skill in the art would have been motivated to combine Zhang, Leibovitz and Leonardo to incorporate Leonardo’s multimodal approach into a safety-classification system to improve the system’s ability to classify web content more accurately and robustly. Referring to dependent claim 13, it is rejected on the same basis as dependent claim 3 since they are analogous claims. Referring to dependent claim 14, it is rejected on the same basis as dependent claim 4 since they are analogous claims. Referring to dependent claim 15, it is rejected on the same basis as dependent claim 5 since they are analogous claims. Referring to dependent claim 16, it is rejected on the same basis as dependent claim 6 since they are analogous claims. Referring to dependent claim 17, it is rejected on the same basis as dependent claim 7 since they are analogous claims. Referring to dependent claim 19, it is rejected on the same basis as dependent claim 9 since they are analogous claims. Referring to dependent claim 20, it is rejected on the same basis as dependent claim 10 since they are analogous claims. Conclusion Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMY TRAN whose telephone number is (571) 270-0693. The examiner can normally be reached Monday - Friday, 7:30 am - 5:00 pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi, can be reached at (571) 270-7519.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /AMY TRAN/Examiner, Art Unit 2126 /DAVID YI/Supervisory Patent Examiner, Art Unit 2126
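For context on the rejections of claims 5 and 6, the late-fusion "minmax" operation quoted from Leonardo, along with the elementwise operations recited in claim 6 (maximum, summation, median, minimum), can be sketched in Python. This is an illustrative reconstruction from the quoted passages only; the function names, the `fuse` helper, and the three-category example values are hypothetical, not taken from the references.

```python
from statistics import median

def minmax_merge(vectors):
    """Merge several per-image confidence vectors componentwise, as in the
    passage quoted from Leonardo: minimum for component 0 (the 'safe'
    category TC0), maximum for each remaining component (threat
    categories) across all vectors."""
    merged = [max(v[j] for v in vectors) for j in range(len(vectors[0]))]
    merged[0] = min(v[0] for v in vectors)
    return merged

def fuse(first, second, op="minmax"):
    """Combine two confidence vectors (e.g. visual and textual) with one of
    the elementwise operations recited in claim 6 -- max, sum, median,
    min -- or with Leonardo's minmax variant. (Hypothetical helper.)"""
    if op == "minmax":
        return minmax_merge([first, second])
    ops = {"max": max, "min": min,
           "sum": lambda a, b: a + b,
           "median": lambda a, b: median([a, b])}
    return [ops[op](a, b) for a, b in zip(first, second)]

# Two images on one page, three categories for brevity (Leonardo uses ten).
images = [[0.9, 0.1, 0.2], [0.4, 0.6, 0.1]]
page_visual = minmax_merge(images)        # [0.4, 0.6, 0.2]
text_scores = [0.7, 0.3, 0.5]
print(fuse(page_visual, text_scores))     # [0.4, 0.6, 0.5]
```

Note how the examiner's claim-6 mapping reads directly onto `op="max"` and `op="min"`: the minmax method both "selects a maximum" (threat components) and "selects a minimum" (the safe component) of the two values.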

Prosecution Timeline

Dec 22, 2023
Application Filed
Mar 07, 2026
Non-Final Rejection — §101, §103, §DP (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602582
DYNAMIC DISTRIBUTED TRAINING OF MACHINE LEARNING MODELS
2y 5m to grant Granted Apr 14, 2026
Patent 12468932
IDENTIFYING RELATED MESSAGES IN A NATURAL LANGUAGE INTERACTION
2y 5m to grant Granted Nov 11, 2025
Patent 12462185
SCENE GRAMMAR BASED REINFORCEMENT LEARNING IN AGENT TRAINING
2y 5m to grant Granted Nov 04, 2025
Patent 12423589
TRAINING DECISION TREE-BASED PREDICTIVE MODELS
2y 5m to grant Granted Sep 23, 2025
Patent 12288074
GENERATING AND PROVIDING PROPOSED DIGITAL ACTIONS IN HIGH-DIMENSIONAL ACTION SPACES USING REINFORCEMENT LEARNING MODELS
2y 5m to grant Granted Apr 29, 2025
Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
36%
Grant Probability
84%
With Interview (+47.9%)
5y 2m
Median Time to Grant
Low
PTA Risk
Based on 28 resolved cases by this examiner. Grant probability derived from career allow rate.
