Prosecution Insights
Last updated: April 19, 2026
Application No. 18/544,187

MULTIMODAL CONTENT RELEVANCE PREDICTION USING NEURAL NETWORKS

Non-Final OA §103
Filed: Dec 18, 2023
Examiner: BONANSINGA, AARON TIMOTHY
Art Unit: 2673
Tech Center: 2600 — Communications
Assignee: Microsoft Technology Licensing, LLC
OA Round: 1 (Non-Final)
Grant Probability: 76% (Favorable)
OA Rounds: 1-2
To Grant: 2y 11m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 76% (19 granted / 25 resolved), +14.0% vs TC avg; above average
Interview Lift: +33.3% higher allowance among resolved cases with an interview than without
Typical Timeline: 2y 11m average prosecution; 29 applications currently pending
Career History: 54 total applications across all art units

Statute-Specific Performance

§101: 7.4% (-32.6% vs TC avg)
§102: 10.3% (-29.7% vs TC avg)
§103: 69.6% (+29.6% vs TC avg)
§112: 9.2% (-30.8% vs TC avg)
Based on career data from 25 resolved cases; comparisons are against the Tech Center average estimate.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement(s) (IDS) submitted on 01/12/2024 has been considered by the examiner and placed in the application file.

Claim Objections

Claim 2 is objected to because of the following informalities: At line 8 the term “and obtaining the dense mage embedding as output of the fully connected layers.” should be changed to “and obtaining the dense image embedding as output of the fully connected layers.” to correct a misspelling of “image”. Appropriate correction is required.

Claim 8 is objected to under 37 CFR 1.75 as being a substantial duplicate of claim 1. When two claims in an application are duplicates or else are so close in content that they both cover the same thing, despite a slight difference in wording, it is proper after allowing one claim to object to the other as being a substantial duplicate of the allowed claim. See MPEP § 608.01(m). Please cancel or amend claim 8.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 6, 8-11, 15-16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over ZENG et al. (US 20230351115 A1), hereinafter referenced as ZENG, in view of ZHANG et al. (US 20220237480 A1), hereinafter referenced as ZHANG, and in further view of LUO et al. (US 20240184835 A1), hereinafter referenced as LUO.

Regarding claim 1, ZENG explicitly teaches a method (Fig. 15. Paragraph [0090]-ZENG discloses FIG. 15 shows a block diagram of a document image processing system 1500. As shown in FIG. 15, the document image processing system 1500 includes a semantic token extraction module 1505, a semantic token embedding module 1560, a visual token extraction module 1570, a visual token embedding module 1580, and a multimodal fusion module 1590.) comprising: determining a reduced dimensionality contextual text embedding from a contextual text embedding using a second pretrained dense neural sub-network (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include, for example, a transformer model and/or a convolutional neural network (CNN) model.
In paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens. Further in paragraph [0113]-ZENG discloses the visual token embedding module 1580 may be implemented to include an image embedding portion of a vision transformer or multimodal transformer. In paragraph [0117]-ZENG discloses the transformer model 1920 may be pre-trained as a multimodal fusion transformer model. In paragraph [0122]-ZENG discloses the one or more output layers 1930 of the multimodal fusion module 1590 may be configured to include one or more of any of the following: a linear (e.g., fully connected) layer, a feed-forward neural network (e.g., multi-layer perceptron), a softmax layer. Please also read paragraph [0061]), the contextual text embedding encapsulating features of a text associated with the multimodal content (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 generates, for each of the semantic tokens, a corresponding semantic token embedding. The information embedded in each semantic token embedding may include the content of the corresponding semantic token (e.g., textual content for a word token, token definition for a predicted special token (‘_c_’, ‘_sg_’, ‘_st_’)). The information embedded in each semantic token embedding may also include the token type of the corresponding semantic token (e.g., ‘0’ for word, ‘1’ for checkbox, ‘2’ for signature, ‘3’ for stample). The semantic token embedding module 1560 may be configured to translate a vector that represents such information of the semantic token into a lower-dimensional space. Further in paragraph [0115]-ZENG discloses the multimodal fusion module may include an input processing module 1910, a transformer model 1920, and one or more output layers 1930. The input processing module 1910 may assemble the input to the trained model by concatenating the semantic token embeddings and the visual token embeddings); Although ZENG explicitly teaches determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained neural sub-network (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include a transformer model and/or a convolutional neural network (CNN) model. In paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens. Further in paragraph [0113]-ZENG discloses the visual token embedding module 1580 may be implemented to include an image embedding portion of a vision transformer or multimodal transformer. In paragraph [0117]-ZENG discloses the transformer model 1920 may be pre-trained as a multimodal fusion transformer model. In paragraph [0122]-ZENG discloses the one or more output layers 1930 of the multimodal fusion module 1590 may include one or more of any of the following: a linear (e.g., fully connected) layer, a feed-forward neural network (e.g., multi-layer perceptron), a softmax layer.), the dense image embedding encapsulating features of a digital image associated with a multimodal content (Fig. 15.
Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 may be configured to translate a vector that represents such information of the semantic token into a lower-dimensional space. Further in paragraph [0111]-ZENG discloses the visual token embedding module 1580 may apply a trainable linear projection to each flattened visual token to align the dimensionality of the resulting visual token embeddings with that of the semantic token embedding); ZENG is silent on a first pretrained dense neural sub-network. However, ZHANG explicitly teaches determining a dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network (Fig. 3. Paragraph [0044]-ZHANG discloses image encoder 325 may be a convolutional neural network model, such as an EfficientNet model, and may encode event image data 310-b and 310-c. The convolutional neural network model may be trained on an ImageNet dataset. In paragraph [0045]-ZHANG discloses after encoding, the embeddings may be normalized before being included in a set of vectors for each of the entity multimodal model and the event multimodal. A Dense neural network model 340 may be used to encode embeddings of a given dimension into a D-dimensional embedding (or other dimensional embedding supported for training a sequential model). Dense neural network model 340-a may encode each categorical data point of the entity category data 305-c, and generate an embedding 335 that is C-dimensional. A dense neural network model 340-b may encode embedding 330 having G-dimension into an embedding of a D-dimension that is included in the set of vectors 350. Dense neural network model 340-c may encode embedding 335 having C-dimension into an embedding of a D-dimension that is included in the set of vectors 350. Dense neural network model 340-d may encode embeddings 365 and 370, each having E-dimension into respective embeddings of D-dimension that are included in the set of vectors 380)), the dense image embedding encapsulating features of a digital image (Fig. 3. Paragraph [0044]-ZHANG discloses for image data (e.g., event image data 310-b and 310-c), one or more image encoders 325 may be used. The convolutional neural network model may be trained to encode both event image data 310-b and 310-c into sequences of N embeddings, such as embedding 365 and embedding 370, where each embedding is E-dimensional) associated with a multimodal content (Fig. 3. Paragraph [0041]-ZHANG discloses an entity may have multimodal data 305, such as entity text data 305-a, entity graph data 305-b, entity category data 305-c, among others (numeric, etc.), and an event may have multimodal data 310, such as event text data 310-a, and event image data 310-b and 310-c. The multimodal data 305 of the entity and the multimodal data 310 of the event may have data associated with different formats, may have missing data of a given modality, may be of different dimensions, etc. 
Multimodal data 305 of the entity can apply to individual customers and customer organizations (e.g., companies, businesses), and multimodal data 310 of the event may apply to events, tracks, and sessions); Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG having a method comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content, with the teachings of ZHANG having determining a dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content. Wherein having ZENG’s method having determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content. The motivation behind the modification would have been to obtain a method that improves data processing and predictions, since both ZENG and ZHANG concern systems and methods for processing multi-modal data using transformer neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while ZHANG provides systems and methods that encodes data and generates embeddings using transformer networks to improve user predictions and recommendations. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and ZHANG et al. (US 20220237480 A1), Abstract and Paragraph [0025-0027 and 0050]. Although ZENG explicitly teaches determining multimodal content using a third dense neural sub-network, the reduced dimensionality dense image embedding, and the reduced dimensionality contextual text embedding (Fig. 15. Paragraph [0114]-ZENG discloses the multimodal fusion module 1590 applies a trained model to process an input that is based on the plurality of semantic token embeddings and the plurality of visual token embeddings to generate a semantic processing result that includes at least one of the following: a predicted location for each of a plurality of entities in the document page image, a predicted document type of a document page depicted in the document page image, or a predicted answer to a question about the document page depicted in the document page image. In paragraph [0115]-ZENG discloses the multimodal fusion module may include an input processing module 1910, a transformer model 1920, and one or more output layers 1930. The input processing module 1910 may assemble the input to the trained model by concatenating the semantic token embeddings and the visual token embeddings. The input processing module 1910 may also embed the layout information (e.g., may include one or more embedding layers) and combine the layout embeddings with (e.g., add them to) the corresponding token embeddings. Please also read paragraph [0103 and 0111]). 
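For orientation, the claim 1 elements mapped above can be pictured with a minimal sketch, assuming PyTorch: two small fully connected ("dense") sub-networks project a dense image embedding and a contextual text embedding into a common lower-dimensional space before fusion. This is an illustrative sketch under stated assumptions, not code from the application, ZENG, ZHANG, or LUO; all module names and dimensions are hypothetical. A companion sketch of the scoring and ranking elements mapped to LUO appears at the end of this action.

```python
# Illustrative sketch only (editorial); not from the application or the cited references.
# Assumes PyTorch; all names and dimensions are hypothetical.
import torch
import torch.nn as nn

class DenseProjection(nn.Module):
    """A small fully connected ("dense") sub-network that maps an input
    embedding into a lower-dimensional space, in the spirit of the recited
    first/second pretrained dense neural sub-networks."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Hypothetical dimensions: a 2048-d dense image embedding (e.g., pooled CNN
# features) and a 768-d contextual text embedding (e.g., pooled transformer
# output), both reduced to a common 256-d space and then fused.
image_proj = DenseProjection(in_dim=2048, out_dim=256)  # "first" sub-network
text_proj = DenseProjection(in_dim=768, out_dim=256)    # "second" sub-network

dense_image_embedding = torch.randn(1, 2048)     # stand-in for CNN output
contextual_text_embedding = torch.randn(1, 768)  # stand-in for transformer output

reduced_image = image_proj(dense_image_embedding)         # shape (1, 256)
reduced_text = text_proj(contextual_text_embedding)       # shape (1, 256)
fused = torch.cat([reduced_image, reduced_text], dim=-1)  # shape (1, 512)
```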
ZENG in view of ZHANG fail to explicitly teach determining a numerical score of a multimodal content using a third dense neural sub-network; and ranking the multimodal content based on the numerical score. However, LUO explicitly teaches determining a numerical score (Fig. 15. Paragraph [0120]-LUO discloses FIG. 6 illustrates operation of the user interest modelling module 306 in one embodiment. The graph embedding of the target item and user's recent interacted item are concatenated, fed into an MLP, and then the softmax function is used to get the attention weight. The attention weight is multiplied with the user's recent interacted item embeddings and the weighted sum is used to represent user interests. In paragraph [0130]-LUO discloses the inventors of the invention have realized that a tower-shaped MLP structure can be used for predicting ratings in recommender system. In paragraph [0149]-LUO discloses in the experiment a leave-one-out evaluation strategy is used: reserve the latest interacted item for every user as the test item. Then, a list of items is generated and the items are ranked using predicted score. In the setting, t items are recommended to a user each time and the following two metrics are used to evaluate the performance of different algorithms: Hit Ratio (HR) measuring the chance that our recommendation list contains users' interested items and a weighted version of HR, termed Normalized Discounted Cumulative Gain (NDCG), which puts more weight on items that are ranked higher in the recommendation list) of a multimodal content (Fig. 15. Paragraph [0075]-LUO discloses multimodal contextual information is incorporated via various embedding techniques, like specific pre-trained models for texts and images (e.g., Sentence-BERT and ResNet), pre-processing techniques (e.g., normalization and feature crossing) for other categorical and/or dense contextual features. To reduce the feature dimensionality and learn the feature interaction, the generated feature vectors are fed into the feature crossing layer to learn the higher-order non-linear feature interactions. After obtaining the extracted graph embedding and contextual embedding, the self-attention mechanism can be used to model the user interests. In paragraph [0116]-LUO discloses after one-hot embedding of categorical features, preprocessing of dense features, and embedding of text and image data, embeddings of user and item attributes and user-item interactions can be obtained. Instead of using the embedding vector x.sub.0(t) directly, it is passed through a feature crossing network to learn high-order interaction across features and a fusion layer to reduce the dimensionality (wherein the embedding vector of all the features are expressed and concatenated, and the fusion layer uses a self-attention mechanism to obtain a final embedding of the contextual information). Further in paragraph [0129]-LUO discloses with respect to global interaction learning, attention mechanism is also adopted. Specifically, all the embedding are concatenated. Please also read paragraph [0123-0128]) using a third dense neural sub-network (Fig. 15. Paragraph [0113]-LUO discloses a pre-trained model called Sentence-BERT, can be used to process the raw data and get fixed-length vectors as output. In paragraph [0114]-LUO discloses due to the high dimensionality of raw images, a pre-trained convolutional neural network (CNN) can be used to process the images and get fixed-length vectors as output. 
In paragraph [0131]-LUO discloses a DNN prediction model, in particular a MLP 310, is used as the prediction network. The input of the MLP 310 includes the global feature representation generated by the global interaction module. The MLP 310 has fully connected layers); and ranking the multimodal content based on the numerical score (Fig. 15. Paragraph [0149]-LUO discloses in the experiment a leave-one-out evaluation strategy is used: reserve the latest interacted item for every user as the test item. Then, a list of items is generated and the items are ranked using predicted score. In the setting, t items are recommended to a user each time and the following two metrics are used to evaluate the performance of different algorithms: Hit Ratio (HR) measuring the chance that our recommendation list contains users' interested items and a weighted version of HR, termed Normalized Discounted Cumulative Gain (NDCG), which puts more weight on items that are ranked higher in the recommendation list). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG of having a method comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content, with the teachings of LUO having determining a numerical score of a multimodal content using a third dense neural sub-network; and ranking the multimodal content based on the numerical score. Wherein having ZENG’s method having determining a numerical score of a multimodal content using a third dense neural sub-network, the reduced dimensionality dense image embedding, and the reduced dimensionality contextual text embedding; and ranking the multimodal content based on the numerical score. The motivation behind the modification would have been to obtain a method that improves data processing and predictions using transformer models, since both ZENG and LUO concern systems and methods for processing multi-modal data using transformer neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while LUO provides systems and methods that improve the performance of transformer models and help to disentangle the relative static user-item interactions and the more dynamic contexts to provide improved methods. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and LUO et al. (US 20240184835 A1), Abstract and Paragraph [0051, 0150, 0155]. Regarding claim 3, ZENG in view of ZHANG and in further view of LUO explicitly teach the method of claim 1, ZENG further teaches wherein the contextual text embedding is generated (Fig. 15. Paragraph [0090]-ZENG discloses FIG. 15 shows a block diagram of a document image processing system 1500 (wherein system 1500 includes embedding module 1560, a visual token extraction module 1570, a visual token embedding module 1580, and a multimodal fusion module 1590, which generate text and image embeddings to produce document predictions using multiple pre-trained neural networks/sub-networks.
Please see paragraph [0103, 0111, and 0114-0115]) using a pretrained transformer neural network (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include, for example, a transformer model. A transformer model includes a stack of layers having attention mechanisms, such as multi-head self-attention mechanisms (MHSA). The transformer model may be a vision transformer model. In paragraph [0114]-ZENG discloses the multimodal fusion module 1590 applies a trained model to process an input that is based on the plurality of semantic token embeddings and the plurality of visual token embeddings to generate a semantic processing result that includes at least one of the following: a predicted location for each of a plurality of entities in the document page image, a predicted document type of a document page depicted in the document page image, or a predicted answer to a question about the document page depicted in the document page image. In paragraph [0115]-ZENG discloses the multimodal fusion module may include a transformer model 1920); wherein the pretrained transformer neural network comprises transformer layers to capture bidirectional contexts (Fig. 15. Paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens (wherein BERT stands for Bi-directional Encoder Representation from Transformers). Please also read paragraph [0061]); and wherein the pretrained transformer neural network generated the contextual text embedding (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 generates, for each of the semantic tokens, a corresponding semantic token embedding) based on: tokenizing the text into tokens (Fig. 15. Paragraph [0093]-ZENG discloses based on at least one input document page image that depicts a corresponding page of a document, the semantic token extraction module 1505 generates a plurality of semantic tokens that includes a plurality of word tokens and a plurality of special tokens. Each semantic token is a sequence of characters (also called a string) that represents a semantic element of the document. A semantic token may be a word token or a special token. A word token is a word or other textual semantic element of the document. A special token is a string that represents a non-textual semantic element of the document image. In paragraph [0097]-ZENG discloses the semantic token extraction module 1505 may be implemented to include a token detection module 1510 and a text recognition module 1540. The token detection module 1510 detects a plurality of semantic tokens within the document image and predicts, for each of the detected semantic tokens, a token type of the token); adding special tokens that assist in classification and in separating segments of the text (Fig. 15. Paragraph [0098]-ZENG discloses for each detected special token, the token detection module 1510 also indicates positional information for the token (e.g., the 1D and 2D positional information as described above). The token detection module 1510 may also indicate positional information for each of the predicted word tokens (e.g., the 1D and 2D positional information as described above). Additionally or alternatively, the text recognition module 1540 may indicate positional information for each of the word tokens. 
Further in paragraph [0096]-ZENG discloses the semantic tokens of the document image are modeled as a sequence of semantic tokens (e.g., in a raster (line-by-line) order), and the 1D positional information indicates an index of the semantic token within that sequence. The semantic tokens of the document image may be grouped into segments (e.g., lines of text, paragraphs, blocks of text, tables, etc.), each of which may be modeled as a sequence of semantic tokens, and the segments of the document image are modeled as a sequence of segments (e.g., in a raster order). The positional information for each semantic token may include the index of the semantic token within its corresponding segment and may also include 2D positional information, such as the index of the corresponding segment within the sequence of segments. Please also read paragraph [0093-0095, 0097 and 0103]); passing each of the tokens through the transformer layers (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include, for example, a transformer model and/or a convolutional neural network (CNN) model. A transformer model includes a stack of layers having attention mechanisms, such as multi-head self-attention mechanisms (MHSA). Further in paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens. The semantic token embedding module 1560 is initialized using a pre-trained word embedding matrix, and the matrix is further trained (e.g., fine-tuned) to embed special tokens as well to produce the token embedding matrix. In paragraph [0107]-ZENG discloses the positional information of a semantic token may include 2D positional information that indicates a location of the token within the document image (also called “layout information”). 2D positional information of a semantic token may be included in the semantic token embedding and/or may be combined with (e.g., added to) the semantic token embedding in an input layer of the multimodal fusion module 1590 (e.g., as described below). The module performing the layout embedding may include using embedding layers to embed (normalized) x-axis, y-axis, width and/or height features separately); wherein the transformer layers comprise an attention mechanism for contextually informing each token based on other tokens of the text (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include, for example, a transformer model and/or a convolutional neural network (CNN) model. A transformer model includes a stack of layers having attention mechanisms, such as multi-head self-attention mechanisms (MHSA)); obtaining token embeddings for the tokens as output of the transformer layers (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 generates, for each of the semantic tokens, a corresponding semantic token embedding); and pooling the token embeddings to yield the contextual text embedding (Fig. 15. Paragraph [0103]-ZENG discloses the information embedded in each semantic token embedding may include the content of the corresponding semantic token (e.g., textual content for a word token, token definition for a predicted special token (‘_c_’, ‘_sg_’, ‘_st_’)). 
The information embedded in each semantic token embedding may also include the token type of the corresponding semantic token (e.g., ‘0’ for word, ‘1’ for checkbox, ‘2’ for signature, ‘3’ for stample). The semantic token embedding module 1560 may be configured to translate a vector that represents such information of the semantic token into a lower-dimensional space. In paragraph [0105]-ZENG discloses the information included in each semantic token embedding may also include positional information of the corresponding semantic token. The semantic token embedding module 1560 may add 1D positional information and possibly 2D positional information (e.g., segment index) of the token to the content embedding generated by the token embedding matrix to produce the corresponding semantic token embedding. Each positional embedding may have the same dimension as the content embeddings, such that the various embeddings for each semantic token can be combined by summation. Please also read paragraph [0107]). Regarding claim 6, ZENG in view of ZHANG and in further view of LUO explicitly teach the method of claim 1, ZENG further teaches further comprising: fusing the reduced dimensionality dense image embedding, the reduced dimensionality contextual text embedding (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 may be configured to translate a vector that represents such information of the semantic token into a lower-dimensional space. Further in paragraph [0111]-ZENG discloses the visual token embedding module 1580 may apply a trainable linear projection to each flattened visual token to align the dimensionality of the resulting visual token embeddings with that of the semantic token embedding. In paragraph [0115]-ZENG discloses the multimodal fusion module may include an input processing module 1910, a transformer model 1920, and one or more output layers 1930. The input processing module 1910 may assemble the input to the trained model by concatenating the semantic token embeddings and the visual token embeddings. The input processing module 1910 may also embed the layout information (e.g., may include one or more embedding layers) and combine the layout embeddings with (e.g., add them to) the corresponding token embeddings); and ZENG in view of ZHANG fail to explicitly teach determined the numerical score based on the third dense neural sub-network and the fused embedding. However, LUO explicitly teaches determined the numerical score (Fig. 15. Paragraph [0149]-LUO discloses a leave-one-out evaluation strategy is used: reserve the latest interacted item for every user as the test item. Then, a list of items is generated and the items are ranked using predicted score. In the setting, t items are recommended to a user each time and the following two metrics are used to evaluate the performance of different algorithms: Hit Ratio (HR) measuring the chance that our recommendation list contains users' interested items and a weighted version of HR, termed Normalized Discounted Cumulative Gain (NDCG), which puts more weight on items that are ranked higher in the recommendation list) based on the third dense neural sub-network and the fused embedding (Fig. 15. Paragraph [0109]-LUO discloses the contextual information extraction module 304 learns the adaptive embeddings of drifting user and item attributes and user-item interactions. These adaptive embeddings can be obtained from multimodal information (e.g., categorical, numerical, textual, image). 
The module uses a feature crossing layer and the attention mechanism (layer) for better representation learning of the feature interaction. In paragraph [0111]-LUO discloses the invention uses one-hot embedding of the categorical data initially, and feeds the sparse one-hot vectors to the feature crossing layer to obtain dense vectors later. In paragraph [0112]-LUO discloses dense features (e.g., user ages) are normalized using min-max scalers and then fed into the feature crossing layer. In paragraph [0113]-LUO discloses a pre-trained model called Sentence-BERT, can be used to process the raw data and get fixed-length vectors as output. In paragraph [0114]-LUO discloses due to the high dimensionality of raw images, a pre-trained convolutional neural network (CNN) can be used to process the images and get fixed-length vectors as output. In paragraph [0116]-LUO discloses after one-hot embedding of categorical features, preprocessing of dense features, and embedding of text and image data, embeddings of user and item attributes and user-item interactions can be obtained. Instead of using the embedding vector x.sub.0(t) directly, it is passed through a feature crossing network to learn high-order interaction across features and a fusion layer to reduce the dimensionality). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG and in further view of LUO of having a method comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content, with the teachings of LUO having determined the numerical score based on the third dense neural sub-network and the fused embedding. Wherein having ZENG’s method having determined the numerical score based on the third dense neural sub-network and the fused embedding. The motivation behind the modification would have been to obtain a method that improves data processing and predictions using transformer models, since both ZENG and LUO concern systems and methods for processing multi-modal data using transformer neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while LUO provides systems and methods that improve the performance of transformer models and help to disentangle the relative static user-item interactions and the more dynamic contexts to provide improved methods. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and LUO et al. (US 20240184835 A1), Abstract and Paragraph [0051, 0150, 0155]. Regarding claim 8, ZENG in view of ZHANG and in further view of LUO explicitly teach the method of claim 1, ZENG further teaches wherein the reduced dimensionality dense image embedding and the reduced dimensionality contextual text embedding (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 generates, for each of the semantic tokens, a corresponding semantic token embedding.
Further in paragraph [0111]-ZENG discloses the visual token embedding module 1580 may apply a trainable linear projection to each flattened visual token to align the dimensionality of the resulting visual token embeddings with that of the semantic token embedding). Regarding claim 9, ZENG explicitly teaches a system (Fig. 14, #1400 and 1500 called a computing device and a document image processing system, respectively. Paragraph [0085 and 0090]) comprising: at least one processor (Fig. 14, #1410 called a processor. Paragraph [0085]); memory storing instructions to be executed by the at least one processor (Fig. 14, #1420 called memory. Paragraph [0085]), the instructions for: in a machine learning pipeline stored in the memory and executed by the at least one processor (Fig. 14. Paragraph [0085]-ZENG discloses FIG. 14 shows an example computing device 1400 suitable for implementing aspects of the techniques and technologies. Please also see Fig. 15 and read paragraph [0090]): determining, by a second pretrained dense neural sub-network (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include, for example, a transformer model and/or a convolutional neural network (CNN) model. In paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens. Further in paragraph [0113]-ZENG discloses the visual token embedding module 1580 may be implemented to include an image embedding portion of a vision transformer or multimodal transformer. In paragraph [0117]-ZENG discloses the transformer model 1920 may be pre-trained as a multimodal fusion transformer model. In paragraph [0122]-ZENG discloses the one or more output layers 1930 of the multimodal fusion module 1590 may be configured according to a desired downstream task. Such layers may include, for example, one or more of any of the following: a linear (e.g., fully connected) layer, a feed-forward neural network (e.g., multi-layer perceptron), a softmax layer. Please also read paragraph [0061]), a reduced dimensionality contextual text embedding from a contextual text embedding, the contextual text embedding encapsulating features of a text associated with the multimodal content (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 generates, for each of the semantic tokens, a corresponding semantic token embedding. The information embedded in each semantic token embedding may include the content of the corresponding semantic token (e.g., textual content for a word token, token definition for a predicted special token (‘_c_’, ‘_sg_’, ‘_st_’)). The information embedded in each semantic token embedding may also include the token type of the corresponding semantic token (e.g., ‘0’ for word, ‘1’ for checkbox, ‘2’ for signature, ‘3’ for stample). The semantic token embedding module 1560 may be configured to translate a vector that represents such information of the semantic token into a lower-dimensional space); fusing the reduced dimensionality dense image embedding and the reduced dimensionality contextual text embedding to yield a fused embedding (Fig. 15. Paragraph [0115]-ZENG discloses the multimodal fusion module may include an input processing module 1910, a transformer model 1920, and one or more output layers 1930.
The input processing module 1910 may assemble the input to the trained model by concatenating the semantic token embeddings and the visual token embeddings. The input processing module 1910 may also embed the layout information (e.g., may include one or more embedding layers) and combine the layout embeddings with (e.g., add them to) the corresponding token embeddings); Although ZENG explicitly teaches determining, by a first pretrained neural sub-network of the machine learning pipeline (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include a transformer model and/or a convolutional neural network (CNN) model. In paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens. Further in paragraph [0113]-ZENG discloses the visual token embedding module 1580 may be implemented to include an image embedding portion of a vision transformer or multimodal transformer. In paragraph [0117]-ZENG discloses the transformer model 1920 may be pre-trained as a multimodal fusion transformer model. In paragraph [0122]-ZENG discloses the one or more output layers 1930 of the multimodal fusion module 1590 may include one or more of any of the following: a linear (e.g., fully connected) layer, a feed-forward neural network (e.g., multi-layer perceptron), a softmax layer), a reduced dimensionality dense image embedding from a dense image embedding, the dense image embedding encapsulating features of a digital image associated with a multimodal content (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 may be configured to translate a vector that represents such information of the semantic token into a lower-dimensional space. Further in paragraph [0111]-ZENG discloses the visual token embedding module 1580 may apply a trainable linear projection to each flattened visual token to align the dimensionality of the resulting visual token embeddings with that of the semantic token embedding); ZENG is silent on a first pretrained dense neural sub-network. However, ZHANG explicitly teaches determining, by a first pretrained dense neural sub-network of the machine learning pipeline (Fig. 3. Paragraph [0044]-ZHANG discloses image encoder 325 may be a convolutional neural network model, such as an EfficientNet model, and may encode event image data 310-b and 310-c. The convolutional neural network model may be trained on an ImageNet dataset. In paragraph [0045]-ZHANG discloses after encoding, the embeddings may be normalized before being included in a set of vectors for each of the entity multimodal model and the event multimodal. A Dense neural network model 340 may be used to encode embeddings of a given dimension into a D-dimensional embedding (or other dimensional embedding supported for training a sequential model). Dense neural network model 340-a may encode each categorical data point of the entity category data 305-c, and generate an embedding 335 that is C-dimensional. A dense neural network model 340-b may encode embedding 330 having G-dimension into an embedding of a D-dimension that is included in the set of vectors 350. Dense neural network model 340-c may encode embedding 335 having C-dimension into an embedding of a D-dimension that is included in the set of vectors 350. 
Dense neural network model 340-d may encode embeddings 365 and 370, each having E-dimension into respective embeddings of D-dimension that are included in the set of vectors 380)), a dimensionality dense image embedding from a dense image embedding, the dense image embedding encapsulating features of a digital image associated with a multimodal content (Fig. 3. Paragraph [0041]-ZHANG discloses an entity may have multimodal data 305, such as entity text data 305-a, entity graph data 305-b, entity category data 305-c, among others (numeric, etc.), and an event may have multimodal data 310, such as event text data 310-a, and event image data 310-b and 310-c. The multimodal data 305 of the entity and the multimodal data 310 of the event may have data associated with different formats, may have missing data of a given modality, may be of different dimensions, etc. Multimodal data 305 of the entity can apply to individual customers and customer organizations (e.g., companies, businesses), and multimodal data 310 of the event may apply to events, tracks, and sessions). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG having a system comprising: at least one processor; memory storing instructions to be executed by the at least one processor, the instructions for: in a machine learning pipeline stored in the memory and executed by the at least one processor: determining, by a first pretrained dense neural sub-network of the machine learning pipeline, with the teachings of ZHANG having determining, by a first pretrained dense neural sub-network of the machine learning pipeline, a dimensionality dense image embedding from a dense image embedding, the dense image embedding encapsulating features of a digital image associated with a multimodal content. Wherein having ZENG’s system having determining, by a first pretrained dense neural sub-network of the machine learning pipeline, a reduced dimensionality dense image embedding from a dense image embedding, the dense image embedding encapsulating features of a digital image associated with a multimodal content. The motivation behind the modification would have been to obtain a system that improves data processing and predictions, since both ZENG and ZHANG concern systems and methods for processing multi-modal data using transformer neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while ZHANG provides systems and methods that encodes data and generates embeddings using transformer networks to improve user predictions and recommendations. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and ZHANG et al. (US 20220237480 A1), Abstract and Paragraph [0025-0027 and 0050]. ZENG in view of ZHANG fail to explicitly teach determining a numerical score for the multimodal content using a third dense neural sub-network of the machine learning pipeline and the fused embedding; and ranking the multimodal content based on the numerical score. However, LUO explicitly teaches determining a numerical score (Fig. 5.
Paragraph [0120]-LUO discloses FIG. 6 illustrates operation of the user interest modelling module 306 in one embodiment. The graph embedding of the target item and user's recent interacted item are concatenated, fed into an MLP, and then the softmax function is used to get the attention weight. The attention weight is multiplied with the user's recent interacted item embeddings and the weighted sum is used to represent user interests. In paragraph [0130]-LUO discloses the inventors of the invention have realized that a tower-shaped MLP structure can be used for predicting ratings in recommender system. In paragraph [0149]-LUO discloses in the experiment a leave-one-out evaluation strategy is used: reserve the latest interacted item for every user as the test item. Then, a list of items is generated and the items are ranked using predicted score. In the setting, t items are recommended to a user each time and the following two metrics are used to evaluate the performance of different algorithms: Hit Ratio (HR) measuring the chance that our recommendation list contains users' interested items and a weighted version of HR, termed Normalized Discounted Cumulative Gain (NDCG), which puts more weight on items that are ranked higher in the recommendation list) for the multimodal content (Fig. 5. Paragraph [0075]-LUO discloses multimodal contextual information is incorporated via various embedding techniques, like specific pre-trained models for texts and images (e.g., Sentence-BERT and ResNet), pre-processing techniques (e.g., normalization and feature crossing) for other categorical and/or dense contextual features. To reduce the feature dimensionality and learn the feature interaction, the generated feature vectors are fed into the feature crossing layer to learn the higher-order non-linear feature interactions. After obtaining the extracted graph embedding and contextual embedding, the self-attention mechanism can be used to model the user interests. In paragraph [0116]-LUO discloses after one-hot embedding of categorical features, preprocessing of dense features, and embedding of text and image data, embeddings of user and item attributes and user-item interactions can be obtained. Instead of using the embedding vector x.sub.0(t) directly, it is passed through a feature crossing network to learn high-order interaction across features and a fusion layer to reduce the dimensionality (wherein the embedding vector of all the features are expressed and concatenated, and the fusion layer uses a self-attention mechanism to obtain a final embedding of the contextual information). Further in paragraph [0129]-LUO discloses with respect to global interaction learning, attention mechanism is also adopted. Specifically, all the embedding are concatenated. Please also read paragraph [0123-0128]) using a third dense neural sub-network of the machine learning pipeline and the fused embedding (Fig. 5. Paragraph [0113]-LUO discloses a pre-trained model called Sentence-BERT, can be used to process the raw data and get fixed-length vectors as output. In paragraph [0114]-LUO discloses due to the high dimensionality of raw images, a pre-trained convolutional neural network (CNN) can be used to process the images and get fixed-length vectors as output. In paragraph [0131]-LUO discloses a DNN prediction model, in particular a MLP 310, is used as the prediction network. The input of the MLP 310 includes the global feature representation generated by the global interaction module. 
The MLP 310 has fully connected layers); and ranking the multimodal content based on the numerical score (Fig. 5. Paragraph [0149]-LUO discloses in the experiment a leave-one-out evaluation strategy is used: reserve the latest interacted item for every user as the test item. Then, a list of items is generated and the items are ranked using predicted score. In the setting, t items are recommended to a user each time and the following two metrics are used to evaluate the performance of different algorithms: Hit Ratio (HR) measuring the chance that our recommendation list contains users' interested items and a weighted version of HR, termed Normalized Discounted Cumulative Gain (NDCG), which puts more weight on items that are ranked higher in the recommendation list). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG of having a system comprising: at least one processor; memory storing instructions to be executed by the at least one processor, the instructions for: in a machine learning pipeline stored in the memory and executed by the at least one processor: determining, by a first pretrained dense neural sub-network of the machine learning pipeline, with the teachings of LUO having determining a numerical score for the multimodal content using a third dense neural sub-network of the machine learning pipeline and the fused embedding; and ranking the multimodal content based on the numerical score. Wherein having ZENG’s system having determining a numerical score for the multimodal content using a third dense neural sub-network of the machine learning pipeline and the fused embedding; and ranking the multimodal content based on the numerical score. The motivation behind the modification would have been to obtain a system that improves data processing and predictions using transformer models, since both ZENG and LUO concern systems and methods for processing multi-modal data using transformer neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while LUO provides systems and methods that improve the performance of transformer models and help to disentangle the relative static user-item interactions and the more dynamic contexts to provide improved methods. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and LUO et al. (US 20240184835 A1), Abstract and Paragraph [0051, 0150, 0155]. Regarding claim 10, ZENG in view of ZHANG and in further view of LUO explicitly teach the system of claim 9, ZENG further teaches further comprising instructions for: determining, by a pretrained convolutional neural network pipeline, the dense image embedding from the digital image (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 may be configured to translate a vector that represents such information of the semantic token into a lower-dimensional space. Further in paragraph [0111]-ZENG discloses the visual token embedding module 1580 may apply a trainable linear projection to each flattened visual token to align the dimensionality of the resulting visual token embeddings with that of the semantic token embedding.
In paragraph [0115]-ZENG discloses the multimodal fusion module may include an input processing module 1910, a transformer model 1920, and one or more output layers 1930. The input processing module 1910 may assemble the input to the trained model by concatenating the semantic token embeddings and the visual token embeddings. The input processing module 1910 may also embed the layout information (e.g., may include one or more embedding layers) and combine the layout embeddings with (e.g., add them to) the corresponding token embeddings). Regarding claim 11, ZENG in view of ZHANG and in further view of LUO explicitly teach the system of claim 9, ZENG further teaches further comprising instructions for: determining, by a pretrained transformer neural network pipeline (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include, for example, a transformer model and/or a convolutional neural network (CNN) model. In paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens), the contextual text embedding from the text (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 generates, for each of the semantic tokens, a corresponding semantic token embedding. The information embedded in each semantic token embedding may include the content of the corresponding semantic token (e.g., textual content for a word token, token definition for a predicted special token (‘_c_’, ‘_sg_’, ‘_st_’)). The information embedded in each semantic token embedding may also include the token type of the corresponding semantic token (e.g., ‘0’ for word, ‘1’ for checkbox, ‘2’ for signature, ‘3’ for stample)). Regarding claim 15, ZENG in view of ZHANG and in further view of LUO explicitly teach the system of claim 9, ZENG further teaches wherein the reduced dimensionality dense image embedding (Fig. 15. Paragraph [0111]-ZENG discloses the visual token embedding module 1580 generates, for each of the visual tokens, a corresponding visual token embedding. The visual token embedding module 1580 may apply a trainable linear projection to each flattened visual token to align the dimensionality of the resulting visual token embeddings with that of the semantic token embeddings), the reduced dimensionality contextual text embedding, and the additional feature embedding each have a same dimensionality (Fig. 15. Paragraph [0093]-ZENG discloses the semantic token extraction module 1505 generates a plurality of semantic tokens that includes a plurality of word tokens and a plurality of special tokens. In paragraph [0103]-ZENG discloses the semantic token embedding module 1560 generates, for each of the semantic tokens, a corresponding semantic token embedding. The information embedded in each semantic token embedding may include the content of the corresponding semantic token (e.g., textual content for a word token, token definition for a predicted special token (‘_c_’, ‘_sg_’, ‘_st_’)). The information embedded in each semantic token embedding may also include the token type of the corresponding semantic token (e.g., ‘0’ for word, ‘1’ for checkbox, ‘2’ for signature, ‘3’ for stample). 
The semantic token embedding module 1560 may be configured to translate a vector that represents such information of the semantic token into a lower-dimensional space. In paragraph [0104]-ZENG discloses the semantic token embedding module 1560 is initialized using a pre-trained word embedding matrix, and the matrix is further trained (e.g., fine-tuned) to embed special tokens (wherein special tokens correspond to non-textual or additional features such as checkboxes or signatures). Please also read paragraph [0115]). Regarding claim 16, ZENG explicitly teaches non-transitory computer-readable medium storing instructions (Fig. 14. Paragraph [0085]-ZENG discloses FIG. 14 shows a computing device 1400 suitable for implementing aspects of the techniques and technologies. Computing device 1400 includes a processor 1410 which is in communication with a memory 1420. The processor 1410 is configured to execute processor-executable instructions stored in the memory 1420) which, when executed by at least one programmable electronic device, cause the at least one programmable electronic device to perform operations (Fig. 14. Paragraph [0090]-ZENG discloses FIG. 15 shows a block diagram of a document image processing system 1500. The document image processing system 1500 includes a semantic token extraction module 1505, a semantic token embedding module 1560, a visual token extraction module 1570, a visual token embedding module 1580, and a multimodal fusion module 1590) comprising: determining a reduced dimensionality contextual text embedding from a contextual text embedding using a second pretrained dense neural sub-network (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include, for example, a transformer model and/or a convolutional neural network (CNN) model. In paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens. Further in paragraph [0113]-ZENG discloses the visual token embedding module 1580 may be implemented to include an image embedding portion of a vision transformer or multimodal transformer. In paragraph [0117]-ZENG discloses the transformer model 1920 may be pre-trained as a multimodal fusion transformer model. In paragraph [0122]-ZENG discloses the one or more output layers 1930 of the multimodal fusion module 1590 may be configured according to a desired downstream task. Such layers may include, for example, one or more of any of the following: a linear (e.g., fully connected) layer, a feed-forward neural network (e.g., multi-layer perceptron), a softmax layer. Please also read paragraph [0061]), the contextual text embedding encapsulating features of a text associated with the multimodal content (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 generates, for each of the semantic tokens, a corresponding semantic token embedding. The information embedded in each semantic token embedding may include the content of the corresponding semantic token (e.g., textual content for a word token, token definition for a predicted special token (‘_c_’, ‘_sg_’, ‘_st_’)). The information embedded in each semantic token embedding may also include the token type of the corresponding semantic token (e.g., ‘0’ for word, ‘1’ for checkbox, ‘2’ for signature, ‘3’ for stample). 
The semantic token embedding module 1560 may be configured to translate a vector that represents such information of the semantic token into a lower-dimensional space); Although ZENG explicitly teaches determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained neural sub-network (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include a transformer model and/or a convolutional neural network (CNN) model. In paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens. Further in paragraph [0113]-ZENG discloses the visual token embedding module 1580 may be implemented to include an image embedding portion of a vision transformer or multimodal transformer. In paragraph [0117]-ZENG discloses the transformer model 1920 may be pre-trained as a multimodal fusion transformer model. In paragraph [0122]-ZENG discloses the one or more output layers 1930 of the multimodal fusion module 1590 may include one or more of any of the following: a linear (e.g., fully connected) layer, a feed-forward neural network (e.g., multi-layer perceptron), a softmax layer), the dense image embedding encapsulating features of a digital image associated with a multimodal content (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 may be configured to translate a vector that represents such information of the semantic token into a lower-dimensional space. Further in paragraph [0111]-ZENG discloses the visual token embedding module 1580 may apply a trainable linear projection to each flattened visual token to align the dimensionality of the resulting visual token embeddings with that of the semantic token embedding); ZENG is silent on a first pretrained dense neural sub-network of the machine learning pipeline. However, ZHANG explicitly teaches determining a dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network (Fig. 3. Paragraph [0044]-ZHANG discloses image encoder 325 may be a convolutional neural network model, such as an EfficientNet model, and may encode event image data 310-b and 310-c. The convolutional neural network model may be trained on an ImageNet dataset. In paragraph [0045]-ZHANG discloses after encoding, the embeddings may be normalized before being included in a set of vectors for each of the entity multimodal model and the event multimodal. A Dense neural network model 340 may be used to encode embeddings of a given dimension into a D-dimensional embedding (or other dimensional embedding supported for training a sequential model). Dense neural network model 340-a may encode each categorical data point of the entity category data 305-c, and generate an embedding 335 that is C-dimensional. A dense neural network model 340-b may encode embedding 330 having G-dimension into an embedding of a D-dimension that is included in the set of vectors 350. Dense neural network model 340-c may encode embedding 335 having C-dimension into an embedding of a D-dimension that is included in the set of vectors 350. 
Dense neural network model 340-d may encode embeddings 365 and 370, each having E-dimension into respective embeddings of D-dimension that are included in the set of vectors 380)), the dense image embedding encapsulating features of a digital image (Fig. 3. Paragraph [0044]-ZHANG discloses for image data (e.g., event image data 310-b and 310-c), one or more image encoders 325 may be used. The convolutional neural network model may be trained to encode both event image data 310-b and 310-c into sequences of N embeddings, such as embedding 365 and embedding 370, where each embedding is E-dimensional) associated with a multimodal content (Fig. 3. Paragraph [0041]-ZHANG discloses an entity may have multimodal data 305, such as entity text data 305-a, entity graph data 305-b, entity category data 305-c, among others (numeric, etc.), and an event may have multimodal data 310, such as event text data 310-a, and event image data 310-b and 310-c. The multimodal data 305 of the entity and the multimodal data 310 of the event may have data associated with different formats, may have missing data of a given modality, may be of different dimensions, etc. Multimodal data 305 of the entity can apply to individual customers and customer organizations (e.g., companies, businesses), and multimodal data 310 of the event may apply to events, tracks, and sessions). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of ZENG having a non-transitory computer-readable medium storing instructions which, when executed by at least one programmable electronic device, cause the at least one programmable electronic device to perform operations comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, with the teachings of ZHANG having determining, by a first pretrained dense neural sub-network of the machine learning pipeline, a dimensionality dense image embedding from a dense image embedding, the dense image embedding encapsulating features of a digital image associated with a multimodal content. ZENG's non-transitory computer-readable medium would thus include determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content. The motivation behind the modification would have been to obtain a non-transitory computer-readable medium that improves data processing and predictions, since both ZENG and ZHANG concern systems and methods for processing multi-modal data using transformer neural networks. ZENG provides systems and methods that encode data and generate embeddings using transformer networks to improve entity extraction and modeling predictions, which improves the overall document processing workflow, while ZHANG provides systems and methods that encode data and generate embeddings using transformer networks to improve user predictions and recommendations. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and ZHANG et al. (US 20220237480 A1), Abstract and Paragraph [0025-0027 and 0050].
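For illustration only, the following is a minimal PyTorch-style sketch of the kind of dense projection sub-networks described in the ZHANG passages cited above: small fully connected heads that map image, text, and categorical embeddings of different sizes into one shared dimension before fusion. The layer widths, example dimensions, and identifiers are assumptions of this sketch and are not taken from ZHANG, ZENG, or the claims.

```python
# Hypothetical sketch of dense projection sub-networks that align embeddings of
# different dimensionalities to a common dimension D before multimodal fusion.
import torch
import torch.nn as nn

class DenseProjection(nn.Module):
    """A small fully connected head that re-sizes an embedding to dimension d_out."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_out),
            nn.ReLU(),
            nn.Linear(d_out, d_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Assumed dimensions: image embedding E=1280, text embedding G=768,
# categorical embedding C=64, shared target dimension D=256.
project_image = DenseProjection(1280, 256)
project_text = DenseProjection(768, 256)
project_category = DenseProjection(64, 256)

image_vec = project_image(torch.randn(2, 1280))      # -> (2, 256)
text_vec = project_text(torch.randn(2, 768))         # -> (2, 256)
category_vec = project_category(torch.randn(2, 64))  # -> (2, 256)

# Once every modality shares dimension D, the vectors can be stacked or
# concatenated and handed to a downstream fusion or sequence model.
fused = torch.stack([image_vec, text_vec, category_vec], dim=1)  # -> (2, 3, 256)
```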
Although ZENG explicitly teaches determining a multimodal content using a third dense neural sub-network, the reduced dimensionality dense image embedding and the reduced dimensionality contextual text embedding (Fig. 15. Paragraph [0114]-ZENG discloses the multimodal fusion module 1590 applies a trained model to process an input that is based on the plurality of semantic token embeddings and the plurality of visual token embeddings to generate a semantic processing result that includes at least one of the following: a predicted location for each of a plurality of entities in the document page image, a predicted document type of a document page depicted in the document page image, or a predicted answer to a question about the document page depicted in the document page image. In paragraph [0115]-ZENG discloses the multimodal fusion module may include an input processing module 1910, a transformer model 1920, and one or more output layers 1930. The input processing module 1910 may assemble the input to the trained model by concatenating the semantic token embeddings and the visual token embeddings. The input processing module 1910 may also embed the layout information (e.g., may include one or more embedding layers) and combine the layout embeddings with (e.g., add them to) the corresponding token embeddings. Please also read paragraph [0103 and 0111]). ZENG in view of ZHANG fail to explicitly teach determining a numerical score of a multimodal content using a third dense neural sub-network. However, LUO explicitly teaches determining a numerical score (Fig. 5. Paragraph [0120]-LUO discloses FIG. 6 illustrates operation of the user interest modelling module 306 in one embodiment. The graph embedding of the target item and user's recent interacted item are concatenated, fed into an MLP, and then the softmax function is used to get the attention weight. The attention weight is multiplied with the user's recent interacted item embeddings and the weighted sum is used to represent user interests. In paragraph [0130]-LUO discloses the inventors of the invention have realized that a tower-shaped MLP structure can be used for predicting ratings in recommender system. In paragraph [0149]-LUO discloses in the experiment a leave-one-out evaluation strategy is used: reserve the latest interacted item for every user as the test item. Then, a list of items is generated and the items are ranked using predicted score. In the setting, t items are recommended to a user each time and the following two metrics are used to evaluate the performance of different algorithms: Hit Ratio (HR) measuring the chance that our recommendation list contains users' interested items and a weighted version of HR, termed Normalized Discounted Cumulative Gain (NDCG), which puts more weight on items that are ranked higher in the recommendation list) of a multimodal content (Fig. 5. Paragraph [0120]-LUO discloses FIG. 6 illustrates operation of the user interest modelling module 306 in one embodiment. The graph embedding of the target item and user's recent interacted item are concatenated, fed into an MLP, and then the softmax function is used to get the attention weight. The attention weight is multiplied with the user's recent interacted item embeddings and the weighted sum is used to represent user interests. In paragraph [0130]-LUO discloses the inventors of the invention have realized that a tower-shaped MLP structure can be used for predicting ratings in recommender system. 
In paragraph [0149]-LUO discloses in the experiment a leave-one-out evaluation strategy is used: reserve the latest interacted item for every user as the test item. Then, a list of items is generated and the items are ranked using predicted score. In the setting, t items are recommended to a user each time and the following two metrics are used to evaluate the performance of different algorithms: Hit Ratio (HR) measuring the chance that our recommendation list contains users' interested items and a weighted version of HR, termed Normalized Discounted Cumulative Gain (NDCG), which puts more weight on items that are ranked higher in the recommendation list) using a third dense neural sub-network (Fig. 15. Paragraph [0113]-LUO discloses a pre-trained model called Sentence-BERT, can be used to process the raw data and get fixed-length vectors as output. In paragraph [0114]-LUO discloses due to the high dimensionality of raw images, a pre-trained convolutional neural network (CNN) can be used to process the images and get fixed-length vectors as output. In paragraph [0131]-LUO discloses a DNN prediction model, in particular a MLP 310, is used as the prediction network. The input of the MLP 310 includes the global feature representation generated by the global interaction module. The MLP 310 has fully connected layers). ranking the multimodal content based on the numerical score (Fig. 5. Paragraph [0149]-LUO discloses in the experiment a leave-one-out evaluation strategy is used: reserve the latest interacted item for every user as the test item. Then, a list of items is generated and the items are ranked using predicted score. In the setting, t items are recommended to a user each time and the following two metrics are used to evaluate the performance of different algorithms: Hit Ratio (HR) measuring the chance that our recommendation list contains users' interested items and a weighted version of HR, termed Normalized Discounted Cumulative Gain (NDCG), which puts more weight on items that are ranked higher in the recommendation list). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG of having a non-transitory computer-readable medium storing instructions which, when executed by at least one programmable electronic device, cause the at least one programmable electronic device to perform operations comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, with the teachings of LUO having determining a numerical score of a multimodal content using a third dense neural sub-network. Wherein having ZENG’s non-transitory computer-readable medium having determining a numerical score of a multimodal content using a third dense neural sub-network, the reduced dimensionality dense image embedding and the reduced dimensionality contextual text embedding. The motivation behind the modification would have been to obtain a non-transitory computer-readable medium that improves data processing and predictions using transformer models, since both ZENG and LUO concern systems and methods for processing multi-modal data using transformer neural networks. 
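As a hedged illustration of the scoring-and-ranking limitation mapped to LUO above, the sketch below shows a tower-shaped MLP that consumes concatenated reduced-dimensionality embeddings, produces one numerical score per candidate item, and ranks the candidates by that score. All shapes, layer widths, and names are hypothetical and are not drawn from LUO or from the application under examination.

```python
# Hypothetical sketch of a "third dense neural sub-network" that scores and ranks
# multimodal content; sizes and names are illustrative assumptions only.
import torch
import torch.nn as nn

class ScoringHead(nn.Module):
    def __init__(self, d_in: int = 768):
        super().__init__()
        # Tower-shaped MLP: layer widths shrink toward a single scalar output.
        self.mlp = nn.Sequential(
            nn.Linear(d_in, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.mlp(fused).squeeze(-1)  # one numerical score per content item

# Score and rank five candidate items, each represented by image, text, and
# additional-feature embeddings of the same dimensionality (256 each, assumed).
image_emb, text_emb, extra_emb = (torch.randn(5, 256) for _ in range(3))
scores = ScoringHead()(torch.cat([image_emb, text_emb, extra_emb], dim=-1))
ranking = torch.argsort(scores, descending=True)  # candidate indices, best first
```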
Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while LUO provides systems and methods that improve the performance of transformer models and help to disentangle the relative static user-item interactions and the more dynamic contexts to provide improved methods. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and LUO et al. (US 20240184835 A1), Abstract and Paragraph [0051, 0150, 0155]. Regarding claim 18, ZENG in view of ZHANG and in further view of LUO explicitly teach the non-transitory computer-readable medium of claim 16, ZENG further teaches wherein the contextual text embedding is generated (Fig. 15. Paragraph [0090]-ZENG discloses FIG. 15 shows a block diagram of a document image processing system 1500 (wherein system 1500 includes embedding module 1560, a visual token extraction module 1570, a visual token embedding module 1580, and a multimodal fusion module 1590, which generate text and image embeddings to produce document predictions using multiple pre-trained neural networks/sub-networks) using a pretrained transformer neural network (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include, for example, a transformer model. A transformer model includes a stack of layers having attention mechanisms, such as multi-head self-attention mechanisms (MHSA). The transformer model may be a vision transformer model. In paragraph [0114]-ZENG discloses the multimodal fusion module 1590 applies a trained model to process an input that is based on the plurality of semantic token embeddings and the plurality of visual token embeddings to generate a semantic processing result that includes at least one of the following: a predicted location for each of a plurality of entities in the document page image, a predicted document type of a document page depicted in the document page image, or a predicted answer to a question about the document page depicted in the document page image. In paragraph [0115]-ZENG discloses the multimodal fusion module may include a transformer model 1920); wherein the pretrained transformer neural network comprises transformer layers to capture bidirectional contexts (Fig. 15. Paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens (wherein BERT stands for Bi-directional Encoder Representation from Transformers). Please also read paragraph [0061]); and wherein the operations further comprise: tokenizing the text into tokens (Fig. 15. Paragraph [0093]-ZENG discloses based on at least one input document page image that depicts a corresponding page of a document, the semantic token extraction module 1505 generates a plurality of semantic tokens that includes a plurality of word tokens and a plurality of special tokens. Each semantic token is a sequence of characters (also called a string) that represents a semantic element of the document. A semantic token may be a word token or a special token. A word token is a word or other textual semantic element of the document. A special token is a string that represents a non-textual semantic element of the document image. 
In paragraph [0097]-ZENG discloses the semantic token extraction module 1505 may be implemented to include a token detection module 1510 and a text recognition module 1540. The token detection module 1510 detects a plurality of semantic tokens within the document image and predicts, for each of the detected semantic tokens, a token type of the token); adding special tokens that assist in classification and in separating segments of the text (Fig. 15. Paragraph [0093]-ZENG discloses based on at least one input document page image that depicts a corresponding page of a document, the semantic token extraction module 1505 generates a plurality of semantic tokens that includes a plurality of word tokens and a plurality of special tokens. A special token is a string that represents a non-textual semantic element of the document image. Please also read paragraph [0094-0097); passing each of the tokens through the transformer layers (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include, for example, a transformer model. A transformer model includes a stack of layers having attention mechanisms, such as multi-head self-attention mechanisms (MHSA). Further in paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens. In paragraph [0107]-ZENG discloses the positional information of a semantic token may include 2D positional information that indicates a location of the token within the document image (also called “layout information”). 2D positional information of a semantic token may be included in the semantic token embedding and/or may be combined with (e.g., added to) the semantic token embedding in an input layer of the multimodal fusion module 1590. The module performing the layout embedding may include using embedding layers to embed (normalized) x-axis, y-axis, width and/or height features separately); wherein the transformer layers comprise an attention mechanism for contextually informing each token based on other tokens of the text (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include, for example, a transformer model and/or a convolutional neural network (CNN) model. A transformer model includes a stack of layers having attention mechanisms, such as multi-head self-attention mechanisms (MHSA)); obtaining token embeddings for the tokens as output of the transformer layers (Fig. 15. Paragraph [0103]-ZENG discloses the semantic token embedding module 1560 generates, for each of the semantic tokens, a corresponding semantic token embedding. Further in paragraph [0111]-ZENG discloses the visual token embedding module 1580 generates, for each of the visual tokens, a corresponding visual token embedding); and pooling the token embeddings to yield the contextual text embedding (Fig. 15. Paragraph [0105]-ZENG discloses the information included in each semantic token embedding may also include positional information of the corresponding semantic token. Each positional embedding may have the same dimension as the content embeddings, such that the various embeddings for each semantic token can be combined by summation. 
In paragraph [0112]-ZENG discloses each positional embedding may have the same dimension as the flattened visual tokens (e.g., after linear projection), such that the various embeddings for each visual token can be combined by summation. In paragraph [0115]-ZENG discloses the multimodal fusion module may include an input processing module 1910, a transformer model 1920, and one or more output layers 1930. The input processing module 1910 may assemble the input to the trained model by concatenating the semantic token embeddings and the visual token embeddings. The input processing module 1910 may also embed the layout information (e.g., may include one or more embedding layers) and combine the layout embeddings with (e.g., add them to) the corresponding token embeddings). Claims 2 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over ZENG et al. (US 20230351115 A1), hereinafter referenced as ZENG in view of ZHANG et al. (US 20220237480 A1), hereinafter referenced as ZHANG and in further view of LUO et al. (US 20240184835 A1), hereinafter referenced as LUO and in further view of STANLEY et al. (US 20230403363 A1), hereinafter referenced as STANLEY and in further view of BHATTACHARYYA et al. (US 20240386712 A1), hereinafter referenced as BHATTACHARYYA. Regarding claim 2, ZENG in view of ZHANG and in further view of LUO explicitly teach the method of claim 1, ZENG further teaches wherein the dense image embedding (Fig. 15. Paragraph [0111]-ZENG discloses the visual token embedding module 1580 generates, for each of the visual tokens, a corresponding visual token embedding. The visual token embedding module 1580 may generate each visual token embedding by flattening the corresponding visual token. The visual token embedding module 1580 may apply a trainable linear projection to each flattened visual token to align the dimensionality of the resulting visual token embeddings with that of the semantic token embeddings) is generated using a pretrained convolutional neural network (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include a transformer model and/or a convolutional neural network (CNN) model. In paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens. Further in paragraph [0113]-ZENG discloses the visual token embedding module 1580 may be implemented to include an image embedding portion of a vision transformer or multimodal transformer. In paragraph [0117]-ZENG discloses the transformer model 1920 may be pre-trained as a multimodal fusion transformer model. In paragraph [0122]-ZENG discloses the one or more output layers 1930 of the multimodal fusion module 1590 may include one or more of any of the following: a linear (e.g., fully connected) layer, a feed-forward neural network (e.g., multi-layer perceptron), a softmax layer.); Although ZENG explicitly teaches wherein the pretrained convolutional neural network comprises convolutional layers (Fig. 15. Paragraph [0110]-ZENG discloses the visual token extraction module 1570 produces the visual tokens by generating a feature map from the document page image. The visual token extraction module 1570 may use one or more convolutional layers (e.g., a CNN) to generate the feature map. 
The document page image may be resized (e.g., to a size of 224×224 pixels) before it is inputted to the convolutional layers, and the resulting feature map may be resized (e.g., by average-pooling)); passing feature maps through connected layers (Fig. 15. Paragraph [0110] In another example, the visual token extraction module 1570 produces the visual tokens by generating a feature map from the document page image. The visual token extraction module 1570 may use one or more convolutional layers (e.g., a CNN) to generate the feature map. The document page image may be resized (e.g., to a size of 224×224 pixels) before it is inputted to the convolutional layers, and the resulting feature map may be resized (e.g., by average-pooling). In this case, the visual token extraction module 1570 may obtain the visual tokens by flattening the feature map. For example, the visual token extraction module 1570 may be configured to flatten a feature map of height H×width W×depth F to obtain a sequence of length HW of visual tokens, where the length of each visual token in the sequence is F.); and ZENG in view of ZHANG are silent on wherein the pretrained convolutional neural network comprises convolutional layers and pooling layers; and wherein the pretrained convolutional neural network generates the dense image embedding based on: and obtaining the dense mage embedding as output of the fully connected layers. However, STANLEY explicitly teaches wherein the pretrained convolutional neural network comprises convolutional layers and pooling layers (Fig. 5. Paragraph [0053]-STANLEY discloses the neural network architecture 390 can comprise one or more neural networks that are trained to perform functions associated with detecting non-compliant content 365 in the images 360. The neural networks can include convolutional neural networks (CNNs). Each neural network can be configured to analyze images 360 and to execute deep learning functions and/or machine learning functions on the images 360. Each neural network can include a plurality of layers including one or more input layers, one or more output layers, one or more convolutional layers (e.g., that include learnable filters), one or more ReLU layers, one or more pooling layers, one or more fully connected layers, one or more detection layers, one or more upsampling layers, one or more normalization layers, etc. The neural networks and their corresponding layers can be configured to learn and execute various functions for analyzing, interpreting, and understanding the content of the images 360. The functions learned by the neural networks associated with the neural network architecture 390 can include computer vision functions that involve extracting feature information from the images 360. Functions learned by the neural networks can also include functions for performing the object detection, object classification, and/or image classification. Please also read paragraph [0055 and 0066]); and wherein the pretrained convolutional neural network (Fig. 4. Paragraph [0066]-STANLEY discloses the feature extraction network 420 can represent a CNN that is configured to extract and/or generate feature embeddings 425 from the images 360. The feature extraction network 420 can be implemented using a deep learning network that is trained to extract the feature embeddings 425. In some embodiments, the feature extraction network 420 can be implemented using VGG-16 developed by Visual Geometry Group (“VGG”). 
Other deep learning networks may also be utilized to generate feature embeddings 425 from the images 360) generates the dense image embedding (Fig. 4. Paragraph [0068]-STANLEY discloses the feature extraction network 420 may be configured to receive the images 360 provided to the electronic platform 330 and generate a feature embedding 425 corresponding to each of the images 360. The feature extraction network 420 can be trained to generate the feature embeddings 425, which comprise dense feature information, using various supervised, semi-supervised, and/or unsupervised training techniques. In paragraph [0069]-STANLEY discloses the feature embeddings 425 can include multi-dimensional vectors that are derived from raw data associated with the images 360. To generate the feature embeddings 425 and/or associated multi-dimensional vectors, the raw data of the images 360 may be represented in an embedding space having reduced dimensionality. The feature embeddings 425, or their corresponding multi-dimensional vectors, may be utilized to represent the salient or prominent features in the image 360 and/or capture semantic information corresponding to the images 360) based on: and obtaining the dense mage embedding (Fig. 4. Paragraph [0068]-STANLEY discloses the feature extraction network 420 may be configured to receive the images 360 provided to the electronic platform 330 and generate a feature embedding 425 corresponding to each of the images 360. The feature extraction network 420 can be trained to generate the feature embeddings 425, which comprise dense feature information, using various supervised, semi-supervised, and/or unsupervised training techniques. In paragraph [0069]-STANLEY discloses the feature embeddings 425 can include multi-dimensional vectors that are derived from raw data associated with the images 360. To generate the feature embeddings 425 and/or associated multi-dimensional vectors, the raw data of the images 360 may be represented in an embedding space having reduced dimensionality. The feature embeddings 425, or their corresponding multi-dimensional vectors, may be utilized to represent the salient or prominent features in the image 360 and/or capture semantic information corresponding to the images 360) as output of the fully connected layers (Fig. 4. Paragraph [0070]-STANLEY discloses the feature embeddings 425 may be extracted from one or more the fully-connected layers 520 located toward the output side of the VGG-16 network). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG and in further view of LOU of having a method comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content, with the teachings of STANLEY having wherein the pretrained convolutional neural network comprises convolutional layers and pooling layers; and wherein the pretrained convolutional neural network generates the dense image embedding based on: and obtaining the dense mage embedding as output of the fully connected layers. 
Wherein having ZENG’s method having wherein the pretrained convolutional neural network comprises convolutional layers and pooling layers; and wherein the pretrained convolutional neural network generates the dense image embedding based on: and obtaining the dense mage embedding as output of the fully connected layers. The motivation behind the modification would have been to obtain a method that improves data processing and predictions, since both ZENG and STANLEY concern systems and methods for processing multi-modal data using neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while STANLEY provides systems and methods that improve identification accuracy, user experiences and speed and efficiency when dealing with large datasets. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and STANLEY et al. (US 20230403363 A1), Abstract and Paragraph [0096-0097]. ZENG in view of ZHANG fail to explicitly teach extracting hierarchical features from the digital image by applying convolutional operations in parallel to capture the hierarchical features at different scales; passing feature maps through fully connected layers. However, BHATTACHARYYA explicitly teaches extracting hierarchical features from the digital image (Fig. 5. Paragraph [0064]-BHATTACHARYYA discloses FIG. 5 is a block diagram illustrating an example architecture of a multi-modal LLM model 500. The multi-modal LLM model 500 may include a convolutional neural network (CNN) 502 and an LLM 506. In paragraph [0065]-BHATTACHARYYA discloses the LRR model 500 may receive an interleaved stream of visual inputs 520 and text inputs 522. In paragraph [0066]-BHATTACHARYYA discloses the CNN 502 may comprise a residual network (ResNet). The CNN 502 may receive the visual inputs I of the interleaved sequence of visual inputs 520 and/or text inputs 522. The CNN 502 may extract lower-level grid-level visual features of the visual inputs I (520) to preserve lower-level visual information. The CNN 502 may encode the visual input sequence I=(v.sub.1, . . . , v.sub.t.sub.v) into a sequence of grid-level features Ī=(v.sub.1, . . . , v.sub.t.sub.v) where v.sub.i=CNN (v.sub.i) and v.sub.i∈custom-character.sup.gxq′, where g represents the size of the grid of the visual input and q′ represents the dimensionality of the CNN embedding space. The grid-level visual features may be supplied to multilayer perceptrons (MLPs) 504a, 504b. The MLPs 504a, 504b may map the grid-level features for use by the LLM 506 such that the LRR model 500 may employ the grid-level visual features obtained from the CNN 502 to perform visual reasoning tasks. In paragraph [0068]-BHATTACHARYYA discloses the LLM 506 may include a self-attention block 508 and a top-down attention block 510. The self-attention block 508 may include self-attention layers 514a, 514b. The self-attention block 508 may receive the text input 522 (e.g., tokens) at a first self-attention layer 514a. The self-attention layers 514a, 514b of the LLM 506 may successively process the input text 522 by applying self-attention. 
Self-attention may be considered a mechanism that compares each element (e.g., a multi-scale visual feature) of an input with other elements (e.g., the other multi-scale visual features) of the input to identify portions of the input to attend to or focus on) by applying convolutional operations in parallel to capture the hierarchical features at different scales (Fig. 5. Paragraph [0069]-BHATTACHARYYA discloses the first self-attention layer 514a may be referred to a first embedding layer of the LLM 506 and may encode the text inputs 522 (e.g., tokens) whereas subsequent layers of the LLM 506 may encode progressively richer and more information-dense representations, then encode increasingly global information. The embedding layers higher in the hierarchy in the top-down attention mechanism may guide the information extraction process from visual inputs I (520). In paragraph [0070]-BHATTACHARYYA discloses the top-down attention block 510 may serve as a mechanism to enable the LLM 506 to directly extract lower-level visual information from the grid-level features of the visual inputs I (520). The top-down attention block 510 may include cross-attention layers 516a, 516b. The cross-attention layers 516a, 516b may be interleaved between self-attention layers in higher layers (e.g., after the self-attention block 508) of the LLM 506. The cross-attention layers 516a, 516b may exploit the rich hierarchical representation encoded in the hidden states of the LLM 506, where t represents a sequence length t=t.sub.v+t.sub.s and q represents the dimensionality of the embedding space. In paragraph [0071]-The LRR model 500 may employ the cross-attention layers 516a, 516b at higher levels of the hierarchical representation space of the LLM 506 to integrate and “look” for the visual information in the grid-level visual features (v.sub.i) extracted by the CNN 502. The grid level visual features may be transformed using a multi-layer perceptron (MLP) for each cross-attention layer 516a, 516b. Please also read paragraph [0072-0073 and 0077-0080]); passing feature maps through fully connected layers (Fig. 5. Paragraph [0051]-BHATTACHARYYA discloses the convolution layers 356 may include one or more convolutional filters, which may be applied to the input data to generate a feature map. The normalization layer 358 may normalize the output of the convolution filters. In paragraph [0053]-BHATTACHARYYA discloses the DCN 350 may also include one or more fully connected layers 362 (FC1 and FC2). The output of each of the layers (e.g., 356, 358, 360, 362, 364) may serve as an input of a succeeding one of the layers (e.g., 356, 358, 360, 362, 364) in the DCN 350 to learn hierarchical feature representations from input data 352 (e.g., images, audio, video, sensor data and/or other input data) supplied at the first of the convolution blocks 354A. The output of the DCN 350 is a classification score 366 for the input data 352. The classification score 366 may be a set of probabilities, where each probability is the probability of the input data including a feature from a set of features). 
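For illustration only, the sketch below shows one generic way the claim 2 language just mapped could be realized: parallel convolutions with different kernel sizes capture features at different scales, a pooling layer condenses the resulting feature maps, and fully connected layers yield the dense image embedding. The branch layout and sizes are assumptions of this sketch and do not represent BHATTACHARYYA's architecture or the applicant's.

```python
# Hypothetical multi-scale image embedder: parallel convolutions at different
# kernel sizes, pooling, then fully connected layers producing a dense embedding.
import torch
import torch.nn as nn

class MultiScaleImageEmbedder(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Parallel convolutional branches applied to the same input.
        self.branch3 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(3, 32, kernel_size=5, padding=2)
        self.branch7 = nn.Conv2d(3, 32, kernel_size=7, padding=3)
        self.pool = nn.AdaptiveAvgPool2d(4)            # pooling layer
        self.fc = nn.Sequential(                        # fully connected layers
            nn.Linear(3 * 32 * 4 * 4, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = torch.cat(
            [self.branch3(image), self.branch5(image), self.branch7(image)], dim=1
        )                                               # multi-scale feature maps
        feats = self.pool(feats).flatten(1)             # pool and flatten
        return self.fc(feats)                           # dense image embedding

embedding = MultiScaleImageEmbedder()(torch.randn(1, 3, 224, 224))  # -> (1, 512)
```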
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of ZENG in view of ZHANG and in further view of LUO and in further view of STANLEY of having a method comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content, with the teachings of BHATTACHARYYA having extracting hierarchical features from the digital image by applying convolutional operations in parallel to capture the hierarchical features at different scales; passing feature maps through fully connected layers. ZENG's method would thus include extracting hierarchical features from the digital image by applying convolutional operations in parallel to capture the hierarchical features at different scales and passing feature maps through fully connected layers. The motivation behind the modification would have been to obtain a method that improves data processing and predictions, since both ZENG and BHATTACHARYYA concern systems and methods for processing multi-modal data using transformer networks. ZENG provides systems and methods that encode data and generate embeddings using transformer networks to improve entity extraction and modeling predictions, which improves the overall document processing workflow, while BHATTACHARYYA provides systems and methods that improve visual reasoning using deep learning. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and BHATTACHARYYA et al. (US 20240386712 A1), Abstract and Paragraph [0006-0011 and 0145]. Regarding claim 17, ZENG in view of ZHANG and in further view of LUO explicitly teach the non-transitory computer-readable medium of claim 16. ZENG further teaches wherein the dense image embedding is generated (Fig. 15. Paragraph [0111]-ZENG discloses the visual token embedding module 1580 generates, for each of the visual tokens, a corresponding visual token embedding. The visual token embedding module 1580 may generate each visual token embedding by flattening the corresponding visual token. The visual token embedding module 1580 may apply a trainable linear projection to each flattened visual token to align the dimensionality of the resulting visual token embeddings with that of the semantic token embeddings) using a pretrained convolutional neural network (Fig. 15. Paragraph [0099]-ZENG discloses the token detection module 1510 may be implemented using a pre-trained token detection model, which may include a transformer model and/or a convolutional neural network (CNN) model. In paragraph [0104]-ZENG discloses the semantic token embedding module 1560 may be implemented to include a token embedding matrix (e.g., of a BERT or other language transformer model) that has been extended to handle special tokens. Further in paragraph [0113]-ZENG discloses the visual token embedding module 1580 may be implemented to include an image embedding portion of a vision transformer or multimodal transformer. In paragraph [0117]-ZENG discloses the transformer model 1920 may be pre-trained as a multimodal fusion transformer model.
In paragraph [0122]-ZENG discloses the one or more output layers 1930 of the multimodal fusion module 1590 may include one or more of any of the following: a linear (e.g., fully connected) layer, a feed-forward neural network (e.g., multi-layer perceptron), a softmax layer); ZENG in view of ZHANG are silent on wherein the pretrained convolutional neural network comprises convolutional layers and pooling layers; and wherein the operations further comprise: and obtaining the dense mage embedding as output of the fully connected layers. However, STANLEY explicitly teaches wherein the pretrained convolutional neural network comprises convolutional layers and pooling layers (Fig. 5. Paragraph [0053]-STANLEY discloses the neural network architecture 390 can comprise one or more neural networks that are trained to perform functions associated with detecting non-compliant content 365 in the images 360. The neural networks can include convolutional neural networks (CNNs). Each neural network can be configured to analyze images 360 and to execute deep learning functions and/or machine learning functions on the images 360. Each neural network can include a plurality of layers including one or more input layers, one or more output layers, one or more convolutional layers (e.g., that include learnable filters), one or more ReLU layers, one or more pooling layers, one or more fully connected layers, one or more detection layers, one or more upsampling layers, one or more normalization layers, etc. The neural networks and their corresponding layers can be configured to learn and execute various functions for analyzing, interpreting, and understanding the content of the images 360. The functions learned by the neural networks associated with the neural network architecture 390 can include computer vision functions that involve extracting feature information from the images 360. Functions learned by the neural networks can also include functions for performing the object detection, object classification, and/or image classification. Please also read paragraph [0055 and 0066]); and wherein the operations further comprise: and obtaining the dense mage embedding (Fig. 4. Paragraph [0068]-STANLEY discloses the feature extraction network 420 may be configured to receive the images 360 provided to the electronic platform 330 and generate a feature embedding 425 corresponding to each of the images 360. The feature extraction network 420 can be trained to generate the feature embeddings 425, which comprise dense feature information, using various supervised, semi-supervised, and/or unsupervised training techniques. In paragraph [0069]-STANLEY discloses the feature embeddings 425 can include multi-dimensional vectors that are derived from raw data associated with the images 360. To generate the feature embeddings 425 and/or associated multi-dimensional vectors, the raw data of the images 360 may be represented in an embedding space having reduced dimensionality. The feature embeddings 425, or their corresponding multi-dimensional vectors, may be utilized to represent the salient or prominent features in the image 360 and/or capture semantic information corresponding to the images 360) as output of the fully connected layers (Fig. 4. Paragraph [0070]-STANLEY discloses the feature embeddings 425 may be extracted from one or more the fully-connected layers 520 located toward the output side of the VGG-16 network). 
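As a hedged illustration of obtaining a dense image embedding as the output of fully connected layers of a pretrained convolutional neural network, the sketch below truncates a VGG-16-style backbone after its first fully connected block. The choice of torchvision's VGG-16, the layer indices, and the use of downloaded pretrained weights are assumptions made only for this example.

```python
# Hypothetical sketch: read a dense image embedding from a fully connected layer
# of a pretrained CNN backbone (VGG-16 here, chosen only as an example).
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)  # requires torchvision >= 0.13
vgg.eval()

# Keep the convolutional/pooling feature extractor plus the first fully connected
# block of the classifier; its activation serves as the dense image embedding.
embedder = torch.nn.Sequential(
    vgg.features,                            # convolutional and pooling layers
    vgg.avgpool,
    torch.nn.Flatten(),
    *list(vgg.classifier.children())[:2],    # first Linear + ReLU -> 4096-dim output
)

with torch.no_grad():
    dense_image_embedding = embedder(torch.randn(1, 3, 224, 224))  # shape (1, 4096)
```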
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of ZENG in view of ZHANG and in further view of LUO of having a non-transitory computer-readable medium storing instructions which, when executed by at least one programmable electronic device, cause the at least one programmable electronic device to perform operations comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, with the teachings of STANLEY having wherein the pretrained convolutional neural network comprises convolutional layers and pooling layers; and wherein the pretrained convolutional neural network generates the dense image embedding based on: and obtaining the dense mage embedding as output of the fully connected layers. ZENG's non-transitory computer-readable medium would thus have a pretrained convolutional neural network that comprises convolutional layers and pooling layers and that generates the dense image embedding by passing feature maps through fully connected layers and obtaining the dense mage embedding as output of the fully connected layers. The motivation behind the modification would have been to obtain a non-transitory computer-readable medium that improves data processing and predictions, since both ZENG and STANLEY concern systems and methods for processing multi-modal data using neural networks. ZENG provides systems and methods that encode data and generate embeddings using transformer networks to improve entity extraction and modeling predictions, which improves the overall document processing workflow, while STANLEY provides systems and methods that improve identification accuracy, user experiences and speed and efficiency when dealing with large datasets. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and STANLEY et al. (US 20230403363 A1), Abstract and Paragraph [0096-0097]. ZENG in view of ZHANG fail to explicitly teach extracting hierarchical features from the digital image by applying convolutional operations in parallel to capture the hierarchical features at different scales; passing feature maps through fully connected layers. However, BHATTACHARYYA explicitly teaches extracting hierarchical features from the digital image (Fig. 5. Paragraph [0064]-BHATTACHARYYA discloses FIG. 5 is a block diagram illustrating an example architecture of a multi-modal LLM model 500. The multi-modal LLM model 500 may include a convolutional neural network (CNN) 502 and an LLM 506. In paragraph [0065]-BHATTACHARYYA discloses the LRR model 500 may receive an interleaved stream of visual inputs 520 and text inputs 522. In paragraph [0066]-BHATTACHARYYA discloses the CNN 502 may comprise a residual network (ResNet). The CNN 502 may receive the visual inputs I of the interleaved sequence of visual inputs 520 and/or text inputs 522. The CNN 502 may extract lower-level grid-level visual features of the visual inputs I (520) to preserve lower-level visual information. The CNN 502 may encode the visual input sequence into a sequence of grid-level features, where g represents the size of the grid of the visual input and q′ represents the dimensionality of the CNN embedding space. The grid-level visual features may be supplied to multilayer perceptrons (MLPs) 504a, 504b.
The MLPs 504a, 504b may map the grid-level features for use by the LLM 506 such that the LRR model 500 may employ the grid-level visual features obtained from the CNN 502 to perform visual reasoning tasks. In paragraph [0068]-BHATTACHARYYA discloses the LLM 506 may include a self-attention block 508 and a top-down attention block 510. The self-attention block 508 may include self-attention layers 514a, 514b. The self-attention block 508 may receive the text input 522 (e.g., tokens) at a first self-attention layer 514a. The self-attention layers 514a, 514b of the LLM 506 may successively process the input text 522 by applying self-attention. Self-attention may be considered a mechanism that compares each element (e.g., a multi-scale visual feature) of an input with other elements (e.g., the other multi-scale visual features) of the input to identify portions of the input to attend to or focus on) by applying convolutional operations in parallel to capture the hierarchical features at different scales (Fig. 5. Paragraph [0069]-BHATTACHARYYA discloses the first self-attention layer 514a may be referred to a first embedding layer of the LLM 506 and may encode the text inputs 522 (e.g., tokens) whereas subsequent layers of the LLM 506 may encode progressively richer and more information-dense representations, then encode increasingly global information. The embedding layers higher in the hierarchy in the top-down attention mechanism may guide the information extraction process from visual inputs I (520). In paragraph [0070]-BHATTACHARYYA discloses the top-down attention block 510 may serve as a mechanism to enable the LLM 506 to directly extract lower-level visual information from the grid-level features of the visual inputs I (520). The top-down attention block 510 may include cross-attention layers 516a, 516b. The cross-attention layers 516a, 516b may be interleaved between self-attention layers in higher layers (e.g., after the self-attention block 508) of the LLM 506. The cross-attention layers 516a, 516b may exploit the rich hierarchical representation encoded in the hidden states of the LLM 506, where t represents a sequence length and q represents the dimensionality of the embedding space. In paragraph [0071]-BHATTACHARYYA discloses the LRR model 500 may employ the cross-attention layers 516a, 516b at higher levels of the hierarchical representation space of the LLM 506 to integrate and “look” for the visual information in the grid-level visual features (v.sub.i) extracted by the CNN 502. The grid level visual features Ī may be transformed using a multi-layer perceptron (MLP) for each cross-attention layer 516a, 516b. Please also read paragraph [0072-0073 and 0077-0080]); passing feature maps through fully connected layers (Fig. 5. Paragraph [0051]-BHATTACHARYYA discloses the convolution layers 356 may include one or more convolutional filters, which may be applied to the input data to generate a feature map. The normalization layer 358 may normalize the output of the convolution filters. In paragraph [0053]-BHATTACHARYYA discloses the DCN 350 may also include one or more fully connected layers 362 (FC1 and FC2). The DCN 350 may further include a logistic regression (LR) layer 364. 
The output of each of the layers (e.g., 356, 358, 360, 362, 364) may serve as an input of a succeeding one of the layers (e.g., 356, 358, 360, 362, 364) in the DCN 350 to learn hierarchical feature representations from input data 352 (e.g., images, audio, video, sensor data and/or other input data) supplied at the first of the convolution blocks 354A. The output of the DCN 350 is a classification score 366 for the input data 352. The classification score 366 may be a set of probabilities, where each probability is the probability of the input data including a feature from a set of features). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG and in further view of LOU and in further view of STANLEY of having a non-transitory computer-readable medium storing instructions which, when executed by at least one programmable electronic device, cause the at least one programmable electronic device to perform operations comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, with the teachings of BHATTACHARYYA having extracting hierarchical features from the digital image by applying convolutional operations in parallel to capture the hierarchical features at different scales; passing feature maps through fully connected layers. Wherein having ZENG’s non-transitory computer-readable medium having extracting hierarchical features from the digital image by applying convolutional operations in parallel to capture the hierarchical features at different scales; passing feature maps through fully connected layers. The motivation behind the modification would have been to obtain a non-transitory computer-readable medium that improves data processing and predictions, since both ZENG and BHATTACHARYYA concern systems and methods for processing multi-modal data using transformer networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while BHATTACHARYYA provides systems and methods that improves visual reasoning using deep learning. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and BHATTACHARYYA et al. (US 20240386712 A1), Abstract and Paragraph [0006-0011 and 0145]. Claims 4-5, 12-13 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over ZENG et al. (US 20230351115 A1), hereinafter referenced as ZENG in view of ZHANG et al. (US 20220237480 A1), hereinafter referenced as ZHANG and in further view of LUO et al. (US 20240184835 A1), hereinafter referenced as LUO and in further view of WARD et al. (US 20200035219 A1), hereinafter referenced as WARD. Regarding claim 4, ZENG in view of ZHANG and in further view of LUO explicitly teach the method of claim 1, ZENG in view of ZHANG fail to explicitly teach wherein the first pretrained dense neural sub-network comprises fully connected layers that successively reduce a dimensionality of an input. However, WARD explicitly teaches wherein the first pretrained dense neural sub-network (Fig. 4. 
Paragraph [0042]-WARD discloses speech recognition system 200 comprises front-end module 201, convolutional neural network (CNN) stack 202, first fully-connected layer 203, recurrent neural network (RNN) stack 204, second fully-connected layer 205, output neural network stack 206, and optional customization layer 207 (wherein a word embedding is predicted from audio input). In paragraph [0061]-WARD discloses FIG. 4 illustrates an example CNN stack architecture. Segments of frames are processed by one or more convolutional and pooling neural network layers that make up a convolutional neural network stack such as CNN stack 202. In paragraph [0062]-WARD discloses first fully-connected layer 203 receives features from CNN stack 202 and produces a second set of features. A fully-connected neural network is a neural network in which all nodes in a layer of the neural network are connected to all nodes of the subsequent layer of the neural network. In paragraph [0068]-WARD discloses Recurrent Neural Network (RNN) stack 204 receives these features from first fully-connected stack 203 and produces a third set of features) comprises fully connected layers that successively reduce a dimensionality of an input (Fig. 4. Paragraph [0065]-WARD discloses first fully-connected layer 203 may resize the output of CNN stack 202 for consumption by the subsequent stack. CNN stack 202 may produce a high dimensioned output based on the number of feature maps used and the frequency context of the output. The first fully-connected layer 203 may reduce the dimension of this output to reduce the number of parameters subsequent stack need to process. In paragraph [0074]-WARD discloses a second fully-connected stack 205 receives the output features from RNN stack 204 and produces a word embedding. Second fully-connected stack 205 reduces the dimensionality of the output of RNN stack 204 to something more concise. Second fully-connected stack 205 produces a word embedding of significantly reduced dimension compared to the output of RNN stack 204. This word embedding contains information related to the word predicted for a given time frame, and also information regarding words around the predicted word). Please also read paragraph [0057-0058]. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of ZENG in view of ZHANG and in further view of LUO of having a method comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content, with the teachings of WARD having wherein the first pretrained dense neural sub-network comprises fully connected layers that successively reduce a dimensionality of an input. ZENG's method would thus have a first pretrained dense neural sub-network comprising fully connected layers that successively reduce a dimensionality of an input. The motivation behind the modification would have been to obtain a method that improves data processing and predictions using neural network models, since both ZENG and WARD concern embeddings and neural networks.
Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while WARD provides systems and methods that improve word prediction using embeddings and neural network models. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and WARD et al. (US 20200035219 A1), Abstract and Paragraph [0005-0009]. Regarding claim 5, ZENG in view of ZHANG and in further view of LUO explicitly teach the method of claim 1, ZENG in view of ZHANG fail to explicitly teach wherein the second pretrained dense neural sub-network comprises fully connected layers that successively reduce a dimensionality of an input. However, WARD explicitly teaches wherein the second pretrained dense neural sub-network (Fig. 4. Paragraph [0042]-WARD discloses speech recognition system 200 comprises front-end module 201, convolutional neural network (CNN) stack 202, first fully-connected layer 203, recurrent neural network (RNN) stack 204, second fully-connected layer 205, output neural network stack 206, and optional customization layer 207 (wherein a word embedding is predicted from audio input). In paragraph [0061]-WARD discloses FIG. 4 illustrates an example CNN stack architecture. Segments of frames are processed by one or more convolutional and pooling neural network layers that make up a convolutional neural network stack such as CNN stack 202. In paragraph [0062]-WARD discloses first fully-connected layer 203 receives features from CNN stack 202 and produces a second set of features. A fully-connected neural network is a neural network in which all nodes in a layer of the neural network are connected to all nodes of the subsequent layer of the neural network. In paragraph [0068]-WARD discloses Recurrent Neural Network (RNN) stack 204 receives these features from first fully-connected stack 203 and produces a third set of features) comprises fully connected layers that successively reduce a dimensionality of an input (Fig. 4. Paragraph [0065]-WARD discloses first fully-connected layer 203 may resize the output of CNN stack 202 for consumption by the subsequent stack. CNN stack 202 may produce a high dimensioned output based on the number of feature maps used and the frequency context of the output. The first fully-connected layer 203 may reduce the dimension of this output to reduce the number of parameters subsequent stack need to process. In paragraph [0074]-WARD discloses a second fully-connected stack 205 receives the output features from RNN stack 204 and produces a word embedding. Second fully-connected stack 205 reduces the dimensionality of the output of RNN stack 204 to something more concise. Second fully-connected stack 205 produces a word embedding of significantly reduced dimension compared to the output of RNN stack 204. This word embedding contains information related to the word predicted for a given time frame, and also information regarding words around the predicted word). Please also read paragraph [0057-0058]). 
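The feature WARD is cited for, fully connected layers that successively reduce the dimensionality of an input, can be pictured as a stack of linear layers whose widths shrink at each step. The sketch below is a minimal PyTorch-style illustration under that reading; the specific sizes (2048, 1024, 512, 128) are placeholders and are not taken from WARD or from the claims.

import torch
import torch.nn as nn

# Each linear layer is narrower than the one before it, so the dimensionality of the
# input embedding is reduced successively.
reduce_dim = nn.Sequential(
    nn.Linear(2048, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
)

dense_image_embedding = torch.randn(4, 2048)  # e.g., output of an upstream image encoder
reduced = reduce_dim(dense_image_embedding)
print(reduced.shape)  # torch.Size([4, 128])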
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG and in further view of LOU of having a method comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content, with the teachings of WARD having wherein the second pretrained dense neural sub-network comprises fully connected layers that successively reduce a dimensionality of an input. Wherein having ZENG’s method having wherein the second pretrained dense neural sub-network comprises fully connected layers that successively reduce a dimensionality of an input. The motivation behind the modification would have been to obtain a method that improves data processing and predictions using neural network models, since both ZENG and WARD concern embeddings and neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while WARD provides systems and methods that improve word prediction using embeddings and neural network models. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and WARD et al. (US 20200035219 A1), Abstract and Paragraph [0005-0009]. Regarding claim 12, ZENG in view of ZHANG and in further view of LUO explicitly teach the system of claim 9, ZENG in view of ZHANG fail to explicitly teach wherein the first pretrained dense neural sub-network comprises fully connected layers that successively reduce a dimensionality of an input. However, WARD explicitly teaches wherein the first pretrained dense neural sub-network (Fig. 4. Paragraph [0042]-WARD discloses speech recognition system 200 comprises front-end module 201, convolutional neural network (CNN) stack 202, first fully-connected layer 203, recurrent neural network (RNN) stack 204, second fully-connected layer 205, output neural network stack 206, and optional customization layer 207 (wherein a word embedding is predicted from audio input). In paragraph [0061]-WARD discloses FIG. 4 illustrates an example CNN stack architecture. Segments of frames are processed by one or more convolutional and pooling neural network layers that make up a convolutional neural network stack such as CNN stack 202. In paragraph [0062]-WARD discloses first fully-connected layer 203 receives features from CNN stack 202 and produces a second set of features. A fully-connected neural network is a neural network in which all nodes in a layer of the neural network are connected to all nodes of the subsequent layer of the neural network. In paragraph [0068]-WARD discloses Recurrent Neural Network (RNN) stack 204 receives these features from first fully-connected stack 203 and produces a third set of features) comprises fully connected layers that successively reduce a dimensionality of an input (Fig. 4. Paragraph [0065]-WARD discloses first fully-connected layer 203 may resize the output of CNN stack 202 for consumption by the subsequent stack. CNN stack 202 may produce a high dimensioned output based on the number of feature maps used and the frequency context of the output. 
The first fully-connected layer 203 may reduce the dimension of this output to reduce the number of parameters subsequent stack need to process. In paragraph [0074]-WARD discloses a second fully-connected stack 205 receives the output features from RNN stack 204 and produces a word embedding. Second fully-connected stack 205 reduces the dimensionality of the output of RNN stack 204 to something more concise. Second fully-connected stack 205 produces a word embedding of significantly reduced dimension compared to the output of RNN stack 204. This word embedding contains information related to the word predicted for a given time frame, and also information regarding words around the predicted word). Please also read paragraph [0057-0058]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG and in further view of LOU and in further view of STANLEY of having a system comprising: at least one processor; memory storing instructions to be executed by the at least one processor, the instructions for: in a machine learning pipeline stored in the memory and executed by the at least one processor: determining, by a first pretrained dense neural sub-network of the machine learning pipeline, with the teachings of WARD having extracting hierarchical features from the digital image by applying convolutional operations in parallel to capture the hierarchical features at different scales; passing feature maps through fully connected layers. Wherein having ZENG’s system having extracting hierarchical features from the digital image by applying convolutional operations in parallel to capture the hierarchical features at different scales; passing feature maps through fully connected layers. The motivation behind the modification would have been to obtain a system that improves data processing and predictions using neural network models, since both ZENG and WARD concern embeddings and neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while WARD provides systems and methods that improve word prediction using embeddings and neural network models. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and WARD et al. (US 20200035219 A1), Abstract and Paragraph [0005-0009]. Regarding claim 13, ZENG in view of ZHANG and in further view of LUO explicitly teach the system of claim 9, ZENG in view of ZHANG fail to explicitly teach wherein the second pretrained dense neural sub-network comprises fully connected layers that successively reduce a dimensionality of an input. However, WARD explicitly teaches wherein the second pretrained dense neural sub-network (Fig. 4. Paragraph [0042]-WARD discloses speech recognition system 200 comprises front-end module 201, convolutional neural network (CNN) stack 202, first fully-connected layer 203, recurrent neural network (RNN) stack 204, second fully-connected layer 205, output neural network stack 206, and optional customization layer 207 (wherein a word embedding is predicted from audio input). In paragraph [0061]-WARD discloses FIG. 4 illustrates an example CNN stack architecture. 
Segments of frames are processed by one or more convolutional and pooling neural network layers that make up a convolutional neural network stack such as CNN stack 202. In paragraph [0062]-WARD discloses first fully-connected layer 203 receives features from CNN stack 202 and produces a second set of features. A fully-connected neural network is a neural network in which all nodes in a layer of the neural network are connected to all nodes of the subsequent layer of the neural network. In paragraph [0068]-WARD discloses Recurrent Neural Network (RNN) stack 204 receives these features from first fully-connected stack 203 and produces a third set of features) comprises fully connected layers that successively reduce a dimensionality of an input (Fig. 4. Paragraph [0065]-WARD discloses first fully-connected layer 203 may resize the output of CNN stack 202 for consumption by the subsequent stack. CNN stack 202 may produce a high dimensioned output based on the number of feature maps used and the frequency context of the output. The first fully-connected layer 203 may reduce the dimension of this output to reduce the number of parameters subsequent stack need to process. In paragraph [0074]-WARD discloses a second fully-connected stack 205 receives the output features from RNN stack 204 and produces a word embedding. Second fully-connected stack 205 reduces the dimensionality of the output of RNN stack 204 to something more concise. Second fully-connected stack 205 produces a word embedding of significantly reduced dimension compared to the output of RNN stack 204. This word embedding contains information related to the word predicted for a given time frame, and also information regarding words around the predicted word). Please also read paragraph [0057-0058]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG and in further view of LOU and in further view of STANLEY of having a system comprising: at least one processor; memory storing instructions to be executed by the at least one processor, the instructions for: in a machine learning pipeline stored in the memory and executed by the at least one processor: determining, by a first pretrained dense neural sub-network of the machine learning pipeline, with the teachings of WARD having extracting hierarchical features from the digital image by applying convolutional operations in parallel to capture the hierarchical features at different scales; passing feature maps through fully connected layers. Wherein having ZENG’s system having extracting hierarchical features from the digital image by applying convolutional operations in parallel to capture the hierarchical features at different scales; passing feature maps through fully connected layers. The motivation behind the modification would have been to obtain a system that improves data processing and predictions using neural network models, since both ZENG and WARD concern embeddings and neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while WARD provides systems and methods that improve word prediction using embeddings and neural network models. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and WARD et al. 
(US 20200035219 A1), Abstract and Paragraph [0005-0009]. Regarding claim 19, ZENG in view of ZHANG and in further view of LUO explicitly teach the non-transitory computer-readable medium of claim 16, ZENG in view of ZHANG fail to explicitly teach wherein the first pretrained dense neural sub-network comprises fully connected layers that successively reduce a dimensionality of an input. However, WARD explicitly teaches wherein the first pretrained dense neural sub-network (Fig. 4. Paragraph [0042]-WARD discloses speech recognition system 200 comprises front-end module 201, convolutional neural network (CNN) stack 202, first fully-connected layer 203, recurrent neural network (RNN) stack 204, second fully-connected layer 205, output neural network stack 206, and optional customization layer 207 (wherein a word embedding is predicted from audio input). In paragraph [0061]-WARD discloses FIG. 4 illustrates an example CNN stack architecture. Segments of frames are processed by one or more convolutional and pooling neural network layers that make up a convolutional neural network stack such as CNN stack 202. In paragraph [0062]-WARD discloses first fully-connected layer 203 receives features from CNN stack 202 and produces a second set of features. A fully-connected neural network is a neural network in which all nodes in a layer of the neural network are connected to all nodes of the subsequent layer of the neural network. In paragraph [0068]-WARD discloses Recurrent Neural Network (RNN) stack 204 receives these features from first fully-connected stack 203 and produces a third set of features) comprises fully connected layers that successively reduce a dimensionality of an input (Fig. 4. Paragraph [0065]-WARD discloses first fully-connected layer 203 may resize the output of CNN stack 202 for consumption by the subsequent stack. CNN stack 202 may produce a high dimensioned output based on the number of feature maps used and the frequency context of the output. The first fully-connected layer 203 may reduce the dimension of this output to reduce the number of parameters subsequent stack need to process. In paragraph [0074]-WARD discloses a second fully-connected stack 205 receives the output features from RNN stack 204 and produces a word embedding. Second fully-connected stack 205 reduces the dimensionality of the output of RNN stack 204 to something more concise. Second fully-connected stack 205 produces a word embedding of significantly reduced dimension compared to the output of RNN stack 204. This word embedding contains information related to the word predicted for a given time frame, and also information regarding words around the predicted word). Please also read paragraph [0057-0058]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG and in further view of LUO of having a non-transitory computer-readable medium storing instructions which, when executed by at least one programmable electronic device, cause the at least one programmable electronic device to perform operations comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, with the teachings of WARD having determining a numerical score of a multimodal content using a third dense neural sub-network. 
Wherein having ZENG’s non-transitory computer-readable medium having determining a numerical score of a multimodal content using a third dense neural sub-network, the reduced dimensionality dense image embedding and the reduced dimensionality contextual text embedding. The motivation behind the modification would have been to obtain a non-transitory computer-readable medium that improves data processing and predictions using neural network models, since both ZENG and WARD concern embeddings and neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while WARD provides systems and methods that improve word prediction using embeddings and neural network models. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and WARD et al. (US 20200035219 A1), Abstract and Paragraph [0005-0009]. Regarding claim 20, ZENG in view of ZHANG and in further view of LUO explicitly teach the non-transitory computer-readable medium of claim 16, ZENG in view of ZHANG fail to explicitly teach wherein the second pretrained dense neural sub-network comprises fully connected layers that successively reduce a dimensionality of an input. However, WARD explicitly teaches wherein the second pretrained dense neural sub-network (Fig. 4. Paragraph [0042]-WARD discloses speech recognition system 200 comprises front-end module 201, convolutional neural network (CNN) stack 202, first fully-connected layer 203, recurrent neural network (RNN) stack 204, second fully-connected layer 205, output neural network stack 206, and optional customization layer 207 (wherein a word embedding is predicted from audio input). In paragraph [0061]-WARD discloses FIG. 4 illustrates an example CNN stack architecture. Segments of frames are processed by one or more convolutional and pooling neural network layers that make up a convolutional neural network stack such as CNN stack 202. In paragraph [0062]-WARD discloses first fully-connected layer 203 receives features from CNN stack 202 and produces a second set of features. A fully-connected neural network is a neural network in which all nodes in a layer of the neural network are connected to all nodes of the subsequent layer of the neural network. In paragraph [0068]-WARD discloses Recurrent Neural Network (RNN) stack 204 receives these features from first fully-connected stack 203 and produces a third set of features) comprises fully connected layers that successively reduce a dimensionality of an input (Fig. 4. Paragraph [0065]-WARD discloses first fully-connected layer 203 may resize the output of CNN stack 202 for consumption by the subsequent stack. CNN stack 202 may produce a high dimensioned output based on the number of feature maps used and the frequency context of the output. The first fully-connected layer 203 may reduce the dimension of this output to reduce the number of parameters subsequent stack need to process. In paragraph [0074]-WARD discloses a second fully-connected stack 205 receives the output features from RNN stack 204 and produces a word embedding. Second fully-connected stack 205 reduces the dimensionality of the output of RNN stack 204 to something more concise. Second fully-connected stack 205 produces a word embedding of significantly reduced dimension compared to the output of RNN stack 204. 
This word embedding contains information related to the word predicted for a given time frame, and also information regarding words around the predicted word). Please also read paragraph [0057-0058 and 0106]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG and in further view of LUO of having a non-transitory computer-readable medium storing instructions which, when executed by at least one programmable electronic device, cause the at least one programmable electronic device to perform operations comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, with the teachings of WARD having determining a numerical score of a multimodal content using a third dense neural sub-network. Wherein having ZENG’s non-transitory computer-readable medium having determining a numerical score of a multimodal content using a third dense neural sub-network, the reduced dimensionality dense image embedding and the reduced dimensionality contextual text embedding. The motivation behind the modification would have been to obtain a non-transitory computer-readable medium that improves data processing and predictions using neural network models, since both ZENG and WARD concern embeddings and neural networks. Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while WARD provides systems and methods that improve word prediction using embeddings and neural network models. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and WARD et al. (US 20200035219 A1), Abstract and Paragraph [0005-0009]. Claims 7 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over ZENG et al. (US 20230351115 A1), hereinafter referenced as ZENG in view of ZHANG et al. (US 20220237480 A1), hereinafter referenced as ZHANG and in further view of LUO et al. (US 20240184835 A1), hereinafter referenced as LUO and in further view of MURIQI (US 20240046318 A1), hereinafter referenced as MURIQI. Regarding claim 7, ZENG in view of ZHANG and in further view of LUO explicitly teach the method of claim 1, ZENG in view of ZHANG fail to explicitly teach further comprising: causing the multimodal content to be presented to a social media feed based on a ranking of the multimodal content However, MURIQI explicitly teaches further comprising: causing the multimodal content (Paragraph [0087]-MURIQI discloses the social network record may comprise a history of user interaction with the content, further comprising debiting the account associated with the user for user interaction with the content. The method may further comprise receiving a subjective assessment or comment, wherein the subjective assessment or comment is linked to the social network record, and crediting or debiting the account associated with the user for the receipt of the subjective assessment or comment. The method may further comprise crediting or debiting the account associated with the user for the subjective assessment or comment, based on interaction of other users with the subjective assessment or comment. 
The method may further comprise crediting the account associated with the proprietor of the social network for the for at least one of the presentation of the communication and the action predicated on the communication. The method may further comprise crediting at least one of the account associated with the user, the account associated a proprietor of the content, and the account of a proprietor of the social network user for a presentation of the communication to the user. The method may further comprise verifying a presentation of the communication to the user. The method may further comprise capturing images of the user with a camera during the presentation of the communication; and verifying presentation of the communication to the user based on the captured images)to be presented to a social media feed based on a ranking of the multimodal content (Paragraph [0087]-MURIQI discloses the method may further comprise providing an automated recommender; generating the proposal, referral, or recommendation of content with the automated recommender; and selecting or ranking the content for presentation to the user. The method may further comprise communicating with a generative pre-trained transformer comprising a large language model, which processes social network records and generates the proposal, referral, or recommendation of the content. Further in paragraph [0166]-MURIQI discloses an algorithm is provided to optimize advertising. The inputs are high-dimensional vectors and are sparse. Therefore, transforming them into low-dimensional dense representations is crucial to reduce dimensional complexity. The embedding layer reduces dimensional complexity and contextualizes the vectors. All outputs of embedding vectors are concatenated and fed into the following fully connected layers, except for users' behavior sequences. Users' behavior sequence features are fed into the individual user interest extractor layer. Segment interests are used to improve model performance for predicting advertisement click-through rates as well as individual user interests). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention was made to combine the teachings of ZENG in view of ZHANG and in further view of LUO of having a method comprising: determining a reduced dimensionality dense image embedding from a dense image embedding using a first pretrained dense neural sub-network, the dense image embedding encapsulating features of a digital image associated with a multimodal content, with the teachings of MURIQI having determining a numerical score of a multimodal content using a third dense neural sub-network. Wherein having ZENG’s method having determining a numerical score of a multimodal content using a third dense neural sub-network, the reduced dimensionality dense image embedding and the reduced dimensionality contextual text embedding. The motivation behind the modification would have been to obtain a method that improves data processing and predictions using transformer models, since both ZENG and MURIQI concern systems and methods for processing data to generate predictions/recommendations using transformer neural networks. 
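MURIQI's paragraph [0166], quoted above, describes a familiar click-through-rate pattern: sparse, high-dimensional categorical inputs are mapped to low-dimensional dense embeddings, the embeddings are concatenated, and fully connected layers score the content for ranking. The sketch below is a minimal PyTorch-style illustration of that pattern only; the field names, vocabulary sizes, and layer widths are hypothetical, and it omits MURIQI's separate user-interest extractor for behavior sequences.

import torch
import torch.nn as nn

class TinyCTRModel(nn.Module):
    # Sparse categorical IDs are embedded into low-dimensional dense vectors, the
    # embeddings are concatenated, and fully connected layers produce a click
    # probability that can be used to rank content.
    def __init__(self, vocab_sizes, embed_dim=16):
        super().__init__()
        self.tables = nn.ModuleDict({name: nn.Embedding(size, embed_dim)
                                     for name, size in vocab_sizes.items()})
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim * len(vocab_sizes), 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )
    def forward(self, ids):
        # Look up each field's embedding, concatenate, and score with the FC layers.
        concatenated = torch.cat([self.tables[name](ids[name]) for name in self.tables], dim=-1)
        return torch.sigmoid(self.mlp(concatenated)).squeeze(-1)

model = TinyCTRModel({"user_id": 10_000, "item_id": 50_000, "context": 500})
batch = {"user_id": torch.tensor([3, 8]), "item_id": torch.tensor([42, 7]), "context": torch.tensor([1, 9])}
print(model(batch))  # two scores in (0, 1), one per example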
Wherein ZENG provides systems and methods that encodes data and generates embeddings using transformer networks to improve entity extraction and modeling predictions, which improves overall document processing workflow, while MURIQI provides systems and methods uses embeddings and a multimodal transformer neural network to improve ranking processes, advertisement creation and prediction, user experiences and long-term business key performance indicators. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and MURIQI (US 20240046318 A1), Abstract and Paragraph [0087 and 0158-0169]. Regarding claim 14, ZENG in view of ZHANG and in further view of LUO explicitly teach the system of claim 9, ZENG in view of ZHANG fail to explicitly teach wherein the action taken comprises presenting the content in a social media feed of a user. However, MURIQI explicitly teaches wherein the action taken (Paragraph [0087]-MURIQI discloses the social network record may comprise a history of user interaction with the content, further comprising debiting the account associated with the user for user interaction with the content. The method may further comprise receiving a subjective assessment or comment, wherein the subjective assessment or comment is linked to the social network record, and crediting or debiting the account associated with the user for the receipt of the subjective assessment or comment. The method may further comprise crediting or debiting the account associated with the user for the subjective assessment or comment, based on interaction of other users with the subjective assessment or comment. The method may further comprise crediting the account associated with the proprietor of the social network for the for at least one of the presentation of the communication and the action predicated on the communication. The method may further comprise crediting at least one of the account associated with the user, the account associated a proprietor of the content, and the account of a proprietor of the social network user for a presentation of the communication to the user. The method may further comprise verifying a presentation of the communication to the user. The method may further comprise capturing images of the user with a camera during the presentation of the communication; and verifying presentation of the communication to the user based on the captured images) comprises presenting the content in a social media feed of a user (Paragraph [0087]-MURIQI discloses the method may further comprise providing an automated recommender; generating the proposal, referral, or recommendation of content with the automated recommender; and selecting or ranking the content for presentation to the user. The method may further comprise communicating with a generative pre-trained transformer comprising a large language model, which processes social network records and generates the proposal, referral, or recommendation of the content. Further in paragraph [0166]-MURIQI discloses an algorithm is provided to optimize advertising. The inputs are high-dimensional vectors and are sparse. Therefore, transforming them into low-dimensional dense representations is crucial to reduce dimensional complexity. The embedding layer reduces dimensional complexity and contextualizes the vectors. All outputs of embedding vectors are concatenated and fed into the following fully connected layers, except for users' behavior sequences. 
Users' behavior sequence features are fed into the individual user interest extractor layer. Segment interests are used to improve model performance for predicting advertisement click-through rates as well as individual user interests). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of ZENG in view of ZHANG and in further view of LUO of having a system comprising: at least one processor; memory storing instructions to be executed by the at least one processor, the instructions for: in a machine learning pipeline stored in the memory and executed by the at least one processor: determining, by a first pretrained dense neural sub-network of the machine learning pipeline, with the teachings of MURIQI having wherein the action taken comprises presenting the content in a social media feed of a user. ZENG's system would thereby take an action comprising presenting the content in a social media feed of a user. The motivation behind the modification would have been to obtain a system that improves data processing and predictions using transformer models, since both ZENG and MURIQI concern systems and methods for processing data to generate predictions/recommendations using transformer neural networks. ZENG provides systems and methods that encode data and generate embeddings using transformer networks to improve entity extraction and modeling predictions, improving the overall document processing workflow, while MURIQI provides systems and methods that use embeddings and a multimodal transformer neural network to improve ranking processes, advertisement creation and prediction, user experiences, and long-term business key performance indicators. Please see ZENG et al. (US 20230351115 A1), Abstract and Paragraph [0035-0038] and MURIQI (US 20240046318 A1), Abstract and Paragraph [0087 and 0158-0169].

Conclusion

Listed below is the prior art made of record and not relied upon that is considered pertinent to applicant's disclosure.

Tennenholtz (US 20250111157 A1) - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for analyzing embedding spaces using large language models. In one aspect, a method performed by one or more computers for analyzing a target embedding space using a neural network configured to perform a set of machine learning tasks is described. The method includes: obtaining, for each of one or more entities, a respective domain embedding representing the entity in the target embedding space; receiving a text prompt including a sequence of input tokens describing a particular machine learning task in the set to be performed on the one or more entities; preparing, for the neural network, an input sequence including each input token in the text prompt and each domain embedding; and processing the input sequence, using the neural network, to generate a sequence of output tokens describing a result of the particular machine learning task. Please note the provisional filing date of 09/28/2023. Please see Fig. 1-2. Abstract.

Liu (US 20240330310 A1) - A system is provided for reranking. The system comprises a user device and one or more servers.
The system is configured to receive a plurality of candidate lists, rerank the plurality of candidate lists based on page-level information and a format of a recommendation page, generate recommendation results based on the reranked lists, and send the recommendation results to the user device. Each candidate list comprises a plurality of candidate items. The page-level information comprises interactions between the candidate items in each candidate list and between different candidate lists among the plurality of candidate lists. The reranking comprises using the format of the recommendation page to determine pairwise item influences between candidate item pairs among the candidate items in the candidate lists. The user device is configured to display the recommendation page with the recommendation results from the one or more servers. Please see Fig. 1-3. Abstract.

HUANG (US 20240282107 A1) - The present technology pertains to a multi-modal transformer model that is designed and trained to perform cross-modal tasks such as image-text matching, wherein the model is further refined with data for the particular downstream use case of the model. More specifically, the present technology can refine the underlying model with labeled examples derived from a dataset of text-image pairs that ultimately achieved a desired interaction in the proper context. For example, in the use case of advertising applications in an App store, the present technology can refine the underlying model with examples of images used to advertise applications in the App store where the respective invitational content was clicked or converted. Please see Fig. 1-8. Abstract.

Kharbanda (US 20240403362 A1) - A multimodal search system using a video query is described. The system can receive video data captured by a camera of a user device. The video data can have a sequence of image frames. Additionally, the system can receive audio data associated with the video data captured by the user device. Moreover, the system can process, using one or more machine-learned models, the sequence of image frames to generate video embeddings related to the sequence of the image frames. The video embeddings can have a plurality of image embeddings associated with the sequence of image frames. Furthermore, the system can determine one or more video results based on the video embeddings and the audio data. Subsequently, the system can transmit, to the user device, the one or more video results. Please see Fig. 1-3 and 5-6. Abstract.

MASCHMEYER (US 20250165125 A1) - A method and apparatus is provided to allow a user to explore an n-dimensional embedding space using a recommender system, including a navigational UI. A set of n-dimensional embeddings from an n-dimensional embedding space may be transformed into a set of lower dimensional embeddings, based on a dimensionality reduction. The set of lower dimensional embeddings may be processed to generate a configuration of spaced items, and a signal may be transmitted to cause a display of a remote user device to output the navigational user interface (UI) having a plurality of selectable items according to the configuration of spaced items, the plurality of selectable items corresponding to lower dimensional embeddings of the set of lower dimensional embeddings. The disclosed method and apparatus may enable improved user interaction with an e-commerce website while browsing through dense product spaces. Please see Fig. 1-4 and 6. Abstract.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Aaron Bonansinga, whose telephone number is (703) 756-5380. The examiner can normally be reached on Monday-Friday, 9:00 a.m. - 6:00 p.m. ET. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Chineyere Wills-Burns, can be reached by phone at (571) 272-9752. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AARON TIMOTHY BONANSINGA/
Examiner, Art Unit 2673

/CHINEYERE WILLS-BURNS/
Supervisory Patent Examiner, Art Unit 2673

Prosecution Timeline

Dec 18, 2023
Application Filed
Feb 19, 2026
Non-Final Rejection — §103
Mar 30, 2026
Interview Requested
Apr 08, 2026
Applicant Interview (Telephonic)
Apr 10, 2026
Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12555249
METHOD, SYSTEM, AND NON-TRANSITORY COMPUTER READABLE STORAGE MEDIUM FOR SUPPORTING VIRTUAL GOLF SIMULATION
2y 5m to grant · Granted Feb 17, 2026
Patent 12548171
INFORMATION PROCESSING APPARATUS, METHOD AND MEDIUM
2y 5m to grant · Granted Feb 10, 2026
Patent 12541822
METHOD AND APPARATUS OF PROCESSING IMAGE, COMPUTING DEVICE, AND MEDIUM
2y 5m to grant · Granted Feb 03, 2026
Patent 12505503
IMAGE ENHANCEMENT
2y 5m to grant · Granted Dec 23, 2025
Patent 12482106
METHOD AND ELECTRONIC DEVICE FOR SEGMENTING OBJECTS IN SCENE
2y 5m to grant · Granted Nov 25, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2
Expected OA Rounds
76%
Grant Probability
99%
With Interview (+33.3%)
2y 11m
Median Time to Grant
Low
PTA Risk
Based on 25 resolved cases by this examiner. Grant probability derived from career allow rate.
