Prosecution Insights
Last updated: April 19, 2026
Application No. 17/649,091

VIDEO RECOMMENDER SYSTEM BY KNOWLEDGE BASED MULTI-MODAL GRAPH NEURAL NETWORKS

Final Rejection (§101, §103)

Filed: Jan 27, 2022
Examiner: YI, HYUNGJUN B
Art Unit: 2146
Tech Center: 2100 — Computer Architecture & Software
Assignee: Adobe Inc.
OA Round: 2 (Final)

Grant Probability: 18% (At Risk)
OA Rounds: 3-4
To Grant: 4y 7m
With Interview: 49%

Examiner Intelligence

Career Allow Rate: 18% (grants only 18% of cases; 3 granted / 17 resolved; -37.4% vs TC avg)
Interview Lift: +31.7% among resolved cases with interview
Avg Prosecution: 4y 7m typical timeline; 39 currently pending
Total Applications: 56 across all art units

Statute-Specific Performance

§101: 26.3% (-13.7% vs TC avg)
§103: 53.9% (+13.9% vs TC avg)
§102: 12.9% (-27.1% vs TC avg)
§112: 4.7% (-35.3% vs TC avg)

Tech Center averages are estimates. Based on career data from 17 resolved cases.

Office Action

Rejections: §101, §103
DETAILED ACTION

This action is responsive to the claims filed on 09/04/2025. Claims 1-20 are pending for examination. This action is Final.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments regarding 35 USC § 103 with respect to the claims have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Applicant's arguments regarding 35 USC § 101 with respect to the claims (Remarks, pages 7-9) have been considered but the examiner respectfully disagrees. Applicant's arguments are not persuasive because, as amended, the claims still recite an abstract idea under Step 2A, Prong One in the form of (i) mental processes (e.g., generating and evaluating recommendation information such as relationships and similarity-based comparisons/recommendations that can reasonably be performed in the human mind or with pen and paper) and (ii) mathematical concepts (e.g., generating feature embeddings and attention-based computations using vectors/matrices, including query/key/value operations and an encoding matrix representing the knowledge graph), which are expressly described in the Specification as mathematical operations on real-number representations/matrices (see Spec. [0062]–[0071], e.g., matrix-based feature descriptions, dot products, similarity scores/cosine similarity, and loss functions). Further, under Step 2A, Prong Two, the additional limitations merely apply these abstract ideas in the context of item recommendation using generic computing/ML functionality and do not integrate the exception into a practical application (i.e., the claim does not recite a specific technical solution or a tangible technological improvement beyond performing the recited mathematical/algorithmic processing to output a recommendation). Applicant's asserted "improvements" (e.g., scalability/cold-start/sparse data) appear in the specification as intended benefits, but are not reflected as a concrete improvement or technical effect in the claim language itself. Accordingly, the claims are not "significantly more" than the judicial exception, and the rejection under 35 U.S.C. § 101 is maintained.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Statutory Categories

Claims 1-10 are directed to a method. Claims 11-13 are directed to a method. Claims 14-20 are directed to an apparatus.

Independent Claims - Claims 1 and 11

Step 2A Prong 1: Does the claim recite an abstract idea, law of nature, or natural phenomenon? Yes.
Independent claims 1 and 11 recite limitations that are abstract ideas in the form of mental processes. Claim 1 recites:

A method for item recommendation, comprising:

generating a knowledge graph based on the input, wherein the knowledge graph comprises relationship information between a node representing the user and a plurality of nodes corresponding to a plurality of content items including the first content item; (a mental process of generation that can reasonably be performed in the human mind or with aid of pen and paper)

generating a first feature embedding representing the user and a second feature embedding representing a second content item of the plurality of content items using a multi-modal graph encoder based on the knowledge graph, (a mental process of generation using a graph encoder stated at a high level of generality which can reasonably be performed in the human mind or with aid of pen and paper) wherein the second feature embedding is generated (a mental process of generation that can reasonably be performed in the human mind or with aid of pen and paper)

comparing the first feature embedding to the second feature embedding to obtain a similarity score between the user and the second content item; (a mental process of comparison that can reasonably be performed in the human mind or with aid of pen and paper)

and recommending the second content item for the user based on the similarity score. (a mental process of evaluation that can reasonably be performed in the human mind or with aid of pen and paper)

This claim recites the following additional elements for the purposes of Step 2A Prong Two:

receiving input indicating a relationship between a user and a first content item; (receiving information stated at a high level of generality such that it is considered as mere data gathering or outputting and is considered insignificant extra-solution activity under MPEP 2106.05(g))

using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism; (using first- and second-modality query and key without any additional modification to the knowledge graph itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f))

wherein the attention mechanism takes an encoding matrix representing the knowledge graph as input; (receiving information stated at a high level of generality such that it is considered as mere data gathering or outputting and is considered insignificant extra-solution activity under MPEP 2106.05(g))

The additional limitations fail Step 2A Prong Two of the 101 analysis because they do not transform the claim into a practical application. These limitations are too abstract or lack a technical improvement that would make the concept practically useful. Without clear utility or integration into a specific field, the claim does not relate to any particular application. It does not meet the requirements of Step 2A Prong Two, as it fails to make the concept meaningfully applicable in practice. Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is "directed" to an abstract idea.
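As a concrete reference point for the limitation at issue, the cross-modal arrangement the claims recite (one modality supplying the query, the other supplying the key and value) can be sketched in a few lines of generic scaled dot-product attention. This is an illustrative sketch over placeholder tensors, not the Applicant's implementation:

```python
# Illustrative sketch only (not the Applicant's implementation): scaled
# dot-product attention with the query drawn from a first modality and the
# key/value drawn from a second modality, as the claim language describes.
import torch
import torch.nn.functional as F

def cross_modal_attention(first_modality, second_modality):
    """first_modality: (n, d) features supplying the query;
    second_modality: (m, d) features supplying the key and value."""
    d = first_modality.shape[-1]
    q = first_modality                    # query vectors from the first modality
    k = v = second_modality               # key and value from the second modality
    scores = q @ k.T / d ** 0.5           # scaled dot-product attention scores
    return F.softmax(scores, dim=-1) @ v  # attention-pooled features

emb = cross_modal_attention(torch.randn(4, 16), torch.randn(6, 16))
print(emb.shape)  # torch.Size([4, 16])
```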
This claim recites the following additional elements for the purposes of Step 2B analysis:

receiving input indicating a relationship between a user and a first content item; (receiving information stated at a high level of generality such that it is considered as mere data gathering or outputting and is considered insignificant extra-solution activity under MPEP 2106.05(g); for Step 2B, it should be noted that the courts have recognized receiving or transmitting data over a network as well-understood, routine, and conventional activity, see MPEP 2106.05(d)(ii) and Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information))

using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism; (using first- and second-modality query and key without any additional modification to the knowledge graph itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f))

wherein the attention mechanism takes an encoding matrix representing the knowledge graph as input; (receiving information stated at a high level of generality such that it is considered as mere data gathering or outputting and is considered insignificant extra-solution activity under MPEP 2106.05(g); for Step 2B, it should be noted that the courts have recognized receiving or transmitting data over a network as well-understood, routine, and conventional activity, see MPEP 2106.05(d)(ii) and Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information))

The claim also fails Step 2B of the analysis because the additional limitations do not amount to significantly more than the abstract idea itself. The additional limitations do not enhance the claim in a way that would move it beyond its abstract ideas, as they minimally elaborate on the core concept without adding any inventive or technical substance. Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.

Claim 11 has limitations substantially identical to claim 1; therefore a similar analysis applies. Claim 11 also has the following additional limitations for consideration:

computing a loss function based on the first feature embedding and the second feature embedding; (this limitation recites mathematical operations comprising mathematical functions, formulas, or algorithms under Step 2A Prong One)

and updating parameters of the multi-modal graph encoder based on the loss function. (a mental process of evaluation which can reasonably be performed in the human mind or with aid of pen and paper under Step 2A Prong One)

Dependents of Claims 1 and 11

The remaining dependent claims corresponding to independent claims 1 and 11 do not recite additional elements, whether considered individually or in combination, that are sufficient to integrate the judicial exception into a practical application or amount to significantly more than the judicial exception. The analysis is shown below:
Claim 2 recites the further limitation of: The method of claim 1, further comprising: generating a spatial encoding matrix representing a number of hops between nodes of the knowledge graph, wherein the encoding matrix comprises the spatial encoding matrix. (a determination which can reasonably be performed in the human mind or with aid of pen and paper, see MPEP 2106.04(a)(2)(III))

Claim 3 recites the further limitation of: The method of claim 1, further comprising: generating an edge encoding matrix representing edge types between nodes of the knowledge graph, wherein the encoding matrix comprises the edge encoding matrix. (a determination which can reasonably be performed in the human mind or with aid of pen and paper, see MPEP 2106.04(a)(2)(III))

Claim 4 recites the further limitation of: The method of claim 3, wherein: the edge types represent types of interactions between users and content items. (this is merely additional information for an aforementioned abstract idea of a determination which can reasonably be performed in the human mind or with aid of pen and paper, see MPEP 2106.04(a)(2)(III))

Claim 5 recites the further limitation of: The method of claim 1, further comprising: generating a visual embedding for the second content item, wherein the query vector is generated based on the visual embedding. (a mental process of generation which can reasonably be performed in the human mind or with aid of pen and paper, see MPEP 2106.04(a)(2)(III))

Claim 6 recites the further limitation of: The method of claim 1, further comprising: generating a textual embedding based on the second content item, wherein the key vector is generated based on the textual embedding. (a mental process of generation which can reasonably be performed in the human mind or with aid of pen and paper, see MPEP 2106.04(a)(2)(III))

Claim 7 recites the further limitation of: The method of claim 1, further comprising: combining the query vector of the first modality and the key vector of the second modality to obtain a combined vector; and weighting the combined vector based on the knowledge graph to obtain a weighted vector. (a mental process of generation which can reasonably be performed in the human mind or with aid of pen and paper, see MPEP 2106.04(a)(2)(III))

Claim 8 recites the further limitation of: The method of claim 7, further comprising: combining the weighted vector with the value vector of the second modality, wherein the second feature embedding is based on the combination of the weighted vector and the value vector. (a mental process of evaluation which can reasonably be performed in the human mind or with aid of pen and paper, see MPEP 2106.04(a)(2)(III))

Claim 9 recites the further limitation of: The method of claim 1, further comprising: generating a first symmetric feature embedding using the first modality as the query vector and the second modality as the key vector; (a mental process of generation which can reasonably be performed in the human mind or with aid of pen and paper, see MPEP 2106.04(a)(2)(III)) and generating a second symmetric feature embedding using the second modality as a symmetric query vector and the first modality as a symmetric key vector, wherein the second feature embedding is based on the first symmetric feature embedding and the second symmetric feature embedding. (a mental process of generation which can reasonably be performed in the human mind or with aid of pen and paper, see MPEP 2106.04(a)(2)(III))

Claim 10 recites the further limitation of: The method of claim 1, further comprising: computing a cosine similarity, wherein the similarity score is based on the cosine similarity. (this limitation recites mathematical operations comprising mathematical functions, formulas, or algorithms under Step 2A Prong One)
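For orientation, the similarity-score and recommendation steps recited in claims 1 and 10 amount to standard vector comparisons. Below is a minimal sketch with random placeholder embeddings; the inner-product and cosine variants shown are generic, not drawn from the application:

```python
# Minimal sketch, assuming random placeholder embeddings: score items against
# a user by inner product (or cosine similarity, per claim 10) and recommend
# the top-k highest-scoring items.
import torch
import torch.nn.functional as F

user_emb = torch.randn(16)         # first feature embedding (the user)
item_embs = torch.randn(100, 16)   # second feature embeddings (content items)

dot_scores = item_embs @ user_emb                                   # inner-product scores
cos_scores = F.cosine_similarity(item_embs, user_emb.unsqueeze(0))  # cosine variant

top_scores, top_items = dot_scores.topk(5)  # recommend the highest-scoring items
print(top_items.tolist())
```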
Claim 12 recites the further limitation of: The method of claim 11, further comprising: identifying a first content item and a second content item; (a mental process of identification which can reasonably be performed in the human mind or with aid of pen and paper) determining that a user prefers the first content item over the second content item using similarity scores for the first content item and the second content item; (a mental process of comparison which can reasonably be performed in the human mind or with aid of pen and paper) and computing a ranking loss based on the determination, wherein the loss function includes the ranking loss. (this limitation recites mathematical operations comprising mathematical functions, formulas, or algorithms under Step 2A Prong One)

Claim 13 recites the further limitation of: The method of claim 11, further comprising: identifying a positive sample pair comprising a user and a first content item that is preferred by the user; (a mental process of identification which can reasonably be performed in the human mind or with aid of pen and paper) identifying a negative sample pair comprising the user and a second content item that is not preferred by the user; (a mental process of identification which can reasonably be performed in the human mind or with aid of pen and paper) and computing a contrastive learning loss based on the positive sample pair and the negative sample pair, wherein the loss function includes the contrastive learning loss. (this limitation recites mathematical operations comprising mathematical functions, formulas, or algorithms under Step 2A Prong One)
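The two losses recited in claims 12 and 13 likewise have compact, standard formulations. The sketch below uses a BPR-style pairwise ranking loss and an InfoNCE-style contrastive loss as stand-ins; neither is taken from the application's specification:

```python
# Sketch of the two losses claims 12-13 recite, under assumed standard
# formulations: a BPR-style pairwise ranking loss and an InfoNCE-style
# contrastive loss over a positive pair and several negative pairs.
import torch
import torch.nn.functional as F

def ranking_loss(pos_scores, neg_scores):
    # Preferred items should outscore less-preferred items (BPR-style).
    return -F.logsigmoid(pos_scores - neg_scores).mean()

def contrastive_loss(user, pos_item, neg_items, temp=0.1):
    # Positive pair (user, preferred item) vs. negative pairs (user, others).
    pos = F.cosine_similarity(user, pos_item, dim=-1) / temp
    neg = F.cosine_similarity(user.unsqueeze(0), neg_items, dim=-1) / temp
    logits = torch.cat([pos.unsqueeze(0), neg]).unsqueeze(0)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

print(ranking_loss(torch.tensor([2.0]), torch.tensor([0.5])))
print(contrastive_loss(torch.randn(16), torch.randn(16), torch.randn(5, 16)))
```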
Independent Claim - Claim 14

Step 2A Prong 1: Does the claim recite an abstract idea, law of nature, or natural phenomenon? Yes. Independent claim 14 recites limitations that are abstract ideas in the form of mental processes. Claim 14 recites:

generate a knowledge graph representing relationships between a plurality of users and a plurality of content items; (a mental process of generation that can reasonably be performed in the human mind or with aid of pen and paper)

to generate a first feature embedding representing a user and a second feature embedding representing a content item of the plurality of content items based on the knowledge graph, (a mental process of generation using a graph encoder stated at a high level of generality which can reasonably be performed in the human mind or with aid of pen and paper) wherein the second feature embedding is generated using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism; (a mental process of generation using an attention mechanism stated at a high level of generality which can reasonably be performed in the human mind or with aid of pen and paper)

compare the first feature embedding to the second feature embedding to obtain similarity scores between the users and the content items and to identify recommended content items to the users based on the similarity scores. (a mental process of comparison or identification that can reasonably be performed in the human mind or with aid of pen and paper)

This claim recites the following additional elements for the purposes of Step 2A Prong Two:

An apparatus for item recommendation, comprising: a knowledge graph component configured to (executing on a knowledge graph without any additional modification to the knowledge graph itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f))

and a recommendation component configured to (a component without any additional modification to the structure itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f))

a multi-modal graph encoder configured (executing on a multi-modal graph encoder without any additional modification to the knowledge graph itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f))

wherein the attention mechanism takes an encoding matrix representing the knowledge graph as input; (receiving information stated at a high level of generality such that it is considered as mere data gathering or outputting and is considered insignificant extra-solution activity under MPEP 2106.05(g))

The additional limitations fail Step 2A Prong Two of the 101 analysis because they do not transform the claim into a practical application. These limitations are too abstract or lack a technical improvement that would make the concept practically useful. Without clear utility or integration into a specific field, the claim does not relate to any particular application. It does not meet the requirements of Step 2A Prong Two, as it fails to make the concept meaningfully applicable in practice. Since the claim as a whole, looking at the additional elements individually and in combination, does not contain any other additional elements that are indicative of integration into a practical application, the claim is "directed" to an abstract idea.
This claim recites the following additional elements for the purposes of Step 2B analysis:

An apparatus for item recommendation, comprising: a knowledge graph component configured to (executing on a knowledge graph without any additional modification to the knowledge graph itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f))

and a recommendation component configured to (a component without any additional modification to the structure itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f))

a multi-modal graph encoder configured (executing on a multi-modal graph encoder without any additional modification to the knowledge graph itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f))

wherein the attention mechanism takes an encoding matrix representing the knowledge graph as input; (receiving information stated at a high level of generality such that it is considered as mere data gathering or outputting and is considered insignificant extra-solution activity under MPEP 2106.05(g); furthermore, it should be noted that the courts have recognized receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362 (utilizing an intermediary computer to forward information), as well-understood, routine, and conventional activity)

The claim also fails Step 2B of the analysis because the additional limitations do not amount to significantly more than the abstract idea itself. The additional limitations do not enhance the claim in a way that would move it beyond its abstract ideas, as they minimally elaborate on the core concept without adding any inventive or technical substance. Considering the additional elements individually and in combination, and the claim as a whole, the additional elements do not provide significantly more than the abstract idea. Therefore, the claim is not patent eligible.

Dependents of Claim 14

The remaining dependent claims corresponding to independent claim 14 do not recite additional elements, whether considered individually or in combination, that are sufficient to integrate the judicial exception into a practical application or amount to significantly more than the judicial exception. The analysis is shown below:

Claim 15 recites the further limitation of: The apparatus of claim 14, further comprising: an image encoder configured to (for the purposes of Step 2A Prong Two and Step 2B analysis: executing on an image encoder without any additional modification to the structure of the encoder itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f)) generate a visual embedding for the content items, wherein the query vector is generated based on the visual embedding. (this limitation recites mathematical operations comprising mathematical functions, formulas, or algorithms under Step 2A Prong One)

Claim 16 recites the further limitation of: The apparatus of claim 14, further comprising: a text encoder configured to (for the purposes of Step 2A Prong Two and Step 2B analysis: executing on a text encoder without any additional modification to the structure of the encoder itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f)) generate a textual embedding based on the content items, wherein the key vector is generated based on the textual embedding. (this limitation recites mathematical operations comprising mathematical functions, formulas, or algorithms under Step 2A Prong One)
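Claims 15 and 16 wire modality-specific encoders into the attention inputs. A hedged sketch follows, with linear layers standing in for whatever image and text encoders the application actually uses:

```python
# Hedged sketch of claims 15-16: an image encoder whose visual embedding feeds
# the query, and a text encoder whose textual embedding feeds the key. The
# linear layers are placeholders, not the application's actual encoders.
import torch
import torch.nn as nn

d = 16
image_encoder = nn.Linear(2048, d)  # stand-in: pooled image features -> visual embedding
text_encoder = nn.Linear(768, d)    # stand-in: pooled text features -> textual embedding
w_q, w_k = nn.Linear(d, d), nn.Linear(d, d)

visual = image_encoder(torch.randn(4, 2048))   # visual embedding (claim 15)
textual = text_encoder(torch.randn(4, 768))    # textual embedding (claim 16)
query, key = w_q(visual), w_k(textual)         # query from visual, key from textual
print(query.shape, key.shape)
```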
Claim 17 recites the further limitation of: The apparatus of claim 14, further comprising: a training component configured to (for the purposes of Step 2A Prong Two and Step 2B analysis: a component without any additional modification to the structure itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f)) compute a loss function based on the first feature embedding and the second feature embedding and to update parameters of the multi-modal graph encoder based on the loss function. (this limitation recites mathematical operations comprising mathematical functions, formulas, or algorithms under Step 2A Prong One)
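The claim 17 training component reduces to a conventional gradient step: compute a loss from the two embeddings, then update the encoder's parameters. A minimal sketch, with a placeholder module standing in for the multi-modal graph encoder and an illustrative loss:

```python
# Minimal sketch of claim 17's training component: compute a loss based on the
# two feature embeddings and update encoder parameters. The linear layer is a
# placeholder for the multi-modal graph encoder; the loss is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(16, 16)  # stand-in for the multi-modal graph encoder
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

user_emb = encoder(torch.randn(8, 16))   # first feature embedding
item_emb = encoder(torch.randn(8, 16))   # second feature embedding
loss = -F.logsigmoid((user_emb * item_emb).sum(-1)).mean()

opt.zero_grad()
loss.backward()
opt.step()  # parameters updated based on the loss function
print(float(loss))
```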
Claim 18 recites the further limitation of: The apparatus of claim 14, wherein: the multi-modal graph encoder comprises a symmetric bimodal attention network. (for the purposes of Step 2A Prong Two and Step 2B analysis: the use of a bimodal attention network without any additional modification to the structure itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f))

Claim 19 recites the further limitation of: The apparatus of claim 18, wherein: the symmetric bimodal attention network comprises a first multi-head attention module corresponding to the first modality and a second multi-head attention module corresponding to the second modality. (for the purposes of Step 2A Prong Two and Step 2B analysis: a component without any additional modification to the structure itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f))

Claim 20 recites the further limitation of: The apparatus of claim 14, further comprising: a search component configured to (for the purposes of Step 2A Prong Two and Step 2B analysis: a component without any additional modification to the structure itself is interpreted as mere instructions to apply the exception under MPEP 2106.05(f)) search for a plurality of candidate content items for recommendation to a user. (a mental process of comparison or identification that can reasonably be performed in the human mind or with aid of pen and paper)

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or non-obviousness.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 4-9, 11-12, 14-17, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Sun et al. (Sun, R., Cao, X., Zhao, Y., Wan, J., Zhou, K., Zhang, F., ... & Zheng, K. (2020, October). Multi-modal knowledge graphs for recommender systems. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (pp. 1405-1414)), hereinafter referred to as Sun, in view of Lu et al. (Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32), hereafter referred to as Lu, and further in view of Ying et al. (Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., ... & Liu, T. Y. (2021). Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34, 28877-28888), hereafter referred to as Ying.

Claim 1: Sun teaches the following limitations:

A method for item recommendation, comprising: receiving input indicating a relationship between a user and a first content item; (Sun, page 1408, col. 1, paragraph 2, "Input Collaborative knowledge graph that includes the user-item bipartite graph and multi-modal knowledge graph." Input indicating a relationship between a user and an item is received as a knowledge graph (KG).)

generating a knowledge graph based on the input, wherein the knowledge graph comprises relationship information between a node representing the user and a plurality of nodes corresponding to a plurality of content items including the first content item; (Sun, page 1408, col. 1, paragraph 1, "Then, CKG incorporates the user-item bipartite graph into the knowledge graph, in which each user's behavior is represented as a triplet, (e_u, Interact, e_i). Interact = 1 means there exists an additional interact relation between e_u and e_i. Based on the item-entity alignment set, the user-item graph can be seamlessly integrated with knowledge graph as a unified graph." A collaborative knowledge graph (CKG) is generated that merges the user-item edges with the KG edges from the user-item knowledge graph used as input.)
generating a first feature embedding representing the user and a second feature embedding representing a second content item of the plurality of content items using a multi-modal graph encoder based on the knowledge graph, (Sun, page 1410, col. 2, section 4.2, paragraph 1, "Similar to the knowledge graph embedding module, the recommendation module also uses MKGs attention layer to aggregate neighbor entity information. In order to retain the 1-n hop information, we follow the setup from [28] that retains the output of the candidate user and item from the l-th layer. The output of different layers represents the information of different hops. We hence adopt the layer-aggregation mechanism [31] to concatenate the representations at each step into a single vector, which can be found as follows: [equation image]" The MKGAT encoder produces the user embedding e*_u and item embedding e*_i directly from the knowledge graph. Hence, two feature embeddings of user and item are generated.)

comparing the first feature embedding to the second feature embedding to obtain a similarity score between the user and the second content item; (Sun, page 1411, col. 1, paragraph 1, "Finally, we conduct inner product of user and item representations by Equation 10, so as to predict their matching score: [equation image]" The inner product compares the two embeddings and produces ŷ(u,i), the similarity score.)

and recommending the second content item for the user based on the similarity score. (Sun, page 1408, col. 2, paragraph 2, "Recommendation Module[:] Taking knowledge graph embedding of entities (obtained by the knowledge graph embedding module) and a collaborative knowledge graph as input, the recommendation module also employ the MKGs entity encoder and MKGs attention layer to leverage corresponding neighbors to enrich the representation of users and items. Finally, the matching scores between users and items can be generated following traditional recommendation models." A recommendation of items is produced when a matching score (similarity score) is great enough.)

[Figure 2b of Lu]

Lu, in the same field of entity encoding, teaches the following limitation which the above prior art fails to teach:

wherein the second feature embedding is generated using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism; (Lu, page 4, paragraph 2, "We introduce a co-attentional transformer layer shown in Fig. 2b. Given intermediate visual and linguistic representations H_V^(i) and H_W^(j), the module computes query, key, and value matrices as in a standard transformer block. However, the keys and values from each modality are passed as input to the other modality's multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other – in effect performing image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream." ViLBERT's co-attentional transformer layer, as shown in Figure 2b above, generates a feature embedding, say H_V, using a first modality for a query vector of an attention mechanism (Q_V) and a second modality for a key vector and a value vector of the attention mechanism (K_W/V_W).)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Sun by incorporating the teachings of Lu (e.g., co-attention layer mechanisms). A motivation would have been the advantage of merging the vision and textual modalities at varying network depths. (Lu, page 2, figure 1, "Our ViLBERT model consists of two parallel streams for visual (green) and linguistic (purple) processing that interact through novel co-attentional transformer layers. This structure allows for variable depths for each modality and enables sparse interaction through co-attention.")

Ying, in the same field of entity encoding, teaches the following limitation which the above prior art fails to teach:

wherein the attention mechanism takes an encoding matrix representing the knowledge graph as input; (Ying, abstract, "Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model."; Ying, page 4, section 3.1.2, paragraph 2, "We assign each (feasible) output value a learnable scalar which will serve as a bias term in the self-attention module. Denote A_ij as the (i, j)-element of the Query-Key product matrix A, we have: [equation image]"; Ying, page 5, paragraph 2, "The proposed edge encoding incorporates edge features via a bias term to the attention module. Concretely, we modify the (i, j)-element of A in Eq. (3) further with the edge encoding c_ij as: [equation image]" Ying expressly teaches encoding graph structure (i.e., structural information of a graph) and injecting that structural encoding into the self-attention module as additive terms to the attention score matrix. Specifically, Ying defines attention scores element-wise as A_ij (the (i, j)-element of the Query-Key product matrix A) and adds b_φ(v_i,v_j) (a learnable scalar indexed by the node-pair structural relation φ(v_i,v_j), e.g., a distance-based structural relation) into A_ij, i.e., directly into the attention computation. Ying further adds c_ij as an edge-encoding term, again injected into A_ij as a bias term "to the attention module," thereby encoding edge/relationship information into the attention mechanism. Under BRI, a "knowledge graph" is a graph of entities/nodes and relations/edges; Ying's structural encodings φ(v_i,v_j) and edge encodings c_ij are graph-structure encodings defined over node pairs (i, j) and used to form/add to the attention score matrix A. Accordingly, these pairwise encodings constitute (or are readily representable as) an encoding matrix representing the (knowledge) graph, which is taken as an input to the attention mechanism via bias terms in the attention score matrix used to compute attention weights.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have further modified the teachings of Sun and Lu by incorporating the teachings of Ying, because Ying expressly teaches that effective use of Transformer attention on graph-structured data requires "encoding the structural information of a graph into the model" and does so by injecting graph structural encodings into the self-attention computation as additive terms to the attention score matrix A_ij (e.g., adding b_φ(v_i,v_j) and c_ij as bias terms in the attention module). (Ying, abstract, "necessity of effectively encoding the structural information of a graph into the model," and the attention score definition A_ij including graph-encoding bias terms.) A POSITA would have been motivated to incorporate Ying's structural-encoding-into-attention techniques into the attention mechanisms used in Sun's KG-based encoder (as modified by Lu's co-attention) in order to improve how attention accounts for graph structure (e.g., hop/distance relationships and edge/relation heterogeneity) and thereby predictably obtain an attention mechanism that takes a graph-encoding matrix (i.e., matrix-form node-pair structural/edge encodings) as an input.
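Mechanically, the Graphormer-style technique the examiner describes adds graph-derived bias matrices to the query-key score matrix before the softmax. Below is a sketch under assumed shapes; the distance bias b and edge term c are random stand-ins for Ying's learned encodings:

```python
# Sketch of the cited technique as characterized above: matrices encoding
# graph structure (a bias indexed by shortest-path distance plus an
# edge-encoding term) are added to the query-key product matrix A before
# softmax. Values are random stand-ins for Ying's learned encodings.
import torch
import torch.nn.functional as F

n, d, max_dist = 5, 16, 4
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
spd = torch.randint(0, max_dist, (n, n))  # shortest-path distances phi(v_i, v_j)
b = torch.randn(max_dist)                 # learnable scalar per distance value
c = torch.randn(n, n)                     # edge-encoding term c_ij

A = q @ k.T / d ** 0.5   # (i, j)-elements A_ij of the query-key product
A = A + b[spd] + c       # graph-encoding matrices injected as bias terms
out = F.softmax(A, dim=-1) @ v
print(out.shape)  # torch.Size([5, 16])
```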
[Figure 3 of Sun]

Claim 4: Sun, Lu, and Ying teach the limitations of claim 1. Sun further teaches:

The method of claim 3, wherein: the edge types represent types of interactions between users and content items. (Sun, page 1408, col. 1, paragraph 1, "Then, CKG incorporates the user-item bipartite graph into the knowledge graph, in which each user's behavior is represented as a triplet, (e_u, Interact, e_i). Interact = 1 means there exists an additional interact relation between e_u and e_i." An interact relation is an edge type which links user nodes to item nodes whenever a user engages with an item. As shown in Figure 3, user nodes have 'interact' edges to item nodes.)

Claim 5: Sun, Lu, and Ying teach the limitations of claim 1. Lu further teaches:

The method of claim 1, further comprising: generating a visual embedding for the second content item, wherein the query vector is generated based on the visual embedding. (Lu, page 4, paragraph 2, "We introduce a co-attentional transformer layer shown in Fig. 2b. Given intermediate visual and linguistic representations H_V^(i) and H_W^(j), the module computes query, key, and value matrices as in a standard transformer block. However, the keys and values from each modality are passed as input to the other modality's multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other – in effect performing image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream." ViLBERT's co-attentional transformer layer, as shown in Figure 2b above, generates a visual embedding H_V^(i) using a first modality for a query vector of an attention mechanism (Q_V) and a second modality for a key vector and a value vector of the attention mechanism (K_W/V_W).)

Claim 6: Sun, Lu, and Ying teach the limitations of claim 1. Lu further teaches:

The method of claim 1, further comprising: generating a textual embedding based on the second content item, wherein the key vector is generated based on the textual embedding. (Lu, page 4, paragraph 2, "We introduce a co-attentional transformer layer shown in Fig. 2b. Given intermediate visual and linguistic representations H_V^(i) and H_W^(j), the module computes query, key, and value matrices as in a standard transformer block. However, the keys and values from each modality are passed as input to the other modality's multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other – in effect performing image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream." Although the ViLBERT model uses the visual/image modality to provide the key vectors when generating the textual embedding, those visual key vectors are themselves computed based on earlier textual embeddings via the reciprocal co-attention mechanism. As a result, it is interpreted by the examiner that the visual key vectors used to generate the textual embedding are based on the textual embedding.)
Claim 7: Sun, Lu, and Ying teach the limitations of claim 1. Lu further teaches:

The method of claim 1, further comprising: combining the query vector of the first modality and the key vector of the second modality to obtain a combined vector; (Lu, page 4, paragraph 2, "We introduce a co-attentional transformer layer shown in Fig. 2b. Given intermediate visual and linguistic representations H_V^(i) and H_W^(j), the module computes query, key, and value matrices as in a standard transformer block. However, the keys and values from each modality are passed as input to the other modality's multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other – in effect performing image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream." The ViLBERT visual embedding H_V results from combining the query vector of the visual modality with the key vectors of the textual modality, forming a combined vector: a query vector Q_V of the first modality and a key vector K_W of the second modality are combined to obtain a combined representation vector H_V.)

Sun further teaches: and weighting the combined vector based on the knowledge graph to obtain a weighted vector. (Sun, page 1410, col. 1, paragraph 1, "e_agg is a representation vector that aggregates neighbor entities information, which is the linear combination of each triple representation and can be calculated in Equation 1. [equation image] where e(h, r, t) is the embedding of each triplet (h, r, t) and π(h, r, t) is the attention score on each triplet e(h, r, t). π(h, r, t) controls how much information being propagated from triplets e(h, r, t)." Sun teaches weighting such a vector using attention scores π(h, r, t) derived from knowledge graph triplets, which control how much influence each triplet has in generating a final representation. Thus, the Sun mechanism can be applied to H_V as the combined vector, producing a weighted vector based on the structure and semantics of the knowledge graph.)
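The aggregation Sun's Equation 1 describes, e_agg as an attention-weighted linear combination of triplet embeddings, can be sketched directly; the scores below are random placeholders for π(h, r, t):

```python
# Rough sketch of the neighbor aggregation described above: e_agg is an
# attention-weighted linear combination of triplet embeddings e(h, r, t).
# The attention scores are random placeholders for pi(h, r, t).
import torch
import torch.nn.functional as F

triplet_embs = torch.randn(6, 16)        # e(h, r, t) for six neighboring triplets
pi = F.softmax(torch.randn(6), dim=0)    # attention score per triplet
e_agg = (pi.unsqueeze(-1) * triplet_embs).sum(0)  # weighted linear combination
print(e_agg.shape)  # torch.Size([16])
```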
Claim 8: Sun, Lu, and Ying teach the limitations of claim 7. Lu further teaches:

The method of claim 7, further comprising: combining the weighted vector with the value vector of the second modality, wherein the second feature embedding is based on the combination of the weighted vector and the value vector. (Lu, page 4, paragraph 2, "We introduce a co-attentional transformer layer shown in Fig. 2b. Given intermediate visual and linguistic representations H_V^(i) and H_W^(j), the module computes query, key, and value matrices as in a standard transformer block. However, the keys and values from each modality are passed as input to the other modality's multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other – in effect performing image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream." Sun teaches that the item embedding (second feature embedding) is generated by combining (1) a weighted vector derived from attention over knowledge graph neighbors, where the attention score π(h, r, t) controls the influence of each triplet, and (2) the item's own modality-specific vector, such as image or text features (the value vector of the second modality). Lu, in turn, supports this structure by illustrating a general attention mechanism where attention-pooled features (e.g., H_V^(i+1)) are computed by conditioning on key and value vectors from another modality, showing how embeddings can result from the integration of value vectors across modalities. Together, these references teach the limitation that the second feature embedding is based on the combination of the weighted vector and the value vector of the second modality.)
Claim 9: Sun, Lu, and Ying teach the limitations of claim 7. Lu further teaches:

The method of claim 1, further comprising: generating a first symmetric feature embedding using the first modality as the query vector and the second modality as the key vector and generating a second symmetric feature embedding using the second modality as a symmetric query vector and the first modality as a symmetric key vector, (Lu, page 4, paragraph 2, "We introduce a co-attentional transformer layer shown in Fig. 2b. Given intermediate visual and linguistic representations H_V^(i) and H_W^(j), the module computes query, key, and value matrices as in a standard transformer block. However, the keys and values from each modality are passed as input to the other modality's multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other – in effect performing image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream." Lu teaches generating symmetric feature embeddings by applying a co-attention transformer layer in both directions: one using the first modality (e.g., visual) as the query and the second modality (e.g., textual) as the key, and vice versa for the other symmetric embedding.)

wherein the second feature embedding is based on the first symmetric feature embedding and the second symmetric feature embedding. (Sun, page 1410, col. 2, section 4.2, paragraph 1, "Similar to the knowledge graph embedding module, the recommendation module also uses MKGs attention layer to aggregate neighbor entity information. In order to retain the 1-n hop information, we follow the setup from [28] that retains the output of the candidate user and item from the l-th layer. The output of different layers represents the information of different hops. We hence adopt the layer-aggregation mechanism [31] to concatenate the representations at each step into a single vector, which can be found as follows: [equation image]" Sun teaches that the item embedding e*_i (second feature embedding) is constructed by aggregating multiple attention-based layer outputs as shown in the formula. While Sun does not explicitly implement symmetric query-key attention, Lu teaches how such symmetric embeddings can be generated via reciprocal co-attention transformer layers. A person of ordinary skill in the art would recognize combining these symmetric embeddings from Lu (i.e., H_V or H_W) under Sun's aggregation mechanism to form the final item feature embedding e*_i.)

The rationale for the combination of Sun with Lu is similar to that of claim 1 above.

Claim 11: This claim recites limitations that are substantially similar to claim 1, and as such a similar analysis applies. Claim 11 also has the following additional limitations for consideration, which Sun further teaches:

computing a loss function based on the first feature embedding and the second feature embedding; (Sun, page 1410, col. 2, paragraph 4, "Then, we optimize our recommendation prediction loss by using the Bayesian Personalized Ranking (BPR) loss [20]. Specifically, we assume that the observed records, which indicate more user preferences, should be assigned higher prediction scores than unobserved ones. The BPR loss can be constructed in Equation 11: [equation image]" A Bayesian Personalized Ranking (BPR) loss is computed based on the matching score ŷ(u,i), which is based on the first and second feature embeddings.)

and updating parameters of the multi-modal graph encoder based on the loss function. (Sun, page 1411, col. 1, paragraph 3, "We update the parameters in MKGs embedding module and recommendation module alternately. In particular, for a batch of randomly sampled (h, r, t, t′), we update the knowledge graph embeddings for all entities. Then we sample a batch of (u, i, j) randomly, retrieve their representations from knowledge graph embedding." Parameters are updated after a BPR loss is calculated.)

[Figure 7 of Sun]

Claim 12: Sun, Lu, and Ying teach the limitations of claim 11. Sun further teaches:

The method of claim 11, further comprising: identifying a first content item and a second content item; (Sun, page 1413, col. 2, paragraph 2, "To intuitively demonstrate the role of multi-modal entities in the MKGAT model, we give a case study by randomly selecting a user u from the Dianping dataset, and a relevant item." As shown in Figure 7 above, to demonstrate their model on an example test dataset, Sun uses the MKGAT model to identify first and second (content) item nodes.)

determining that a user prefers the first content item over the second content item using similarity scores for the first content item and the second content item; (Sun, page 1413, col. 2, paragraph 2, "Benefiting from the attention mechanism, we can calculate the relevance score (unnormalized) between the candidate items and the entity (or items and users). We can also observe the relevance scores between each entity and other entities. The higher the relevance score is, the model believes that the current entity has a greater effect on the model." A relevance score between each user-item relationship entity is computed to determine if a user should be recommended a first item over a second item.)
and computing a ranking loss based on the determination, wherein the loss function includes the ranking loss. (Lu, page 4, paragraph 2, "The training of knowledge graph embedding considers the relative order between valid triplets and broken ones, and encourages their discrimination through a pairwise ranking loss [equation image]" A prediction loss (a ranking loss) that is further optimized by the BPR loss (the loss function) downstream is a ranking loss as explicitly defined.)

Claim 14: Sun teaches the following limitations:

An apparatus for item recommendation, comprising: a knowledge graph component configured to generate a knowledge graph representing relationships between a plurality of users and a plurality of content items; (Sun, page 1408, col. 1, paragraph 1, "Then, CKG incorporates the user-item bipartite graph into the knowledge graph, in which each user's behavior is represented as a triplet, (e_u, Interact, e_i). Interact = 1 means there exists an additional interact relation between e_u and e_i. Based on the item-entity alignment set, the user-item graph can be seamlessly integrated with knowledge graph as a unified graph." A collaborative knowledge graph (CKG) is generated that merges the user-item edges with the KG edges from the user-item knowledge graph used as input.)

a multi-modal graph encoder configured to generate a first feature embedding representing a user and a second feature embedding representing a content item of the plurality of content items based on the knowledge graph, (Sun, page 1410, col. 2, section 4.2, paragraph 1, "Similar to the knowledge graph embedding module, the recommendation module also uses MKGs attention layer to aggregate neighbor entity information. In order to retain the 1-n hop information, we follow the setup from [28] that retains the output of the candidate user and item from the l-th layer. The output of different layers represents the information of different hops. We hence adopt the layer-aggregation mechanism [31] to concatenate the representations at each step into a single vector, which can be found as follows: [equation image]" The MKGAT encoder produces the user embedding e*_u and item embedding e*_i directly from the knowledge graph. Hence, two feature embeddings of user and item are generated.)

Lu, in the same field of entity encoding, teaches the following limitation which the above art fails to teach:

wherein the second feature embedding is generated using a first modality for a query vector of an attention mechanism and a second modality for a key vector and a value vector of the attention mechanism; (Lu, page 4, paragraph 2, "We introduce a co-attentional transformer layer shown in Fig. 2b. Given intermediate visual and linguistic representations H_V^(i) and H_W^(j), the module computes query, key, and value matrices as in a standard transformer block. However, the keys and values from each modality are passed as input to the other modality's multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other – in effect performing image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream." ViLBERT's co-attentional transformer layer, as shown in Figure 2b above, generates a feature embedding, say H_V, using a first modality for a query vector of an attention mechanism (Q_V) and a second modality for a key vector and a value vector of the attention mechanism (K_W/V_W).)
Sun further teaches: and a recommendation component configured to compare the first feature embedding to the second feature embedding to obtain similarity scores between the users and the content items and to identify recommended content items to the users based on the similarity scores. (Sun, page 1408, col. 2, paragraph 2, "Recommendation Module[:] Taking knowledge graph embedding of entities (obtained by the knowledge graph embedding module) and a collaborative knowledge graph as input, the recommendation module also employ the MKGs entity encoder and MKGs attention layer to leverage corresponding neighbors to enrich the representation of users and items. Finally, the matching scores between users and items can be generated following traditional recommendation models." A recommendation of items is produced when a matching score (similarity score) is great enough.)

The rationale for the combination of Sun with Lu is similar to that of claim 1 above.

Ying, in the same field of entity encoding, teaches the following limitation which the above prior art fails to teach:

wherein the attention mechanism takes an encoding matrix representing the knowledge graph as input; (Ying, abstract, "Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model."; Ying, page 4, section 3.1.2, paragraph 2, "We assign each (feasible) output value a learnable scalar which will serve as a bias term in the self-attention module. Denote A_ij as the (i, j)-element of the Query-Key product matrix A, we have: [equation image]"; Ying, page 5, paragraph 2, "The proposed edge encoding incorporates edge features via a bias term to the attention module. Concretely, we modify the (i, j)-element of A in Eq. (3) further with the edge encoding c_ij as: [equation image]" The analysis of Ying's teaching is the same as set forth for claim 1 above: Ying's structural encodings φ(v_i,v_j) and edge encodings c_ij are graph-structure encodings defined over node pairs (i, j) and used to form/add to the attention score matrix A, and accordingly constitute (or are readily representable as) an encoding matrix representing the (knowledge) graph that is taken as an input to the attention mechanism via bias terms in the attention score matrix used to compute attention weights.)
The rationale for the combination of Sun and Lu with Ying is the same as that set forth for claim 1 above.

Claims 15-16 are substantially similar to claims 5-6, and as such a similar analysis applies.

Claim 17: Sun, Lu, and Ying teach the limitations of claim 14. Sun further teaches:

The apparatus of claim 14, further comprising: a training component configured to compute a loss function based on the first feature embedding and the second feature embedding (Sun, page 1410, col. 2, paragraph 4, "Then, we optimize our recommendation prediction loss by using the Bayesian Personalized Ranking (BPR) loss [20]. Specifically, we assume that the observed records, which indicate more user preferences, should be assigned higher prediction scores than unobserved ones. The BPR loss can be constructed in Equation 11: [equation image]" A Bayesian Personalized Ranking (BPR) loss is computed based on the matching score ŷ(u,i), which is based on the first and second feature embeddings.)

and to update parameters of the multi-modal graph encoder based on the loss function. (Sun, page 1411, col. 1, paragraph 3, "We update the parameters in MKGs embedding module and recommendation module alternately. In particular, for a batch of randomly sampled (h, r, t, t′), we update the knowledge graph embeddings for all entities. Then we sample a batch of (u, i, j) randomly, retrieve their representations from knowledge graph embedding." Parameters are updated after a BPR loss is calculated.)

Claim 18: Sun, Lu, and Ying teach the limitations of claim 14. Lu further teaches:

The apparatus of claim 14, wherein: the multi-modal graph encoder comprises a symmetric bimodal attention network. (Lu, page 4, paragraph 2, "However, the keys and values from each modality are passed as input to the other modality's multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other – in effect performing image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream." It is interpreted by the examiner that the ViLBERT architecture comprises a symmetric bimodal attention network through its use of the co-attention transformer layer. The visual embeddings are conditioned on text, and the text embeddings are conditioned on vision.)
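Read this way, a "symmetric bimodal attention network" is two multi-head attention modules, one per modality, each querying with its own stream while taking keys and values from the other. A sketch using PyTorch's stock multi-head attention, with dimensions assumed for illustration:

```python
# Hedged sketch of the examiner's reading: two multi-head attention modules,
# one per modality, each querying with its own stream and attending over the
# other stream's keys/values. Dimensions are assumed for illustration.
import torch
import torch.nn as nn

d, heads = 16, 4
vis_attn = nn.MultiheadAttention(d, heads, batch_first=True)  # first modality
txt_attn = nn.MultiheadAttention(d, heads, batch_first=True)  # second modality

visual = torch.randn(1, 5, d)    # visual token stream
textual = torch.randn(1, 7, d)   # textual token stream
h_v, _ = vis_attn(visual, textual, textual)  # vision queries, text keys/values
h_w, _ = txt_attn(textual, visual, visual)   # text queries, vision keys/values
print(h_v.shape, h_w.shape)
```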
Claim 18: Sun, Lu, and Ying teach the limitations of claim 14. Lu further teaches: The apparatus of claim 14, wherein: the multi-modal graph encoder comprises a symmetric bimodal attention network. (Lu, page 4, paragraph 2, "However, the keys and values from each modality are passed as input to the other modality's multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other – in effect performing image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream."; the examiner interprets the ViLBERT architecture as comprising a symmetric bimodal attention network through its use of the co-attention transformer layer: the visual embeddings are conditioned on text, and the text embeddings are conditioned on vision.)

Claim 19: Sun, Lu, and Ying teach the limitations of claim 18. Lu further teaches: The apparatus of claim 18, wherein: the symmetric bimodal attention network comprises a first multi-head attention module corresponding to the first modality and a second multi-head attention module corresponding to the second modality. (Lu, page 4, figure 2, "Figure 2: We introduce a novel co-attention mechanism based on the transformer architecture. By exchanging key-value pairs in multi-headed attention, this structure enables vision-attended language features to be incorporated into visual representations (and vice versa)."; the co-attention transformer layer comprises a first and a second multi-head attention module, one for each modality (visual and textual).)

Claim 20: Sun, Lu, and Ying teach the limitations of claim 14. Sun further teaches: The apparatus of claim 14, further comprising: a search component configured to search for a plurality of candidate content items for recommendation to a user. (Sun, page 1413, col. 2, paragraph 2, "Benefiting from the attention mechanism, we can calculate the relevance score (unnormalized) between the candidate items and the entity (or items and users). We can also observe the relevance scores between each entity and other entities. The higher the relevance score is, the model believes that the current entity has a greater effect on the model. We visualize the relevance score in Figure 7."; the relevance score calculation is what allows the MKGAT model to search candidate content items for recommendation to a user.)
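For illustration only: a single-head sketch of the key/value exchange the rejection reads onto Lu's co-attention layer; each stream queries the other stream's keys and values. ViLBERT itself uses multi-headed attention blocks, so this simplification, and all names (co_attention, the projection dictionary P), are hypothetical.

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention (single head)."""
    d = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d)
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V

def co_attention(H_img, H_txt, P):
    """Symmetric bimodal step: swap keys/values across the two streams,
    yielding image-conditioned-on-text and text-conditioned-on-image features."""
    Q_i, K_i, V_i = H_img @ P["img_q"], H_img @ P["img_k"], H_img @ P["img_v"]
    Q_t, K_t, V_t = H_txt @ P["txt_q"], H_txt @ P["txt_k"], H_txt @ P["txt_v"]
    img_out = attention(Q_i, K_t, V_t)  # visual stream attends over language
    txt_out = attention(Q_t, K_i, V_i)  # linguistic stream attends over vision
    return img_out, txt_out

# Illustrative only: toy feature matrices for the two modalities.
rng = np.random.default_rng(2)
d = 8
P = {k: rng.normal(size=(d, d)) for k in
     ["img_q", "img_k", "img_v", "txt_q", "txt_k", "txt_v"]}
img, txt = co_attention(rng.normal(size=(6, d)), rng.normal(size=(10, d)), P)
print(img.shape, txt.shape)  # (6, 8) (10, 8)
```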
Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Sun in view of Lu, and further in view of Ying and Zhang et al. (Zhang, J., Zhang, H., Xia, C., & Sun, L. (2020). Graph-BERT: Only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140), hereafter referred to as Zhang.

Claim 2: Sun, Lu, and Ying teach the limitations of claim 1. Zhang, in the same field of entity encoding, teaches the following limitation, which the above references fail to teach: The method of claim 1, further comprising: generating a spatial encoding matrix representing a number of hops between nodes of the knowledge graph, wherein the encoding matrix comprises the spatial encoding matrix. (Zhang, page 4, col. 2, paragraph 3, "Hop based Relative Distance Embedding[:] The hop based relative distance embedding can be treated as a balance between the absolute role embedding (for global information) and intimacy based relative positional embedding (for local information). Formally, for node vj ∈ Vi in the subgraph gi, we can denote its relative distance in hops to vi in the original input graph as H(vj; vi), which can be used to define its embedding vector as $e_j^{(hop)} = \text{Position-Embed}\big(H(v_j; v_i)\big)$"; Zhang explicitly encodes the number of hops between nodes as embedding vectors using this hop-based relative distance method. These spatial embeddings can be integrated into Sun's model as input features, complementing its multi-hop aggregation mechanism.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Sun, Lu, and Ying by incorporating the teachings of Zhang (i.e., hop-based spatial encoding matrices). A motivation for doing so would have been to enhance the representation by balancing global and local information. (Zhang, page 5, col. 2, paragraph 3, "The hop based relative distance embedding can be treated as a balance between the absolute role embedding (for global information) and intimacy based relative positional embedding (for local information).")

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Sun in view of Lu, and further in view of Ying and Lin et al. (Lin, Y., Liu, Z., Sun, M., Liu, Y., & Zhu, X. (2015, February). Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 29, No. 1)), hereafter referred to as Lin.

Claim 3: Sun, Lu, and Ying teach the limitations of claim 1. Lin further teaches the following, which Sun, Lu, and Ying fail to teach: The method of claim 1, further comprising: generating an edge encoding matrix representing edge types between nodes of the knowledge graph, wherein the encoding matrix comprises the edge encoding matrix. (Lin, page 2183, col. 1, paragraph 5, "For each relation r, we set a projection matrix Mr ∈ R^{k×d}, which may projects entities from entity space to relation space. With the mapping matrix, we define the projected vectors of entities as $h_r = h M_r, \quad t_r = t M_r$"; the matrix $M_r$ is generated for each relation type (i.e., edge type) within the knowledge graph embedding and is used to transform entity embeddings when that relation is involved. Each matrix represents the semantics of a specific edge type.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Sun, Lu, and Ying by incorporating the teachings of Lin (i.e., an edge encoding matrix specifying edge types). A motivation for doing so would have been to more accurately model the semantics of different relation types by projecting entities into relation-specific spaces, thereby improving the representational embedding. (Lin, page 2181, col. 2, paragraph 3, "To address this issue, we propose a new method, which models entities and relations in distinct spaces, i.e., entity space and multiple relation spaces (i.e., relation-specific entity spaces), and performs translation in the corresponding relation space, hence named as TransR.")
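For illustration only: a minimal sketch of the TransR-style relation-specific projection Lin describes, where a per-relation matrix $M_r$ maps entity embeddings into the relation space before a translation-based score. The scoring convention (negated squared distance, so higher means more plausible) and all identifiers are illustrative, not from the record.

```python
import numpy as np

def transr_score(h, t, r, M_r):
    """TransR-style scoring: project head/tail entity embeddings into the
    relation space with the per-relation (per-edge-type) matrix M_r, then
    measure how well h_r + r approximates t_r."""
    h_r, t_r = h @ M_r, t @ M_r                 # h_r = h M_r,  t_r = t M_r
    return -np.linalg.norm(h_r + r - t_r) ** 2  # higher = more plausible triple

# Illustrative only: random entity/relation vectors and one projection matrix.
rng = np.random.default_rng(3)
k, d = 16, 8                           # entity dim k, relation dim d
h, t = rng.normal(size=k), rng.normal(size=k)
M_r = rng.normal(size=(k, d))          # one matrix per relation (edge) type
r = rng.normal(size=d)
print(transr_score(h, t, r, M_r))
```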
Claims 10 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Sun in view of Lu, and further in view of Ying and Radford et al. (Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR), hereafter referred to as Radford.

Claim 10: Sun, Lu, and Ying teach the limitations of claim 1. Radford, in the same field of entity encoding, teaches the following limitation, which the above art fails to teach: The method of claim 1, further comprising: computing a cosine similarity, wherein the similarity score is based on the cosine similarity. (Radford, page 3, col. 1, paragraph 3, "Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings"; Radford teaches computing a cosine similarity between multi-modal embeddings. This cosine similarity corresponds to the similarity score (matching score) in Sun, which compares user and item embeddings via an inner product.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Sun, Lu, and Ying by incorporating the teachings of Radford (i.e., cosine-similarity scoring). A motivation for doing so would have been to improve the alignment and comparison across multimodal embeddings. (Radford, page 3, col. 1, paragraph 1, "CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings"; a cosine-based approach to embedding comparison would have been viable for comparing across different modalities.)

Claim 13: Sun, Lu, and Ying teach the limitations of claim 11. Radford further teaches: The method of claim 11, further comprising: identifying a positive sample pair comprising a user and a first content item that is preferred by the user; (Radford, page 3, col. 1, paragraph 2, "Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred."; the CLIP architecture is trained to identify actual pairs (positive sample pairs) of image and text from the training dataset.) identifying a negative sample pair comprising the user and a second content item that is not preferred by the user; (Radford, page 3, col. 1, paragraph 3, "To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings."; CLIP is trained to minimize the cosine similarity of incorrect (negative) sample pairs.) and computing a contrastive learning loss based on the positive sample pair and the negative sample pair, wherein the loss function includes the contrastive learning loss. (Radford, page 3, col. 1, paragraph 3, "We optimize a symmetric cross entropy loss over these similarity scores."; Radford's symmetric cross-entropy loss accounts for both positive and negative sample pairs, computing similarity scores for all possible pairs within a batch. By maximizing similarity for positive pairs and minimizing it for negative pairs, this loss functions as a contrastive learning loss.)

The rationale for the combination of Sun with Radford is similar to that of claim 10 above.
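For illustration only: a sketch of a Radford-style batch-contrastive objective transplanted to the user/item setting of claim 13, assuming matched (user, preferred-item) pairs form the diagonal of a cosine-similarity matrix and all off-diagonal pairings act as negatives. The temperature value and all names (clip_style_loss) are hypothetical.

```python
import numpy as np

def clip_style_loss(user_emb, item_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity scores: positive pairs
    sit on the diagonal of the user-item similarity matrix; off-diagonal
    entries are the negative pairings being pushed down."""
    U = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    I = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    logits = (U @ I.T) / temperature        # cosine similarities, scaled
    labels = np.arange(len(U))              # pair i matches item i

    def xent(L):
        # cross-entropy of each row against its diagonal label
        L = L - L.max(axis=1, keepdims=True)
        logp = L - np.log(np.exp(L).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return (xent(logits) + xent(logits.T)) / 2.0

# Illustrative only: a toy batch of 8 matched user/item embedding pairs.
rng = np.random.default_rng(4)
print(clip_style_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```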
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:

Wei, Y., Wang, X., Nie, L., He, X., Hong, R., & Chua, T. S. (2019, October). MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia (pp. 1437-1445).

Tan, H., & Bansal, M. (2019). LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.

Mialon, G., Chen, D., Selosse, M., & Mairal, J. (2021). GraphiT: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667.

Zhang, J., Zhang, H., Xia, C., & Sun, L. (2020). Graph-BERT: Only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140.

THIS ACTION IS MADE FINAL.

Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HYUNGJUN B YI, whose telephone number is (703) 756-4799. The examiner can normally be reached M-F, 9-5. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Usmaan Saeed, can be reached at (571) 272-4046. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/H.B.Y./ Examiner, Art Unit 2146
/USMAAN SAEED/ Supervisory Patent Examiner, Art Unit 2146

Prosecution Timeline

Jan 27, 2022
Application Filed
May 29, 2025
Non-Final Rejection — §101, §103
Aug 25, 2025
Interview Requested
Sep 02, 2025
Examiner Interview Summary
Sep 02, 2025
Applicant Interview (Telephonic)
Sep 04, 2025
Response Filed
Mar 05, 2026
Final Rejection — §101, §103
Apr 07, 2026
Interview Requested

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12536429
INTELLIGENTLY MODIFYING DIGITAL CALENDARS UTILIZING A GRAPH NEURAL NETWORK AND REINFORCEMENT LEARNING
2y 5m to grant Granted Jan 27, 2026
Study what changed to get past this examiner. Based on the 1 most recent grant.


Prosecution Projections

3-4
Expected OA Rounds
18%
Grant Probability
49%
With Interview (+31.7%)
4y 7m
Median Time to Grant
Moderate
PTA Risk
Based on 17 resolved cases by this examiner. Grant probability derived from career allow rate.
