DETAILED ACTION
Status of Claims
Claim(s) 1-5, 7-11, 14-15, and 17-20 are pending and are examined herein.
Claim(s) 1, 14, and 17-18 have been Amended. Claim(s) 6, 12-13, and 16 are Cancelled.
Claim(s) 1-5, 7-11, 14-15, and 17-20 remain rejected under 35 U.S.C. § 103.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/01/2025 has been entered.
Response to Amendment
The amendment filed on December 01, 2025 has been entered. Claims 1-5, 7-11, 14-15, and 17-20 are pending in the application. Applicant’s amendments to the claims have overcome the 35 U.S.C. § 101 rejection set forth in the Final Office Action mailed on May 30, 2025. Applicant’s amendments to the claims have been fully considered and are addressed in the rejections below.
Response to Arguments
Applicant's arguments with respect to the rejection under 35 U.S.C. § 101 filed on 12/01/2025 (Pp. 8-11 of the remarks) have been fully considered and are persuasive.
In particular, Applicant argues that claim 14 recites a specific computer-implemented multi-headed recommendation model that uses an early fusion stage and a late fusion stage with an attention mechanism to jointly learn representations from implicit feedback and user reviews, and that the recommendation model is jointly trained with an additional output branch to reduce popularity bias. When considered as a whole, these limitations integrate the judicial exception into a practical application by providing an improvement to the operation of a recommendation system.
Applicant's arguments regarding the rejection under 35 U.S.C. § 103 filed on 12/01/2025 (see remarks Pp. 11-14) have been fully considered but are not persuasive for the reasons set forth below.
Applicant argues that the cited art fails to teach the amended limitations of claim 14 requiring an early fusion stage and a late fusion stage that both use preference information to fuse review data. Applicant contends that the cited art does not teach the features as described in the specification (paragraphs [0041]-[0043], Fig. 5). Applicant further asserts that similar limitations were included in previously presented claims 16-17 and that the cited references do not disclose early and late fusion stages that both incorporate item preference information for fusing review data as claimed.
The Examiner respectfully disagrees with Applicant's arguments. Under the broadest reasonable interpretation (BRI), limitations described in the specification but not recited in the claims cannot be read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). The specification details that Applicant cites are not recited in claim 14. Thus, the amended claim limitations are interpreted under the BRI in light of the specification.
Applicant further acknowledges that previously presented dependent claims (e.g., 17-18) recite specific operational limitations, including “concatenation,” a “review attention module,” and “summarized review feature vectors.” These limitations were not incorporated into amended independent claim 14. Accordingly, the scope of the claim is not the same as previously presented, and claim 14 is interpreted under the BRI as currently drafted.
The Examiner disagrees with Applicant’s assertion that Liu does not teach the amended claim limitations. Under the BRI in light of the specification, Liu does teach the early fusion stage as currently amended. The early fusion stage incorporates preference feature vectors that weight review feature vectors to obtain a latent representation. Liu discloses this limitation in the context of the fusion gated layer and its operation described in Equation 16 (Section 4.2, p. 6), where p_u (the preference vector from user-item interactions) explicitly participates in computing the gate mechanism, which weights review features to obtain a fused user review feature representation (h_u). Under the BRI, the gating mechanism constitutes “attention weights” because it computes importance scores and applies selective weighting, which is functionally equivalent to the claimed limitation.
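For illustration only, and not as a characterization of Liu's actual implementation, a gating mechanism of the kind described above can be sketched as follows; all tensor and weight names (p_u, r_u, W_p, W_r) are hypothetical placeholders, not Liu's notation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def early_fusion_gate(p_u, r_u, W_p, W_r, b):
    """A gate computed with the preference vector p_u weights the review
    features r_u to yield a fused representation h_u. All names are
    illustrative placeholders."""
    g = sigmoid(W_p @ p_u + W_r @ r_u + b)  # importance scores in (0, 1)
    return g * r_u                          # selective weighting of review features

rng = np.random.default_rng(0)
d = 4
p_u = rng.standard_normal(d)                # preference feature vector
r_u = rng.standard_normal(d)                # review feature vector
W_p = rng.standard_normal((d, d))
W_r = rng.standard_normal((d, d))
h_u = early_fusion_gate(p_u, r_u, W_p, W_r, np.zeros(d))
```

Because the gate values lie in (0, 1), the operation computes importance scores and selectively scales the review features, which is the sense in which a gating mechanism functions as attention weighting.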
Additionally, the Examiner respectfully disagrees with Applicant’s assertion that Liu does not teach the late fusion stage incorporating preference information. Under the BRI in light of the specification, Liu does teach the late fusion stage as currently amended. The late fusion gate combines the user review latent representation with preference feature vectors based on cross-modal attention weights. Liu discloses this limitation in the context of the dynamic interaction component of Equations 17-19 (Section 4.2), which combines the user review latent representation (s_u^ff, the filtered version of s_u^f from early fusion) with the preference feature vector (p_u) to obtain a fused feature representation (u_u). Liu further discloses cross-modal attention, where the gate mechanism is computed from both user features and item features (modalities) and applies selective weighting: z denotes the user-item features, and [,] combines the two features by concatenating them in the hidden layer.
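By way of illustration only (the names and weight shapes are hypothetical assumptions, not drawn from Liu), a late-fusion gate computed over the concatenation of the two modalities can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def late_fusion(s_uff, p_u, W_z, b_z):
    """A gate computed over the concatenation [,] of the two feature sets;
    a convex combination then yields the fused representation u_u.
    Weight names are hypothetical."""
    z = np.concatenate([s_uff, p_u])    # concatenated user-item features
    g = sigmoid(W_z @ z + b_z)          # gate / attention weights
    return g * s_uff + (1.0 - g) * p_u  # fused feature representation

rng = np.random.default_rng(1)
d = 3
s = rng.standard_normal(d)              # filtered review representation
p = rng.standard_normal(d)              # preference feature vector
W_z = rng.standard_normal((d, 2 * d))
u_u = late_fusion(s, p, W_z, np.zeros(d))
```

In this sketch the gate is conditioned on both modalities at once, so each fused component falls between the corresponding review and preference components.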
Furthermore, the Examiner notes that Liu teaches sequential operations (a fusion gated layer followed by a filter gated layer) to fuse and concatenate review vectors and user-item vectors. Liu describes the following:
“Specifically, we design a gated network to dynamically fuse the extracted features and select the features that are most relevant to user preferences. .... We first propose a gated network to dynamically fuse reviews with interaction data and adjust the importance of features. Specifically, the fusion gated layer adaptively integrates and filters out the irrelevant features. A user-based filter gated layer is then used to distinguish the importance of the merged features and select relevant features. .... The dynamic interaction component is presented in Fig. 1,... the gated network dynamically fuses reviews with interaction data and adjusts the relevance weights of features. ... AHAG utilizes an adaptive gated network to fuse and control what information should be propagated for forecasting user preferences, thereby avoiding the noise that may be introduced by fusion features. This innovation also allows the model to distinguish user preferences for different item features. ... Fig. 2: A network framework for learning the features of review information based on hierarchical attention. The framework consists of three attention layers: (1) Position self-attention layer: modeling the long-term interaction between words. (2) High-order layer: learning the relevant semantic information. (3) Co-attention layer: modeling the dynamic interaction of the user and item review features.”
Moreover, Applicant’s argument regarding concatenation being “inadequate” relies on specification language (para. [0042] & [0043]) that is not recited in the claim. As noted above, the claim language is given its plain meaning under the BRI as currently drafted. Claim 14 recites late fusion “based on cross-model attention weights,” without specifying any structural details or implementation of the “cross-model attention weights” other than combining the user review latent representation with the preference feature vectors to determine a set of fused feature representations, which Liu teaches through gating operations in Equations 17-19.
Accordingly, for at least the above reasons, Applicant’s arguments are not persuasive and the rejection is maintained. With respect to amended dependent claims 17-18 and all other claims, the Examiner refers to the updated prior art rejection under 35 U.S.C. § 103 for more details.
Claim Objections
Claim(s) 15 and 17-20 are objected to for the following reasons:
Claim 15 is objected to for reciting “The recommendation model of claim 14, ...” whereas independent claim 14 is directed to a non-transitory computer-readable medium. Claim 15 should be amended to recite “The non-transitory computer-readable medium of claim 14, ...” to maintain proper claim dependency and clearly refer to the statutory subject matter of its parent claim.
Claim 17 is objected to for improper dependency, as it depends from cancelled claim 16. Claim 17 must be amended to depend from a currently pending claim. The Examiner interprets claim 17 as depending from parent claim 14.
Claims 17-20 are objected to for the same reason as claim 15, as they recite dependency from “the recommendation model of claim 14” while claim 14 is directed to a non-transitory computer-readable medium.
Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim(s) 1-5 and 9-11 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (NPL: "Noise Contrastive Estimation for Autoencoding-based One-Class Collaborative Filtering." (2020)) in view of Ma et al. (NPL: "Gated Attentive-Autoencoder for Content-Aware Recommendation." (2019)), further in view of Sun et al. (NPL: "Exploiting review embedding and user attention for item recommendation." (February 2020)), and further in view of Tavernier et al. (Pub. No.: US 10706450 B1).
Regarding Amended Claim 1,
Zhou discloses the following:
A recommendation model stored on a non-transitory computer readable storage medium, the recommendation model associated with a set of parameters, and configured to receive a set of features associated with a user and a content item and to output a likelihood that the user will interact with the content item, wherein the recommendation model is manufactured by a process comprising: obtaining a training dataset that comprises: (Zhou, [P. 2, Col.1, Section: 1] “In summary, this paper makes two major key contributions: • We propose two variants of a novel two-headed Autoencoder based recommendation method called NS-AutoRec and NCE-AutoRec that respectively employ negative sampling and an alternative closed-form NCE solution to learn high quality “de-popularized” embeddings without sacrificing end task loss. Empirical results show that NCE-AutoRec is more efficient and scalable than NS-AutoRec. • We demonstrate that both NS-AutoRec and NCE-AutoRec are competitive with many state-of-the-art recommendation methods on four large-scale publicly available datasets while providing more personalization (i.e., focusing on recommending less popular and thus more nuanced items).” [P. 2, Section: 2.2] “Before proceeding, we define our notation as follows: • R: The positive-only feedback matrix for m users and n items in the shape of m × n. Each entry r_{i,j} is either 1, which indicates there is an interaction between user i and item j, or 0 otherwise (no interaction). We use r_{:,j} to represent all user feedback for item j ∈ {1 · · · n}. • R̂ and R̃: Respective reconstructions of the original matrix R based on Mean Squared Error (MSE) and Negative Sampling (NS) (in NS-AutoRec) or closed-form NCE solution (in NCE-AutoRec). Both have the same matrix shape as R. • R∗: The matrix that optimizes the NCE objective. It has the same shape as the matrix R. We will describe the NCE objective function shortly.” [P. 3, Section: 3] “… we employ a two-headed AutoRec structure instead of standard AutoRec as seen in Figure 1 (middle and right). The motivation of this structure naturally comes from the fact that while NS and NCE are good for training embeddings (where “depopularization” of the embedding effectively leads to more nuanced embedding models across the popularity spectrum), neither NS or NCE is a good end loss for the final recommendation task. To resolve this, we make a separate head for each objective – one to train the embedding via NS or NCE and the other to train for the recommendation task. Next we introduce NS-AutoRec and NCE-AutoRec as two alternative approaches to combat popularity bias.”)
obtaining a training dataset that comprises: (Zhou, [P. 5, Col. 2, Section: 4.1] “For each user of each dataset, we use 50% interactions as training set, 20% as validation set and 30% as test set according to timestamps. We split the Goodbooks and Yahoo dataset randomly due to lack of timestamps. We set a threshold η to binarize the ratings in each dataset such that ratings greater or equal than η become 1 and otherwise 0. The threshold is 80 for Yahoo dataset and 3 for the rest.”) [Examiner’s Note: the collected dataset from different sources for training (e.g., Goodbooks or Netflix).]
implicit user feedback data, the implicit user feedback data including data characterizing interactions between a plurality of users including the user, and a plurality of content items that were presented to the plurality of users, the implicit user feedback data including numerical values indicating whether the plurality of users interacted with the plurality of content items; and (Zhou, [P. 2, Section: 2.1] “we define our notation as follows: • R: The positive-only feedback matrix for m users and n items in the shape of m × n. Each entry r_{i,j} is either 1, which indicates there is an interaction between user i and item j, or 0 otherwise (no interaction). We use r_{:,j} to represent all user feedback for item j ∈ {1 · · · n}.”)
user review data, wherein the user review data includes a sequence of words from one or more reviews generated by the plurality of users, the one or more reviews associated with at least one content item of the plurality of content items; (Zhou, [P. 2, Col. 2, Section: 2.2] “Applying this sample-based triplet objective to Autoencoders was used by Word2Vec [11], which estimates the semantic meaning of words by exploiting word co-occurrence patterns. Word2Vec leveraged Negative Sampling (NS) according to an attenuated popularity distribution. With the success of word embeddings for natural language processing, researchers have also applied this training technique to many other application domains including recommender systems. For example, it is possible to achieve state-of-the-art performance on top-K recommendation tasks by simply borrowing the Word2Vec architecture without fundamental modifications [17].”)
for a two-headed attention fused autoencoder associated with the set of parameters, (Zhou, [Pp. 5-6, Section: 3.3] “We now describe four possible ways to optimize parameters of the two heads of NS-AutoRec and NCE-AutoRec as follows: Joint: We optimize all parameters through both objectives (heads and losses) jointly. Intuitively, we can jointly optimize the sum of NCE (or NS) and MSE objectives with stochastic gradient descent. …etc.”) wherein the two-headed attention fused autoencoder comprises an encoder coupled to a preference decoder and to a noise contrastive estimation (NCE) decoder that is separate from the preference decoder, (Zhou, [P. 3, Section: 3] “Figure 1: Architectures of the proposed recommender systems NS-AutoRec and NCE-AutoRec. (left) The original AutoRec architecture that reproduces its input through simple forward propagation. (middle) Negative Sampling enhanced AutoRec that jointly optimizes both NS (green) and MSE (blue) objectives. (right) Noise Contrastive Estimation enhanced AutoRec that learns an encoder network through optimizing the NCE objective (pink) and learns the decoder network through optimizing the MSE objective. The dashed arrows show backpropagation flows.” [Pp. 3-4, Section: 3.1] “Intuitively, we can jointly optimize the two objectives (one for each head) concurrently through a simple summation argmin_{θ,ϑ,ψ} L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ}, (5) where ϑ denotes the encoder parameter set, and θ, ψ represent parameter sets of MSE and NS decoders, respectively. …, Note the encoding network f_ϑ is intentionally shared for both decoder loss objectives L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ}. …, This reformulation allows us to optimize a lower-bound of the objective function by maximizing the two components 1 and 2 separately.”)
passing the set of … features through the noise contrastive estimation (NCE) decoder and the preference decoder; (Zhou, [P. 1, Section: 1] “we then train a separate decoder head that uses the learned item embeddings to perform the end task.” [P. 4, Section: 3.2] “3.2 NCE-AutoRec: In this section, we propose an alternative two-headed Autoencoder model called NCE-AutoRec, as shown in Figure 1 (right). Like NS-AutoRec, NCE-AutoRec also jointly minimizes the loss based on the sum of objectives for each head: argmin_{θ,ϑ,ψ} L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ}. (9) As before, we use r̂_i to represent the prediction from the MSE objective, whereas we use r̃_i to represent the prediction from the NCE objective. We start our discussion with the NCE objective head. …, where the prediction of the NCE head is r̃_i = f_ϕ ∘ f_ϑ(r_i) (11) and scalar r̃_{i,j} represents the j-th entry of the vector r̃_i.”)
obtaining a first error term obtained from a first loss function associated with the NCE decoder; obtaining a second error term obtained from a second loss function, different from the first loss function, associated with the preference decoder; (Zhou, [P. 4, Section: 3.2] “3.2 NCE-AutoRec: In this section, we propose an alternative two-headed Autoencoder model called NCE-AutoRec, as shown in Figure 1 (right). Like NS-AutoRec, NCE-AutoRec also jointly minimizes the loss based on the sum of objectives for each head: argmin_{θ,ϑ,ψ} L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ}. (9) As before, we use r̂_i to represent the prediction from the MSE objective, whereas we use r̃_i to represent the prediction from the NCE objective. We start our discussion with the NCE objective head. …, where the prediction of the NCE head is r̃_i = f_ϕ ∘ f_ϑ(r_i) (11) and scalar r̃_{i,j} represents the j-th entry of the vector r̃_i.”)
backpropagating a third error term to update the set of parameters associated with the recommendation model, wherein the third error term is calculated based on the first error term generated from the NCE decoder and the second error term generated from the preference decoder; (Zhou, [Pp. 3-4, Section: 3.1] “Intuitively, we can jointly optimize the two objectives (one for each head) concurrently through a simple summation argmin_{θ,ϑ,ψ} L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ}, (5) where ϑ denotes the encoder parameter set, and θ, ψ represent parameter sets of MSE and NS decoders, respectively. …, Note the encoding network f_ϑ is intentionally shared for both decoder loss objectives L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ}. …, This reformulation allows us to optimize a lower-bound of the objective function by maximizing the two components 1 and 2 separately.” [P. 4, Section: 3.2] “3.2 NCE-AutoRec: In this section, we propose an alternative two-headed Autoencoder model called NCE-AutoRec, as shown in Figure 1 (right). Like NS-AutoRec, NCE-AutoRec also jointly minimizes the loss based on the sum of objectives for each head: argmin_{θ,ϑ,ψ} L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ}. (9) As before, we use r̂_i to represent the prediction from the MSE objective, whereas we use r̃_i to represent the prediction from the NCE objective.” [P. 3, Figure. 1] “The dashed arrows show backpropagation flows.” [Pp. 4-5, Section 3.3] “Joint: We optimize all parameters through both objectives (heads and losses) jointly. Intuitively, we can jointly optimize the sum of NCE (or NS) and MSE objectives with stochastic gradient descent. …, We optimize the NCE (or NS) objective until it is fully converged. We then optimize an MSE objective that blocks backpropagation to the encoder network by freezing the encoder weights. .., Full Fine-tune: We optimize the NCE (or NS) objective until it is fully converged. We then optimize the MSE objective without blocking backpropagation. This allows the MSE objective to fine-tune the embeddings learned through the NCE (or NS) objective. We treat the choice of Joint, Alternating, Limited Fine-tune and Full Fine-tune optimization methodologies as a training hyperparameter for NS-AutoRec and NCE-AutoRec.”)
stopping the backpropagation after the third error term satisfies a predetermined criteria; (Zhou, [P. 3, Figure. 1] “The dashed arrows show backpropagation flows.” [Pp. 4-5, Section 3.3] “Joint: We optimize all parameters through both objectives (heads and losses) jointly. Intuitively, we can jointly optimize the sum of NCE (or NS) and MSE objectives with stochastic gradient descent. …, We optimize the NCE (or NS) objective until it is fully converged. We then optimize an MSE objective that blocks backpropagation to the encoder network by freezing the encoder weights. .., Full Fine-tune: We optimize the NCE (or NS) objective until it is fully converged. We then optimize the MSE objective without blocking backpropagation. This allows the MSE objective to fine-tune the embeddings learned through the NCE (or NS) objective. We treat the choice of Joint, Alternating, Limited Fine-tune and Full Fine-tune optimization methodologies as a training hyperparameter for NS-AutoRec and NCE-AutoRec.”)
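For illustration only, the two-headed structure mapped above (a shared encoder with two decoder heads whose losses are summed for joint optimization) can be sketched as follows; the toy linear layers and the use of a second mean-squared term as a stand-in for the actual NCE objective are assumptions for brevity, not Zhou's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, h = 8, 6, 3
R = (rng.random((m, n)) > 0.5).astype(float)  # binary implicit-feedback matrix

# Shared encoder and two decoder heads, all as toy linear maps.
enc = rng.standard_normal((h, n)) * 0.1       # shared encoder (placeholder for f_theta)
dec_pref = rng.standard_normal((n, h)) * 0.1  # preference (MSE) head
dec_nce = rng.standard_normal((n, h)) * 0.1   # second head (NCE stand-in)

def joint_objective(R, enc, dec_pref, dec_nce):
    """Sum the two heads' losses over a shared encoding, mirroring the
    joint optimization in Zhou's Eqs. (5)/(9); the second head's loss is a
    simple placeholder, not the actual NCE objective."""
    z = R @ enc.T                                   # shared latent codes
    loss_pref = np.mean((R - z @ dec_pref.T) ** 2)  # first error term
    loss_nce = np.mean((R - z @ dec_nce.T) ** 2)    # second error term
    return loss_pref, loss_nce, loss_pref + loss_nce  # third (joint) term

l1, l2, joint = joint_objective(R, enc, dec_pref, dec_nce)
```

Because the encoder appears in both loss terms, gradients of the summed (third) term update the shared parameters from both heads, which is the sense in which the joint objective drives a single backpropagation signal.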
As noted above, Zhou teaches the use of a user-item interaction dataset and applying Word2Vec embedding as part of the two-headed autoencoder training. Zhou also teaches an encoder to generate feature representations of the user-item interaction data. Furthermore, Zhou teaches the training and optimization of the model parameters. However, Zhou does not appear to explicitly teach the following:
obtaining …, user review data, wherein the user review data includes a sequence of words from one or more reviews generated by the plurality of users, the one or more reviews associated with at least one content item of the plurality of content items;
generating a set of fused features based on the training dataset using the encoder by fusing separately encoded implicit feedback data and user review data; wherein the fusing includes applying an early fusion stage that combines review feature data to a summarized user review feature vector based on attention weights determined by the implicit feedback data and applying a late fusion stage that combines the summarized review feature vector with the implicit feedback data to determine a set of fused features based on cross-modal attention weights;
storing a subset of the set of parameters on the computer readable storage medium as a set of trained parameters of the recommendation model.
However, Zhou in view of Ma teaches the limitation:
obtaining a training dataset that comprises: implicit user feedback data, the implicit user feedback data including data characterizing interactions between a plurality of users including the user, and a plurality of content items that were presented to the plurality of users, the implicit user feedback data including numerical values indicating whether the plurality of users interacted with the plurality of content items; and user review data, wherein the user review data includes a sequence of words from one or more reviews generated by the plurality of users, the one or more reviews associated with at least one content item of the plurality of content items; (Ma, Figure 1, [Pp. 1-2, Section: I] “The encoder of the stacked AE encodes the user’s implicit feedback on a certain item into the item’s hidden representation. Then the word-attention module learns the item embedding from its sequence of words, where the informative words can be adaptively selected without using complex recurrent or convolutional neural networks. To smoothly fuse the representations of items’ ratings and descriptions, we propose a neural gating layer to extract and merge the salient parts of these two hidden representations, which is inspired by the long short-term memory (LSTM) [12]. Moreover, item-item relations provide important auxiliary information to predict users’ preferences, since closely related items may have the same topics or attributes.” [P. 3, Sections 3 & 4] “The recommendation task considered in this paper takes implicit feedback [14] as the training and test data. The user preferences are presented by an m-by-n binary matrix R. The entire collection of n items is represented by a list of documents D, where each document in D is represented by a sequence of words. The item relations are presented by a binary adjacent matrix N ∈ R^{n×n}, where N_{ij} = 1 if items i and j are related or connected. Given the item descriptions D, the item relations N, and part of the ratings in R, the problem is to predict the rest of the ratings in R.”)
generating a set of fused features based on the training dataset using the encoder by fusing separately encoded implicit feedback data and user review data; (Ma, Figure 1, [Abstract] “personalized recommender systems still face several challenging problems: (1) the hardness of exploiting sparse implicit feedback; (2) the difficulty of combining heterogeneous data. To cope with these challenges, we propose a gated attentive-autoencoder (GATE) model, which is capable of learning fused hidden representations of items’ contents and binary ratings, through a neural gating structure. Based on the fused representations, our model exploits neighboring relations between items to help infer users’ preferences. In particular, a word-level and a neighbor-level attention module are integrated with the autoencoder. The word-level attention learns the item hidden representations from items’ word sequences, while favoring informative words by assigning larger attention weights.” [P. 2, Col. 1, Section: 1] “To effectively fuse the hidden representations of items’ contents and ratings, we propose a neural gating layer to extract and combine the salient parts of them.” [P. 4] “Figure 1: The architecture of GATE. The yellow part is the stacked AE for binary rating prediction, and the green part is the word-attention module for item content. The blue rectangle is the gating layer to fuse the hidden representations. The middle pink part is the neighbor-attention module to obtain the hidden representation of an item’s neighborhood. Specifically, Word_Att denotes the word-attention layer, Neighbor_Att denotes the neighbor-attention layer, and Agg_Layer denotes the aggregation layer. ⊙ is the element-wise multiplication and ⊕ is the element-wise addition.” [Pp. 4-5, Section: 4.3] “The gate G and the fused item hidden representation z_i^g are computed by: … (See Equation (7)) … where W_{g1} ∈ R^{h×h}, W_{g2} ∈ R^{h×h}, and b_g ∈ R^h are the parameters in the gating layer. By using a gating layer, the salient parts from these two hidden representations can be extracted and smoothly combined.”) ... applying a late fusion stage that combines the summarized review feature vector with the implicit feedback data to determine a set of fused features based on cross-modal attention weights; (Ma, [Pp. 4-6, Sections: 4.2-4.4] “Given word embeddings of an item D_i, a vanilla attention mechanism to compute the attention weights is represented by a two-layer neural network: a_i = softmax(w_{a1}^⊤ tanh(W_{a2} D_i + b_{a2})), (2) where w_{a1} ∈ R^h, W_{a2} ∈ R^{h×h}, and b_{a2} ∈ R^h are the parameters to be learned; the softmax(·) ensures all the computed weights sum up to 1. Then we sum up the embeddings in D_i according to the weights provided by a_i to get the vector representation of the item ... z_i^c = Σ_{e_j ∈ D_i} a_{i,j} e_j. (3) .... we adopt a matrix instead of a_i to capture the multi-dimensional attention and assign an attention weight vector to each word embedding. Each dimension of the attention weight vector represents an aspect of relations among all embeddings in D_i. Suppose we want d_a aspects of attention to be extracted from the embeddings, then we extend w_{a1} to W_{a1} ∈ R^{d_a×h}, which behaves like a high level representation of a fixed query "what are the informative words" over other words in the text: A_i = softmax(W_{a1} tanh(W_{a2} D_i + b_{a2}) + b_{a1}), (4) where A_i ∈ R^{d_a×l_i} is the attention weight matrix, b_{a1} ∈ R^{d_a} is the bias term, and the softmax is performed along the second dimension of its input. By multiplying the attention weight matrix with word embeddings, we have the matrix representation of an item: Z_i^c = A_i D_i^⊤, (5) ... Then we have another neural layer to aggregate the item matrix representation into a vector representation. The hidden representation of the item is revised as: z_i^c = a_t(Z_i^{c⊤} w_t), (6) where w_t ∈ R^{d_a} is the parameter in the aggregation layer and a_t(·) is the activation function. (Section: 4.2) ... The gate G and the fused item hidden representation z_i^g are computed by: G = sigmoid(W_{g1} z_i^r + W_{g2} z_i^c + b_g), z_i^g = G ⊙ z_i^r + (1 − G) ⊙ z_i^c, (7) ... By using a gating layer, the salient parts from these two hidden representations can be extracted and smoothly combined. (Section: 4.3) ... To simultaneously capture users’ preferences on a certain item and its neighborhood, the decoder in Eq. 1 is rewritten as: z_i^{(3,g)} = a_3(W_3 z_i^g + b_3), z_i^{(3,n)} = a_3(W_3 z_i^n + b_3), r̂_i = a_4(W_4 z_i^{(3,g)} + W_4 z_i^{(3,n)} + b_3), (9).”) [Examiner’s Note: Under the BRI of the claim, Ma teaches the late fusion; specifically, the proposed model includes a word-attention module, a gating layer, and a neighbor-attention module, to combine the summarized review feature vector (z_i^c) with implicit feedback data (z_i^r) to obtain fused feature data (e.g., z_i^g or z_i^n). (See Section 4 & Figure 1.)]
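The gating operation of Ma's Equation (7), as quoted above, can be illustrated numerically as follows; the random inputs and weight values are assumptions for the sketch only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_fuse(z_r, z_c, W_g1, W_g2, b_g):
    """Ma's Eq. (7): G = sigmoid(W_g1 z_r + W_g2 z_c + b_g);
    z_g = G * z_r + (1 - G) * z_c (element-wise combination)."""
    G = sigmoid(W_g1 @ z_r + W_g2 @ z_c + b_g)
    return G * z_r + (1.0 - G) * z_c

rng = np.random.default_rng(3)
h = 5
z_r = rng.standard_normal(h)  # rating (implicit-feedback) hidden representation
z_c = rng.standard_normal(h)  # content (review) hidden representation
z_g = gate_fuse(z_r, z_c, rng.standard_normal((h, h)),
                rng.standard_normal((h, h)), np.zeros(h))
```

Each component of the fused representation z_g is a convex combination of the corresponding rating and content components, so the gate smoothly interpolates between the two modalities.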
Accordingly, at the effective filing date, it would have been prima facie obvious to one of ordinary skill in the art of machine learning to modify the proposed recommender systems NS-AutoRec and NCE-AutoRec of Zhou to incorporate the recommendation model (gated attentive-autoencoder (GATE)) for content-aware recommendation as taught by Ma, to jointly train a recommendation model for predicting user-item interactions. One would have been motivated to make such a combination in order to (1) help users easily discover products that they are interested in, and (2) create opportunities for product providers to increase revenue. Doing so would enhance the top-N recommendation performance for content-aware recommendation by effectively fusing heterogeneous data and the content item relations (Ma, [Introduction]).
Zhou in view of Ma does not appear to explicitly teach:
applying an early fusion stage that combines review feature data to a summarized user review feature vector based on attention weights determined by the implicit feedback data.
storing a subset of the set of parameters on the computer readable storage medium as a set of trained parameters of the recommendation model.
However, Sun, in combination with Zhou and Ma, teaches the following:
applying an early fusion stage that combines review feature data to a summarized user review feature vector based on attention weights determined by the implicit feedback data (Sun, [P. 2, Section: 1] “In this paper, we propose an attentive deep review-based (ADR) model to address the aforementioned issues. Firstly, we assume that each user/item has her/its unique nuanced preferences/properties that could not be fully captured from review contents, which requires an auxiliary extraction from the user/item themselves. Secondly, for each item, we learn its salient features from the corresponding reviews. For each user, we aggregate her historical interactions to seek for salient features, where all the interacted items are fused adaptively by an attention mechanism. We devise an attention network for weight inference by jointly considering three aspects of information, which include both nuanced and salient features of the interacted items and the target item, as well as the nuanced preference of the user. The main contributions of this paper are as follows. – We propose a novel attentive deep review-based (ADR) recommendation model to address the challenge of accurately modeling users and items by involving both nuanced and salient features. In particular, items’ salient features are extracted from review contents and users’ salient features are the aggregation of their historical interacted items. – To effectively fuse a user’s historical feedback, we devise an attention network to infer the importance weights toward interacted items adaptively. To ensure the accuracy of inferred weights, we consider three aspects of information as the input for attention calculation, including both nuanced and salient features of the interacted items and target item, as well as the nuanced preference of the user.” [Pp. 4-5, Section: 3.1 Item representation] “In this case, the representation of each item could be expressed as the following equation:
e_j = f(w_{j,1}, w_{j,2}, ..., w_{j,l}), q_j = v_j + e_j
where wj,l denotes the lth word of item j’s review document; e j and v j denote item j’s salient and nuanced features, respectively, and thus q j is the feature vector of item j; f (·) is a review feature extractor.” [Pp. 5-6, Section: 3.2 User representation] “we extend the basic idea of SVD++ with adaptive weights inferred by attention mechanism. In our model, the salient feature of user i, denoted as gi , is represented by the fusion of her interacted items’ nuanced and salient features
: g_i = Σ_{j∈R_i^+} α_{i,j,k} (v_j + e_j)
where α_{i,j,k}
denotes user i’s attention to interacted item j when browsing current item k. The attention weight is inferred with an attention network which will be illustrated later. Finally, similar to item embedding, each user is represented by her nuanced and salient features:
p_i = u_i + Σ_{j∈R_i^+} α_{i,j,k} (v_j + e_j)
.” [Pp. 6-8, Section: 3.3 Attention inference] “The aim of this section is to infer the weights of a user’s historical feedback adaptively, which is further used to aggregate the user’s salient preference as illustrated in Sect. 3.2. Intuitively, when user i is browsing a new item k, her attention to the interacted item j should be high if items k and j are similar or correlated. Specifically, there are three aspects of information that can help the model to learn this pattern. First of all, as shown in Fig. 1, similar items tend to share common features that are saliently revealed in reviews. ... The architecture of our attention network is summarized as Fig. 2. The common procedure of calculating attention [7] is feeding various inputs into an attention network, which firstly transforms them into the attention space and then aggregates them together to pass into a fully connected layer. The process can be formalized as the following equation:
α_{i,j,k} = ω_1^⊺ σ(T_1^⊺ u_i + T_2^⊺ (q_j + q_k) + b_1) + b_2
where T1 and T2 indicate the matrices to transform user nuanced features and item feature, respectively. b1 denotes the bias vector of the first layer and b2 is the bias scalar of the second layer. w is the weight vector of the second layer. The calculated attention score vector is then transformed into a probability distribution vector by a softmax function as follows.
α_{i,j,k} = exp(α_{i,j,k}) / Σ_{j∈R_i^+} exp(α_{i,j,k})
.) [Examiner’s Note: Sun discloses an ADR model that uses an attention mechanism to fuse review embeddings into a user representation. The model extracts salient features from item review documents to obtain review embeddings (i.e., review feature data) and aggregates them using an attention mechanism to create a user representation. Specifically, the model computes attention weights
α_{i,j,k} that depend on the current user-item pair (i, k) and the user preference u_i learned from interaction history (i.e., implicit feedback data). These attention weights are then used to aggregate review features:
g_i = Σ_{j∈R_i^+} α_{i,j,k} (v_j + e_j), where e_j represents the review-based feature and α_{i,j,k} represents the attention weights determined based on user interaction (implicit feedback). The resulting g_i serves as the summarized representation (i.e., user preference based on reviews from their interaction history).] and applying a late fusion stage that combines the summarized review feature vector with the implicit feedback data to determine a set of fused features based on cross-modal attention weights; (Sun, [Pp. 7-8, Section: 3.4] “The middle module is the attention module, where the feature vectors of a user’s historical interacted items are fused with different weights inferred by the attention network. It takes three aspects of information, user nuanced features, item nuanced features, and item salient features, as the attention source to form the user’s embedding representation as illustrated in Sect. 3.3. The top module is the prediction module, where item/user feature vectors are built according to Sects. 3.1 and 3.2, respectively. Then, we take the inner product of user/item feature vectors as the predicted preference:
x̂_{i,k} = (p_i + Σ_{j∈R_i^+} α_{i,j,k} (v_j + e_j))^⊺ (v_k + e_k)
.” Fig. 3 Our proposed ADR model consisting of three modules, namely embedding, attention and prediction modules)
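For illustration only (examiner-added sketch, not part of the cited reference): Sun’s attention-weighted aggregation and inner-product prediction quoted above can be traced numerically as follows. All dimensions and parameter values are hypothetical, and tanh stands in for the unspecified activation σ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_hist = 4, 3                           # feature dim, interacted items (hypothetical)

u_i = rng.standard_normal(d)               # user nuanced features
q_hist = rng.standard_normal((n_hist, d))  # q_j = v_j + e_j for each interacted item j
q_k = rng.standard_normal(d)               # target item features q_k = v_k + e_k

# Attention-network parameters (learned in the reference; random placeholders here)
T1 = rng.standard_normal((d, d))
T2 = rng.standard_normal((d, d))
b1 = np.zeros(d)
b2 = 0.0
w1 = rng.standard_normal(d)

# Raw scores: ω1^T σ(T1^T u_i + T2^T (q_j + q_k) + b1) + b2, one per interacted item
scores = np.array([w1 @ np.tanh(T1.T @ u_i + T2.T @ (q_j + q_k) + b1) + b2
                   for q_j in q_hist])

# Softmax normalization so the attention weights α_{i,j,k} sum to 1
alpha = np.exp(scores) / np.exp(scores).sum()

# Salient user feature g_i: attention-weighted sum of interacted-item features
g_i = (alpha[:, None] * q_hist).sum(axis=0)

# Predicted preference: inner product of the user and target-item vectors
x_hat = (u_i + g_i) @ q_k
```

The softmax step is what makes the weights a probability distribution over the user’s history, so items more correlated with the target item dominate the aggregation.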
Therefore, before the effective filing date, it would have been prima facie obvious to one of ordinary skill in the art to modify the combination of Zhou and Ma to incorporate the architecture of ADR as taught by Sun. One would have been motivated to make such a combination in order to accurately model users and items by involving both nuanced and salient features. Doing so would improve the accuracy and diversity of recommender systems (Sun [Abstract]).
The combination of Zhou, Ma, and Sun does not appear to explicitly teach:
storing a subset of the set of parameters on the computer readable storage medium as a set of trained parameters of the recommendation model …
However, Tavernier, in combination with Zhou, Ma, and Sun, teaches the limitation:
storing a subset of the set of parameters on the computer readable storage medium as a set of trained parameters of the recommendation model, the subset of the set of parameters associated with the encoder and the preference decoder. (Tavernier, [Col. 10, Lines 10-13] “Upon completion of training, the trained system of models is stored for use in analyzing future queries submitted by the particular user.” [Col. 11, Lines 45-50] “the sequence to sequence model includes an encoder reading the input string and generating a representation of it, and a decoder that generates the output sequence …., The transformer network similarly has an encoder and a decoder.” [Col 12, Lines 1-3] “These trained machine learning models can then later be used to generate recommendations for users as they interact with the electronic catalog.” [Col. 17, Lines 25-40] “The trained models data repository 438 comprises one or more physical data storage devices that stores the parameters of machine learning models trained as described herein. For example, the trained models data repository 438 can store the parameters of a search intent prediction model trained according to the process 200. The interactive computing system 400 can communicate over network 404 with user devices 402. The network 404 can include any appropriate network, including an intranet, the Internet, a cellular network, a local area network or any other such network or combination thereof. User devices 402 can include any network-equipped computing device, for example desktop computers, laptops, smartphones, tablets, e-readers, gaming consoles, and the like. Users can access the interactive computing system 400 and interact with items therein via the network 404 and can be provided with recommendations via the network 404.”)
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, having the combination of Zhou, Ma, and Sun before them, to incorporate the trained models data repository which stores the model parameters upon training completion as taught by Tavernier. One would have been motivated to make such a combination in order to allow users to execute the trained machine learning model to generate recommendations for users as they interact with the electronic catalog. Doing so would significantly reduce this problem, allowing users to locate items of interest with fewer steps (Tavernier [Col. 6, Lines 28-45]).
Regarding Original Claim 2, the combination of Zhou, Ma, Sun, and Tavernier teaches the elements of claim 1 as outlined above, and further teaches:
wherein the encoder of the two-headed attention fused autoencoder comprises: (Zhou, [P. 3, Section: 3] “we employ a two-headed AutoRec structure instead of standard AutoRec as seen in Figure 1 (middle and right).” [P. 3, Section: 3] “Figure 1: Architectures of the proposed recommender systems NS-AutoRec and NCE-AutoRec. (left) The original AutoRec architecture that reproduces its input through simple forward propagation. (middle) Negative Sampling enhanced AutoRec that jointly optimizes both NS (green) and MSE (blue) objectives. (right) Noise Contrastive Estimation enhanced AutoRec that learns an encoder network through optimizing the NCE objective (pink) and learns the decoder network through optimizing the MSE objective. The dashed arrows show backpropagation flows.”) a preference encoder … (Zhou, [Pp. 3-4, Section: 3.1] “Intuitively, we can jointly optimize the two objectives (one for each head) concurrently through a simple summation
argmin_{θ, ϑ, ψ} (L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ})
, (5) where ϑ denotes the encoder parameter set, and
θ, ψ
represent parameter sets of MSE and NS decoders, respectively…., Note the encoding network
f_ϑ
is intentionally shared for both decoder loss objectives
L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ}.”)
Zhou in view of Ma further teaches: a preference encoder that takes the implicit user feedback data as input, and outputs a set of embedded preference feature vectors characterizing the implicit user feedback data. (Ma, Figure 1, [P. 3, Section: 4.1] “Inspired by the recent works using autoencoders (AEs) to model explicit feedback [31] and implicit feedback [39], we also adopt AE as our base building block due to its ability to learn richer representations and the close relationship to MF [39]. To capture users’ preferences on an item, we apply a stacked AE to encode users’ binary ratings
r_i ∈ R^m on a certain item i into the item’s rating hidden representation z_i^r (the superscript r
indicates the hidden representation is learned from items’ binary ratings): … (See Equation (1)) …, where
W_1 ∈ R^{h_1×m}, W_2 ∈ R^{h×h_1}, W_3 ∈ R^{h_1×h}, and W_4 ∈ R^{m×h_1} are the weight matrices. m is the number of users, h_1 is the dimension of the first hidden layer, and h is the dimension of the bottleneck layer.
r_i is a multi-hot vector, where r_{u,i} = 1 indicates that the user u prefers the item i.”)
Regarding Original Claim 3, the combination of Zhou, Ma, Sun, and Tavernier teaches the elements of claim 2 as outlined above, and further teaches:
wherein the encoder of the two-headed attention fused autoencoder further comprises: (Zhou, [P. 3, Section: 3] “we employ a two-headed AutoRec structure instead of standard AutoRec as seen in Figure 1 (middle and right).” [P. 3, Section: 3] “Figure 1: Architectures of the proposed recommender systems NS-AutoRec and NCE-AutoRec. (left) The original AutoRec architecture that reproduces its input through simple forward propagation. (middle) Negative Sampling enhanced AutoRec that jointly optimizes both NS (green) and MSE (blue) objectives. (right) Noise Contrastive Estimation enhanced AutoRec that learns an encoder network through optimizing the NCE objective (pink) and learns the decoder network through optimizing the MSE objective. The dashed arrows show backpropagation flows.”)
a review encoder that takes the user review feedback data as input and outputs a set of embedded review feature vectors, (Ma, Figure 1, [P. 3, Section: 4.2] “Embedding Layer. In the proposed module, the input of item
i is a sequence of l_i words from its text description, where each word is represented as a one-hot vector. At the embedding layer, the one-hot encoded vector is converted into a low-dimensional real-valued dense vector representation by a word embedding matrix E ∈ R^{h×v}, where h is the dimension of the word embedding and v is the size of the vocabulary. After converted by the embedding layer, the item text is represented as: …, where D_i ∈ R^{h×l_i}, and e_j ∈ R^h.”) wherein the set of embedded review feature vectors are generated based on the one or more reviews. (Ma, [P. 1, Section: 1] “item descriptions, e.g., users’ ratings on movies and movies’ plots.” [P. 5, Section: 5.1] “We select the user review with the highest helpfulness rating as the item’s description.”)
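For illustration only (examiner-added sketch, not part of the cited reference): the embedding-layer lookup described in Ma’s Section 4.2, where one-hot word vectors are converted to dense columns of D_i by the embedding matrix E, can be sketched as follows. The vocabulary size, dimensions, and word indices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
h, v, l_i = 4, 10, 3              # embedding dim, vocabulary size, words in item i (hypothetical)

E = rng.standard_normal((h, v))   # word embedding matrix E ∈ R^{h×v}
word_ids = np.array([2, 7, 5])    # the item's word sequence as vocabulary indices

# Each one-hot column selects one column of E, yielding D_i ∈ R^{h×l_i}
one_hot = np.eye(v)[word_ids].T   # shape (v, l_i)
D_i = E @ one_hot

# Direct column indexing is the cheaper, equivalent lookup
D_i_fast = E[:, word_ids]
```

Both forms preserve word order: the jth column of D_i is the embedding e_j of the jth word in the item text.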
Regarding Original Claim 4, the combination of Zhou, Ma, Sun, and Tavernier teaches the elements of claim 3 as outlined above, and further teaches:
wherein the review encoder further comprises a word attention module that assigns attention weights to each word embedding in a review, the word attention module generating a review summarization feature vector for each review. (Ma, Figure 1, [Pp. 1-2, Section: 1] “GATE consists of a word-attention module, a neighbor-attention module, and a neural gating structure, integrating with a stacked autoencoder (AE). The encoder of the stacked AE encodes the user’s implicit feedback on a certain item into the item’s hidden representation. Then the word-attention module learns the item embedding from its sequence of words, where the informative words can be adaptively selected without using complex recurrent or convolutional neural networks. …., To learn the hidden representations from items’ sequences of words, we apply a word-attention module to adaptively distinguish informative words, leading to better comprehension of the item content. Our word-attention module can achieve the same performance with complex recurrent or convolutional neural networks yet with fewer parameters.” [P. 3, Section: 4.2] “Compared to learning from items’ bag-of-words, the attention weights learned by our module adaptively select the informative words with different importances, and make the informative words contribute more to depict items. …, The goal of the word-attention is to assign different importances on words, then aggregate word embeddings in a weighted manner to characterize the item. Given word embeddings of an item Di , a vanilla attention mechanism to compute the attention weights is represented by a two-layer neural network: (See Equation (2)), where
w_{a1} ∈ R^h, W_{a2} ∈ R^{h×h}, and b_{a2} ∈ R^h are the parameters to be learned, the softmax(·) ensures all the computed weights sum up to 1. Then we sum up the embeddings in D_i according to the weights provided by a_i to get the vector representation of the item (the superscript c indicates the hidden representation is learned from items’ contents): …, However, assigning a single importance value to a word embedding usually makes the model focus on a specific aspect of an item content [22]. It can be multiple aspects in the item content that together characterize this item, especially when the number of words is large. … etc.”)
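For illustration only (examiner-added sketch, not part of the cited reference): the word-attention aggregation quoted above, a two-layer scoring network followed by a softmax and a weighted sum of word embeddings, can be sketched as follows. All parameter values are hypothetical, and tanh stands in for the unspecified first-layer activation.

```python
import numpy as np

rng = np.random.default_rng(3)
h, l = 4, 5                        # embedding dim and number of words (hypothetical)

D_i = rng.standard_normal((h, l))  # word embeddings of one item, one column per word

# Two-layer attention network parameters (learned in the reference; random here)
W_a2 = rng.standard_normal((h, h))
b_a2 = np.zeros((h, 1))
w_a1 = rng.standard_normal(h)

scores = w_a1 @ np.tanh(W_a2 @ D_i + b_a2)   # one importance score per word
a_i = np.exp(scores) / np.exp(scores).sum()  # softmax: weights sum to 1

# Weighted sum of word embeddings -> item content representation z_i^c
z_c = D_i @ a_i
```

Informative words receive larger weights and therefore contribute more to the item representation, which is the adaptive selection behavior the quoted passage attributes to the module.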
Regarding Original Claim 5, the combination of Zhou, Ma, Sun, and Tavernier teaches the elements of claim 3 as outlined above, and further teaches:
wherein the generation of the set of embedded review feature vectors further comprises, concatenating a set of review representation with a set of preference representation. (Ma, [P. 4, Col. 2, Section: 4.3] “We have obtained the item hidden representations from two heterogeneous data sources, i.e., the binary ratings and the content descriptions of items. The next aim is to combine these two kinds of hidden representations to facilitate the user preference prediction on unrated items. Unlike previous works [19, 37] regularizing these two kinds of hidden representations, we propose a neural gating layer to adaptively merge them.”)
Regarding Original Claim 9, the combination of Zhou, Ma, Sun, and Tavernier teaches the elements of claim 1 as outlined above, and further teaches:
wherein the NCE decoder comprises one or more feedforward neural network layers, wherein the NCE decoder reduces popularity bias by increasing the likelihood that the user will interact with the plurality of content items based on the implicit user feedback data. (Zhou, [P. 2, Section: 2.2] “AutoRec aims to minimize a Mean Squared Error (MSE) objective function (See Equation (1)) where we use Ω to represent all parameters in the AutoRec model. The prediction vector
r̂_i
is produced through forward propagation of the AutoRec network architecture as shown in Figure 1 (left).” [p. 3, Figure 1] “Architectures of the proposed recommender systems NS-AutoRec and NCE-AutoRec. (left) The original AutoRec architecture that reproduces its input through simple forward propagation. (middle) Negative Sampling enhanced AutoRec that jointly optimizes both NS (green) and MSE (blue) objectives. (right) Noise Contrastive Estimation enhanced AutoRec that learns an encoder network through optimizing the NCE objective (pink) and learns the decoder network through optimizing the MSE objective. The dashed arrows show backpropagation flows.” [P. 4, Col. 1, Section: 1] “We will see empirically that NS-AutoRec does produce improved embeddings for AutoRec that reduce popularity bias.” [P. 4, Section: 3.2] “Similar to NS-AutoRec described previously, the item probability
p(j')
is a re-scaled empirical item popularity as described in Equation 6. …, With the analytical solution
R*
of the NCE objective, we, then, aim to maximize the objective component 2 to reduce the gap between the Autoencoder prediction and the analytical solution
R*
…, which serves as the NCE head objective in NCE-AutoRec.”)
Regarding Original Claim 10, the combination of Zhou, Ma, Sun, and Tavernier teaches the elements of claim 1 as outlined above, and further teaches:
wherein the preference decoder comprises one or more feedforward neural network layers, (Zhou, Figure 1, [P. 2, Section: 2.2] “The prediction vector
r̂_i
is produced through forward propagation of the AutoRec network architecture as shown in Figure 1 (left).) wherein the preference decoder generates a plurality of probabilities (Zhou, [P. 4, Section: 3.2] “Autoencoder prediction and the analytical solution
R*
. … (See Equation (17)) which serves as the NCE head objective in NCE-AutoRec. The MSE objective of NCE-AutoRec is identical to the one in NS-AutoRec, as shown in Equation (8).”)
Zhou in view of Ma further teaches: wherein the preference decoder generates a plurality of probabilities corresponding to the plurality of content items, the plurality of probabilities indicating likelihoods that the user will interact with the plurality of content items. (Ma, [P. 3, Section: 3] “Given the item descriptions D, the item relations N, and part of the ratings in R, the problem is to predict the rest of ratings in R.” [P. 8, Section: 5.6] “This result demonstrates that the proposed word-attention module can effectively learn the item hidden representation from items’ descriptions. Third, from (1), (3), and (6), we observe that our neighbor-attention may play a critical role in the overall model. The results demonstrate that modeling users’ preferences on an item’s neighborhood is an effective supplementary for inferring their preferences on this item.”) [Note: the output vector
r̂_i
represents the likelihood that the user will interact with an item.]
Regarding Original Claim 11, the combination of Zhou, Ma, Sun, and Tavernier teaches the elements of claim 1 as outlined above, and further teaches:
wherein the third error term is calculated as a linear combination of the first error term from the NCE decoder and the second error term from the preference decoder. (Zhou, [P. 3, Section: 3.1] “we can jointly optimize the two objectives (one for each head) concurrently through a simple summation
argmin_{θ, ϑ, ψ} (L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ})
, (5).” [P. 4, Section: 3.2] “Like NS-AutoRec, NCE-AutoRec also jointly minimizes the loss based on the sum of objectives for each head:
argmin_{θ, ϑ, ψ} (L^{MSE}_{θ,ϑ} + L^{NS}_{ψ,ϑ})
. (9) As before, we use
r̂_i
to represent the prediction from the MSE objective, whereas we use
r̃_i
to represent the prediction from the NCE objective.”) [Note: the linear combination of both loss functions represents the third error term.]
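For illustration only (examiner-added sketch, not part of the cited reference): the joint objective quoted above sums the per-head losses, so the combined (third) error term is a linear combination of the MSE-head and NCE-head terms. The squared-error stand-in used below for the NCE head, and all values, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
r = (rng.random(6) < 0.5).astype(float)  # observed implicit feedback (multi-hot)
r_hat = rng.random(6)                    # MSE-head prediction (r̂_i)
r_tilde = rng.random(6)                  # NCE-head prediction (r̃_i)

loss_mse = np.mean((r - r_hat) ** 2)     # MSE-head error term
loss_nce = np.mean((r - r_tilde) ** 2)   # squared-error stand-in for the NCE-head term

# Joint objective: simple (unweighted) linear combination of the two head losses,
# as in the argmin of the quoted Equations (5) and (9)
loss_total = loss_mse + loss_nce
```

Minimizing the summed objective backpropagates both head errors through the shared encoder, which is the joint training the quoted passages describe.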
Claim(s) 7-8 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhou, Ma, Sun, and Tavernier as outlined above, and further in view of Ni et al., (Pub. No.: US 20210065278 A1).
Regarding Original Claim 7, the combination of Zhou, Ma, Sun, and Tavernier teaches the elements of claim 3 as outlined above, and further teaches:
Ma describes the long short-term memory (LSTM) as part of the Gating Layer of the network architecture.
Zhou, Ma, Sun, and Tavernier do not appear to explicitly teach:
wherein the set of embedded review feature vectors are generated by using one or more bidirectional LSTM (long short-memory) neural networks.
However, Ni, in combination with Zhou, Ma, Sun, and Tavernier, teaches the limitation:
wherein the set of embedded review feature vectors are generated by using one or more bidirectional LSTM (long short-memory) neural networks. (Ni, [0057] “Any suitable neural network technique can be used to perform the encoding at block 220 in accordance with the embodiments described herein. Examples of suitable neural network techniques include, but are not limited to, Bidirectional Long Short-term Memory (BiLSTM), Convolutional Neural Network (CNN), Bidirectional Encoder Representations from Transformers (BERT), etc. Further details regarding block 220 are described above with reference to FIG. 1 and will now be described below with reference to FIG. 3.” Further see [0027].)
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, having the combination of Zhou, Ma, Sun, and Tavernier before them, to incorporate the system/method for implementing a recommendation system using an asymmetrically hierarchical network (AHN), which uses Bidirectional Encoder Representations from Transformers for feature embedding, as taught by Ni. One would have been motivated to make such a combination in order to incorporate a gating mechanism within multiple attention weight modules to improve model performance. Doing so would enable the dynamic and hierarchical construction of effective user and item embeddings based on the most relevant information, thereby improving personalized recommendation accuracy and model interpretability (Ni [0016]).
Regarding Original Claim 8, the combination of Zhou, Ma, Sun, and Tavernier teaches the elements of claim 3 as outlined above, and further teaches:
As noted above, Ma also teaches the use of a word-attention module and a weighted loss for the content item embedding vector and the user preference embedding vector derived from implicit feedback.
Zhou, Ma, Sun, and Tavernier do not appear to explicitly teach:
generating modal attention weights based on the set of embedded preference feature vectors and the set of embedded review feature vectors; and generating the set of fused features by aggregating the set of embedded preference feature vectors and the set of embedded review feature vectors based on the modal attention weights.
However, Ni, in combination with Zhou, Ma, Sun, and Tavernier, teaches the limitations:
generating modal attention weights based on the set of embedded preference feature vectors and the set of embedded review feature vectors; and generating the set of fused features by aggregating the set of embedded preference feature vectors and the set of embedded review feature vectors based on the modal attention weights. (Ni, [0004] “The method further includes aggregating, using asymmetrically designed sentence aggregators, respective ones of the set of item sentence embeddings and the set of user sentence embeddings to generate a set of item review embeddings based on first item attention weights and a set of user review embeddings based on first user attention weights, respectively. The method further includes aggregating, using asymmetrically designed review aggregators, respective ones of the set of item review embeddings and the set of user review embeddings to generate an item embedding based on a second item attention weights and a user embedding based on second user attention weights, respectively. The method further includes predicting a rating of the user-item pair based on the item embedding and the user embedding.” [0051] “ the attention weights for the reviews of the user, βu, can be calculated by the URA 124 to adapt G to encode important reviews of the item by: βu=softmax(maxrow(G⊙ rowβv)) (12) where maxrow refers to row-wise max-pooling for obtaining the maximum affinity, ⊙row refers to the Hadamard product between each row, and βv=[β1 v, . . . βm v] (from Eq. (10)). Finally, the review embeddings can be aggregated by the URA 124 to generate the aggregated user review embedding 125, …” [0052] “The prediction layer 130 includes a component 132 configured to predict a rating of the user-item pair. 
More specifically, the component 132 is configured to receive the final item embedding 128-1 and the final user embedding 128-2, concatenate the final embeddings 128-1 and 128-2 to generate a final concatenated embedding, and feed the final concatenated into a predictive function to predict a rating of the user-item pair.”)
The same motivation that was utilized for combining Zhou, Ma, Sun, Tavernier, and Ni as set forth in claim 7 is equally applicable to claim 8.
Claim(s) 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Liu et al., (NPL: "Adaptive Hierarchical Attention-Enhanced Gated Network Integrating Reviews for Item Recommendation" (2020)) in view of Ameur et al., (NPL: "A deep neural network model for predicting user behavior on Facebook." (2019)).
Regarding Amended Claim 14, Liu discloses the following:
A non-transitory computer-readable medium storing instructions executable by one or more processors for: (Liu, [Abstract] “In this paper, we propose a novel Adaptive Hierarchical Attention-enhanced Gated network integrating reviews for item recommendation, named AHAG. AHAG is a unified framework to capture the hidden intentions of users by adaptively incorporating reviews.” [P.2, Col. 2] “Section 4 describes the overall framework of the proposed AHAG model in detail, as well as the algorithm learning.” [P. 3] “Fig. 1. Overview of the proposed AHAG model, which consists of the following key components: (1) A feature learning component: learning the features of reviews and interaction data. (2) Dynamic interaction component: achieving dynamic fusion and interaction of features.” [P.8, Section: 5.1] “Multi-pointer co-attention network exploits the pointer-based co-attention mechanism to extract important information from user and item reviews. Then the multi-pointer learning scheme is used to combine multiple views of user-item interactions for item recommendation. We implement the model with the code from the authors. ... Parameter Settings. We implemented the baseline methods based on the strategies in the papers in which they were proposed and repeated the experiment 5 times on the testing set to reduce the experimental error. ... All subsequently discussed components were implemented in Python3 using the TensorFlow library.”)
applying a first input branch of a recommendation model of trainable parameters comprising a preference encoder that is trained to generate a set of preference feature vectors characterizing a set of implicit user feedback data that describes user-item interactions as numerical values; (Liu, [P. 3, Figure. 1 and Section: 3] “Let
u and U = {u_1, u_2, ..., u_i, ..., u_M} denote a user and the entire user set, respectively; similarly, i and I = {i_1, i_2, …, i_j, …, i_N} are used to denote an item and the entire item set, respectively. R ∈ R^{M×N} denotes the interaction matrix, which can be explicit real-valued ratings or implicit binary 0/1 feedback. … . Let x and X = {x_1, x_2, …, x_i, …, x_M} denote a user review text and the whole user review text set, respectively; similarly, let s and S = {s_1, s_2, …, s_i, …, s_M} denote an item review text and the whole item review text set, respectively. The AHAG model can be formalized as follows: Input: The input of interaction data is the identity of users and items. We use one-hot encoded vectors v_u^U and v_i^I that describe user u ∈ U and item i ∈ I, respectively. The input of the review data is the user review text x_u ∈ X, which is the review data set of user u, and the item review text s_i ∈ S is the review data set of item i.” [P.5, Col. 2, Section: 4.1.2] “The one-hot encoded item and user identities are taken as item feature vector v_i^I and user feature vector v_u^U to describe user and item, respectively. Then, the feature vectors v_i^I and v_u^U are mapped to low-dimensional dense latent factor vectors through the latent factor matrices P_u ∈ R^{M×K} and Q_i ∈ R^{N×K} in the embedded layer, which are expressed as follows: where p_u and q_i denote the interaction features of user and item, respectively.” Fig. 1. Overview of the proposed AHAG model, which consists of the following key components: (1) A feature learning component: learning the features of reviews and interaction data. (2) Dynamic interaction component: achieving dynamic fusion and interaction of features.) [Examiner’s Note: Figure 1 provides an overview of the AHAG model (the recommendation model), which consists of a rating-based feature component that includes separate inputs processing user interaction and item review information using embedding layers (i.e., a preference encoder and a review encoder). Liu’s rating-based feature learning constitutes a first input branch that encodes user-item interaction data into latent feature vectors (i.e., the preference feature vectors).]
applying a second input branch of the recommendation model, separate from the first input branch, comprising a review encoder that is trained to generate a set of review feature vectors characterizing a set of user review data that describes user reviews of items as a sequence of words; (Liu, [P. 3, Figure. 1 and Section: 3] “Let u and U = {u_1, u_2, ..., u_i, ..., u_M} denote a user and the entire user set, respectively; similarly, i and I = {i_1, i_2, …, i_j, …, i_N} are used to denote an item and the entire item set, respectively. R ∈ R^(M×N) denotes the interaction matrix, which can be explicit real-valued ratings or implicit binary 0/1 feedback. … Let x and X = {x_1, x_2, …, x_i, …, x_M} denote a user review text and the whole user review text set, respectively; similarly, let s and S = {s_1, s_2, …, s_M} denote an item review text and the whole item review text set, respectively. The AHAG model can be formalized as follows: Input: The input of interaction data is the identity of users and items. We use one-hot encoded vectors v_u^U and v_i^I that describe user u ∈ U and item i ∈ I, respectively. The input of the review data is the user review text x_u ∈ X, which is the review data set of user u, and the item review text s_i ∈ S is the review data set of item i.” [Pp. 4-5, Section: 4.1.1] “Review-Based Feature Learning: As shown in Fig. 2, our review-based feature learning uses three progressive steps of hierarchical attention. First, position self-attention layer provides a foundation for the subsequent extraction of contextual features by establishing long-term dependencies between review sequences. After that, high-order attention layer captures important semantic information by the multi-interaction of features. Using this basis, the dynamic interactions between user-item features are modeled by co-attention layer to capture the correlation of user-item feature pairs. ... Given a user u review information set x_u, which consists of l words and can express full meaning. First, word vector model Global Vectors (GloVe) [47] is used to map words in x_u into a word vector matrix F ∈ R^(d×l), the order of words is preserved in matrix F. ... where d is word embedding dimension, ... represents the word vector of the ith word in x_u.”) [Examiner’s Note: Figure 1 provides an overview of the AHAG model (recommendation model), which consists of a rating-based feature learning component that includes separate input processing of user interaction and item review information using embedding layers (i.e., a preference encoder and a review encoder). Liu’s review-based feature learning constitutes a second input branch that is structurally and functionally separate from the rating-based branch (i.e., a review encoder). The review attention mechanism processes words representing user review text to generate review feature vectors (h_u, h_i).]
applying an early fusion stage of the recommendation model that combines review feature vectors to a user review latent representation based on attention weights determined by the set of preference feature vectors; (Liu, [Abstract] “we design a gated network to dynamically fuse the extracted features and select the features that are most relevant to user preferences.” [P. 6, Section 4.2 & Fig. 1] “The dynamic interaction component exploits the features extracted by the feature learning component to achieve feature fusion and rating prediction. …. Fusion Gated Layer. Inspired by [44], [45], we propose the feature fusion gated layer to dynamically fusing the review features and rating features. The fusion of user review features and user rating features are as follows. s_u' = δ(w_u^f h_u + p_u + b_u) … s_u^f = s_u' h_u + (1 − s_u') p_u (16), where s_u' is the gate applied to new user review features h_u, and s_u^f is the fusion feature that combines user rating and review features. ... Because the merged feature contains review information, and it may introduce noise, we use a filter to combine the fusion feature and the original interaction feature. The range of filter function is from 0 to 1. If the fusion feature is beneficial to improve performance, the larger the value of the filter function is, otherwise, the smaller it is, to reduce the interference of noise. The filter s_u'' and user interaction features p_u are defined as follows: s_u'' = δ(w_u^p + w_u^f s_u^f) … s_u^ff = s_u'' tanh(w_u^ff s_u^f + b_u^ff) (17) ...”) [Examiner’s Note: Under the broadest reasonable interpretation of the claim limitation, the gating mechanism (s_u') is broadly interpreted as the claimed “attention weights,” since it uses the preference features (p_u) together with the review summary (h_u) to compute weights that selectively fuse the review and preference features, which is functionally equivalent to the claimed limitation. Liu particularly states: “dynamically fusing the review features and rating features.” The result of the gated fusion is s_u^f (i.e., the early fusion stage), which combines the review feature vectors (h_u) by incorporating the preference feature vectors (p_u) using gating weights (s_u').]
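For clarity, the gated blending of review and preference features in Liu's Equation (16) can be sketched as follows; the weight shapes, initialization, and the exact input to the gate are assumptions for illustration, not Liu's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sketch of a fusion gate in the style of Eq. (16): a gate s'
# computed from the review feature h_u and preference feature p_u blends the
# two features element-wise. All shapes and values are hypothetical.
K = 8
rng = np.random.default_rng(1)
h_u = rng.normal(size=K)            # review feature vector
p_u = rng.normal(size=K)            # preference (rating) feature vector
W_f = rng.normal(size=(K, K))       # gate weight matrix (assumption)
b_u = np.zeros(K)

s_gate = sigmoid(W_f @ (h_u + p_u) + b_u)    # gate values in (0, 1)
s_u_f = s_gate * h_u + (1.0 - s_gate) * p_u  # element-wise convex blend

# Each coordinate of the fusion lies between the two source features,
# which is what makes the gate act like a per-dimension attention weight.
assert np.all(s_u_f <= np.maximum(h_u, p_u) + 1e-9)
assert np.all(s_u_f >= np.minimum(h_u, p_u) - 1e-9)
```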
applying a late fusion stage of the recommendation model that combines the user review latent representation with the preference feature vectors to determine a set of fused feature representations based on cross-modal attention weights; (Liu, [Abstract] “we design a gated network to dynamically fuse the extracted features and select the features that are most relevant to user preferences.” [P. 6, Section 4.2 & Fig. 1] “The dynamic interaction component exploits the features extracted by the feature learning component to achieve feature fusion and rating prediction. …. we use a filter to combine the fusion feature and the original interaction feature. The range of filter function is from 0 to 1. If the fusion feature is beneficial to improve performance, the larger the value of the filter function is, otherwise, the smaller it is, to reduce the interference of noise. The filter s_u'' and user interaction features p_u are defined as follows: s_u'' = δ(w_u^p + w_u^f s_u^f) … s_u^ff = s_u'' tanh(w_u^ff s_u^f + b_u^ff) (17) ... u_u = p_u + s_u^ff, where w_u^p, w_u^f, w_u^ff, b_u^ff are parameters, and s_u^ff is the reserved feature that filters out noise. u_u is the review information enhanced user vector. We use the same method to derive the review information enhanced item vector v_i. Filter Gated Layer. A user usually focuses on a certain part of the item. Thus, we adopt a user-based filter gated layer to control the features propagated to the user preference prediction task. v_i^F = v_i ⊙ δ(u_u w_u + v_i w_i + b^F), (18) where v_i^F denotes the item features after being filtered, δ is the activation function, w_u ∈ R^(d_u×d_u) and w_i ∈ R^(d_i×d_i) represent the weight matrices, and b^F depicts the bias. Inspired by [17], [41], we use intuitive connection operations to integrate user and item features. z = [u_u, v_i^F], (19) where z denotes the user-item features, [,] combines the two features by concatenating the hidden layer.”) [Examiner’s Note: Liu’s framework provides a dynamic interaction component that exploits the features extracted by the feature learning component to achieve feature fusion and rating prediction. Under the broadest reasonable interpretation (BRI) of the claim limitation, Equation 17 combines the user review latent representation (s_u^ff, which is the filtered version of s_u^f from the early fusion) with the preference feature vector (p_u) to determine the fused feature representation (u_u). Furthermore, the gating mechanism of Equation 18 constitutes the claimed “cross-modal attention weights,” as it is computed from both the user and item modalities and applies selective weighting.]
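The filter gated layer (Eq. 18) and the concatenation step (Eq. 19) quoted above can be sketched as follows; dimensions, weights, and activation choice are hypothetical assumptions, not Liu's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sketch: the user vector modulates which item features pass
# through (Eq. 18 style), and the surviving features are concatenated with
# the user vector (Eq. 19 style). All values are hypothetical.
d = 8
rng = np.random.default_rng(2)
u_u = rng.normal(size=d)                  # review-enhanced user vector
v_i = rng.normal(size=d)                  # review-enhanced item vector
W_u = rng.normal(size=(d, d))
W_i = rng.normal(size=(d, d))
b_F = np.zeros(d)

gate = sigmoid(u_u @ W_u + v_i @ W_i + b_F)  # per-dimension filter in (0, 1)
v_i_F = v_i * gate                           # item features after filtering
z = np.concatenate([u_u, v_i_F])             # fused user-item feature vector
assert z.shape == (2 * d,)
```

The element-wise product with a user-conditioned gate is what lets a single item vector be weighted differently for different users.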
applying an output branch of the recommendation model that generates a set of likelihood scores for a set of candidate content items based on the set of fused feature representations, (Liu, [P. 3, Col. 2, Section: 3] “Output: The whole training process can be expressed by function: f_u: U, X, I, S → R̂. The output of the model is prediction rating R̂. That is, for any user u, we can obtain the prediction rating r̂_(u,i) based on the function f_u: v_u^U, v_i^I, x_u, s_i → r̂_(u,i).” [P.4, Col. 1, Section: 4] “NFM derives the final prediction score through high-order nonlinear interaction of features.” [P.6, Section: 4.2] “The dynamic interaction component exploits the features extracted by the feature learning component to achieve feature fusion and rating prediction. The dynamic interaction component is presented in Fig. 1 and is formed by the following key layers: (1) Fusion gated layer: Integrating rating and review features dynamically. (2) Filter gated layer: Adjusting the weights of the fused item features adaptively. Then, NFM derives the rating prediction. … we use NFM to capture the high-order nonlinear interaction of features. The objective function can be expressed as follows: …, The final predictive objective function is as follows: …, where r̂_(u,i)(z) denotes the prediction score…” [Section: 4.3] “4.3 Training To learn the parameters of the AHAG model, we exploit the regression with squared loss as the objective function: ... where R is the user-item rating matrix, r_(u,i) is the real rating of user u for item i, r̂_(u,i) is the prediction rating, and ... denotes all the parameters. ... is used as a regularization to prevent the model from overfitting. The entire framework can be effectively trained by using end-to-end paradigm reverse propagation. Algorithm 1 illustrates the training process of the AHAG model.”) the set of likelihood scores indicating how likely a user will interact with each of the set of candidate content items, (Liu, [P. 3, Section: 2.3] “This innovation also allows the model to distinguish user preferences for different item features. Then, the NFM is employed for modeling high-order nonlinear interactions between features” [P. 11, Section: 5.4.1] “It shows that combine of the high-order attention layer and the co-attention layer, which can highlight important features and capture more accurate user and item features to improve recommendation performance. …, This shows that the dynamic interaction of user-item feature pairs can better capture the item features correlated to user preferences.” [P. 13, Section: 5.4.4] “The observed correlations between the heat map of the user review text and the heat map of item review text reveal that the proposed AHAG model can effectively capture relevant semantic information in the reviews of the user-item pair.”) [Examiner’s Note: The final processing stage provides prediction ratings, which represent the likelihood of user interaction based on the fused feature representation (z).]
As outlined above, Liu teaches an adaptive multi-head attention framework for item recommendation (AHAG). The model includes an output branch (NFM). However, Liu does not appear to explicitly teach:
wherein the trainable parameters of the recommendation model are jointly trained with an additional output branch separate from the output branch to reduce popularity bias using a set of training data of user-item interactions.
However, Liu in view of Ameur teaches the following:
wherein the trainable parameters of the recommendation model are jointly trained with an additional output branch separate from the output branch to reduce popularity bias using a set of training data of user-item interactions. (Ameur, [Pp. 1-2, Section: I] “A joint autoencoders model was introduced to learn a fused representation of users from their like and comment views. This model is also used to combine the user and post representations to embed a user behavior. 3) we constructed a large dataset from Facebook to train and evaluate the proposed model. Our experimental results proved that the proposed model achieved better results than baseline models.” [Pp. 4-5, Section: III] Fig. 2. The deep joint auto-encoders “DeepJAE” network. “… we performed the auto-encoder network with two disjoint inputs and outputs (one for each view), with separable hidden layers, as illustrated in Fig. 2. In other terms, the two views are available in the input and the both are reconstructed. Thus, this network includes a one fully connected hidden layer in common that interacts with both views in order to learn a joint representation. Indeed, the middle hidden layer activation is used as a bi-view embedding representation “fused representation”. Fig. 2 shows the deep joint auto-encoders “DeepJAE” topology that is equivalent to using two separate deep autoencoders and tying them in one hidden layer. Each deep autoencoder tries to reconstruct its input by following multiple encoding and decoding steps. As shown in Fig. 2 the encoder part in the joint auto-encoders consists of the layers (L_x1, L_y1, L_x2, L_y2 and L_z) while the decoder part consists of the layers (L_z, L_rx1, L_ry1, L_rx and L_ry). …, Next, h_z will be decoded into two disjoint representations h_rx1 and h_ry1 with the same size of the h_x1 and h_y1 representations (see equations 7 and 8), in order to reconstruct h_x1 and h_y1. Finally, h_rx1 and h_ry1 representations are also decoded into r_x and r_y to reconstruct both view representations x and y by computing equations 9 and 10. …, Training the joint auto-encoders is achieved by reducing the distance between the original data (input vectors x and y) and its reconstruction (output vectors r_x and r_y). …” [P. 5, Section: C] “To achieve the classification step, we proposed to represent the user behavior toward a candidate post giving the pair (v_u, v_p). For this reason, we applied the joint auto-encoders “JAE” model in its simplest form (without any separated hidden layers) to learn the behavior embedding representation, as shown in Fig. 1 (top right).”) [Examiner’s Note: Under the BRI, the recitation of “to reduce popularity bias” merely represents the intended result of the joint training, not a structural limitation requiring any specific bias-reduction mechanism. Ameur teaches a deep joint auto-encoders model that fuses the users’ like and comment information. The DeepJAE includes “two disjoint inputs and outputs.” The model is jointly trained with additional output branches using user-post interactions.]
Accordingly, at the effective filing date, it would have been prima facie obvious to one of ordinary skill in the art of machine learning to modify the AHAG network of Liu to incorporate the joint auto-encoders model taught in Ameur to jointly train a neural network framework for predicting the user’s behavior toward a given content item. One would have been motivated to make such a combination in order to achieve better results than alternative baseline methods. Doing so would improve the results of the recommendation system (Ameur [Pp. 7-8, Section C]).
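The joint training of two output branches over one shared hidden representation, as described in the Ameur mapping above, can be sketched as follows; the architecture, dimensions, and loss weighting are illustrative assumptions in the spirit of DeepJAE, not Ameur's exact model:

```python
import numpy as np

# Minimal sketch of joint training with two output branches sharing one
# hidden representation. Two views (x and y) are encoded into a common
# layer h_z and each is reconstructed by its own separate decoder branch.
rng = np.random.default_rng(3)
d_x, d_y, d_z = 6, 5, 4
x = rng.normal(size=d_x)                  # view 1 (e.g., likes)
y = rng.normal(size=d_y)                  # view 2 (e.g., comments)
W_xz = rng.normal(size=(d_z, d_x))
W_yz = rng.normal(size=(d_z, d_y))
W_zx = rng.normal(size=(d_x, d_z))
W_zy = rng.normal(size=(d_y, d_z))

h_z = np.tanh(W_xz @ x + W_yz @ y)        # shared fused representation
r_x = W_zx @ h_z                          # output branch 1 reconstruction
r_y = W_zy @ h_z                          # output branch 2 reconstruction

# Joint objective: sum of per-branch reconstruction errors, so gradients
# from both output branches update the shared encoder parameters together.
loss = np.sum((x - r_x) ** 2) + np.sum((y - r_y) ** 2)
assert loss >= 0.0
```

Summing the two branch losses is what makes the training "joint": neither branch can be optimized without moving the shared representation.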
Regarding Original Claim 15,
Liu in view of Ameur teaches the elements of claim 14 as outlined above, and further teaches:
wherein the review encoder further comprises a word attention module that assigns attention weights to each word embedding in a review, the word attention module generating a review summarization feature vector for each review. (Liu, [Pp. 2-3, Section: 2.2] “Attention-Based Recommendation: … it is used to recognize the important words from the reviews to enhance recommendation accuracy. … employed dual attention layers with CNNs to select the words to highlight users’ preferences and items’ properties and visualized those informative words to interpret the results. However, the CNN-based attention neglect the long-term dependency of review sequence, which may lose related semantic information. NARRE [5] adopted the attention to adjust the weight of reviews in parallel neural networks.” [Pp. 4-5, Section: 4.1.1] “Position self-attention layer. There is often a complex dependency between words in reviews. Inspired by [26], [46], this paper uses position self-attention to learn the long-term dependencies between words in the reviews. Given a user u review information set x_u, which consists of l words and can express full meaning. First, word vector model Global Vectors (GloVe) [47] is used to map words in x_u into a word vector matrix F ∈ R^(d×l), the order of words is preserved in matrix F. …, To achieve the filtering of important features, the second convolution is performed. Then the attention weight z_i^(l-att-2) of the word in position i is obtained. The bigger the z_i^(l-att-2), the more important the word is, and vice versa. Therefore, the word in position i is expressed as: (see Equation (2)) where ŵ_i denotes the word at position i, z_i^(l-att-2) indicates the attention weight of the second convolution, and c_i is the word vector representation with different local attention weights. …, The final contextual feature h_u with user-item correlation is then obtained.”) [Note: The Review-Based Feature Learning component includes attention layers for word embeddings from reviews. Word importance is determined using weights.]
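Word-level attention over a review, as mapped above for claim 15, can be sketched as follows; the scoring function and all weights are hypothetical assumptions, not Liu's convolutional attention:

```python
import numpy as np

# Illustrative sketch of a word attention module: each column of the word
# vector matrix F (one column per word, order preserved) receives a scalar
# attention weight, and the weighted sum yields a review summarization
# vector. GloVe embeddings and weights are mocked with random values.
l, d = 5, 8                       # review length, word embedding dimension
rng = np.random.default_rng(6)
F = rng.normal(size=(d, l))       # word vector matrix, word order preserved
w_att = rng.normal(size=d)        # attention scoring vector (assumption)

scores = w_att @ F                           # one score per word position
z = np.exp(scores) / np.exp(scores).sum()    # attention weight per word
review_vec = F @ z                           # review summarization vector
assert review_vec.shape == (d,)
```

A word with a larger weight z_i contributes proportionally more to the summarization vector, which is the sense in which the module "highlights" important words.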
Claim(s) 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Ameur as outlined above, and further in view of Liu-2 et al., (NPL: "User Diverse Preference Modeling by Multimodal Attentive Metric Learning." (2019)).
Regarding Amended Claim 17,
Liu in view of Ameur teaches the elements of claim 14 as outlined above, and further teaches:
Liu teaches the dynamic interaction component, which includes the fusion gated layer to dynamically fuse the review features and rating features. Liu further highlights MPCN, a multi-pointer co-attention network that exploits the pointer-based co-attention mechanism to extract important information from user and item reviews. A multi-pointer learning scheme is then used to combine multiple views of user-item interactions for item recommendation.
However, Liu in view of Ameur does not appear to explicitly suggest:
generating a set of concatenated feature vectors by concatenating a set of review representations with a set of preference representations.
However, Liu-2, in combination with Liu and Ameur, teaches the limitation:
generating a set of concatenated feature vectors by concatenating a set of review representations with a set of preference representations. (Liu-2, Figure 1: Overview of our MAML model. [Pp. 3-5, Sections: 3.1-3.2] “3.2.1 Overview. In the aforementioned methods, they all utilize a fixed vector p_u to represent a user u’s preference in the feature space. In those models which map users and items into a joint latent space for similarity estimation, they all assume that each dimension in the space stands for a type of feature or an aspect of the items. …. In light of this, we propose a multimodal attentive metric learning (MAML) model. For each user-item (u,i) pair, our model computes a weight vector a_(u,i) ∈ R^f to indicate the importance of i’s aspects for u. In addition, the side information of items is exploited to estimate the weight vector, as side information conveys rich features of items, especially text reviews and item images, which are well-recognized to provide notable and complementary features of items in different aspects [9, 54]. We adopt the recent advancement of attention mechanism [6, 10] to estimate the attention vector. 3.2.2 Attention Mechanism. In this section, we introduce the attention mechanism in MAML for capturing a user u’s specific attention a_(u,i) of an item i. Since text reviews and images contain rich information about user preference and item characteristic, they are used to capture u’s attention on the various aspects of i. A two-layer neural network is used to compute the attention vector: e_(u,i) = Tanh(W_1 [p_u; q_i; F_(tv,i)] + b_1), (4) â_(u,i) = v^T ReLU(W_2 e_(u,i) + b_2), (5) where W_1, W_2 and b_1, b_2 are respectively the weight matrices and bias vectors of the two layers. v is a vector that projects the hidden layer into an output attention weight vector. F_(tv,i) is the item feature vector which is a fusion of i’s textual feature and image feature (described later). [p_u; q_i; F_(tv,i)] denotes the concatenation of p_u, q_i, and F_(tv,i). Tanh and ReLU [29, 30, 36] are used as the activation functions for the first and second layer, respectively.”) [Examiner’s Note: Liu-2 generates a concatenated feature vector (i.e., [p_u; q_i; F_(tv,i)]) by concatenating the preference representation from user-item interaction (i.e., p_u) with the review representation from item reviews (i.e., textual and visual reviews F_(tv,i)).]
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, having the combination of Liu and Ameur before them, to incorporate the Multimodal Attentive Metric Learning (MAML) method to model users’ diverse preferences for various items as taught by Liu-2. One would have been motivated to make such a combination in order to enhance the recommendation accuracy by considering multisource item information. Doing so, the recommendation performance can be greatly improved with the exploitation of additional item features, as has been demonstrated in many previous studies (Liu-2 [Section: 4.2]).
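The two-layer attention network over the concatenated vector in Liu-2's Equations (4)-(5), including the subsequent softmax normalization, can be sketched as follows; dimensions, initializations, and treating v as a projection matrix are assumptions for illustration:

```python
import numpy as np

# Illustrative sketch of Eqs. (4)-(5): the concatenation [p_u; q_i; F_tv_i]
# passes through a Tanh layer, then a ReLU layer projected by v, and the
# result is softmax-normalized into attention weights. Values are mocked.
f, d_h = 8, 16
rng = np.random.default_rng(4)
p_u = rng.normal(size=f)            # user preference representation
q_i = rng.normal(size=f)            # item latent representation
F_tvi = rng.normal(size=f)          # fused textual/visual item features
W1, b1 = rng.normal(size=(d_h, 3 * f)), np.zeros(d_h)
W2, b2 = rng.normal(size=(d_h, d_h)), np.zeros(d_h)
v = rng.normal(size=(d_h, f))       # projection to f attention weights

concat = np.concatenate([p_u, q_i, F_tvi])      # concatenated feature vector
e_ui = np.tanh(W1 @ concat + b1)                # Eq. (4)
a_hat = v.T @ np.maximum(W2 @ e_ui + b2, 0.0)   # Eq. (5)
a_ui = np.exp(a_hat) / np.sum(np.exp(a_hat))    # softmax normalization
assert a_ui.shape == (f,)
```

The softmax step turns the raw scores into a probabilistic distribution over the f aspects, matching the normalization noted in the claim 18 mapping below.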
Regarding Amended Claim 18, the combination of Liu, Ameur, and Liu-2 teaches the elements of claim 17 as outlined above, and further teaches:
generating review attention weights for the early fusion stage by inputting the concatenated feature vectors into a review attention module. (Liu-2, Figure 1: Overview of our MAML model. [Abstract] “In particular, for each user-item pair, we propose an attention neural network, which exploits the item’s multimodal features to estimate the user’s special attention to different aspects of this item.” [Pp. 3-5, Sections: 3.1-3.2] “3.2.1 Overview. In the aforementioned methods, they all utilize a fixed vector p_u to represent a user u’s preference in the feature space. In those models which map users and items into a joint latent space for similarity estimation, they all assume that each dimension in the space stands for a type of feature or an aspect of the items. …. In light of this, we propose a multimodal attentive metric learning (MAML) model. For each user-item (u,i) pair, our model computes a weight vector a_(u,i) ∈ R^f to indicate the importance of i’s aspects for u. In addition, the side information of items is exploited to estimate the weight vector, as side information conveys rich features of items, especially text reviews and item images, which are well-recognized to provide notable and complementary features of items in different aspects [9, 54]. We adopt the recent advancement of attention mechanism [6, 10] to estimate the attention vector. 3.2.2 Attention Mechanism. In this section, we introduce the attention mechanism in MAML for capturing a user u’s specific attention a_(u,i) of an item i. Since text reviews and images contain rich information about user preference and item characteristic, they are used to capture u’s attention on the various aspects of i. A two-layer neural network is used to compute the attention vector: e_(u,i) = Tanh(W_1 [p_u; q_i; F_(tv,i)] + b_1), (4) â_(u,i) = v^T ReLU(W_2 e_(u,i) + b_2), (5) where W_1, W_2 and b_1, b_2 are respectively the weight matrices and bias vectors of the two layers. v is a vector that projects the hidden layer into an output attention weight vector. F_(tv,i) is the item feature vector which is a fusion of i’s textual feature and image feature (described later). [p_u; q_i; F_(tv,i)] denotes the concatenation of p_u, q_i, and F_(tv,i). Tanh and ReLU [29, 30, 36] are used as the activation functions for the first and second layer, respectively.”) [Examiner’s Note: The Liu-2 MAML model inputs the concatenated feature vector ([p_u; q_i; F_(tv,i)]) into a two-layer attention neural network (Equations 4-5) that generates attention weights (e_(u,i) and â_(u,i)). Since the concatenated vector includes review features, the attention module processes review information and generates weights that determine the user’s special attention to different aspects. Following the standard procedure of neural attention networks, there is a subsequent step to normalize â_(u,i) with the softmax function, which converts the attention weights to a probabilistic distribution.]
Claim(s) 19 is rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Ameur as outlined above, further in view of Wu et al., (NPL: "Noise contrastive estimation for one-class collaborative filtering." (2019)).
Regarding Previously Presented Claim 19, Liu in view of Ameur teaches the elements of claim 14 as outlined above.
Liu in view of Ameur does not appear to explicitly teach:
wherein the additional output branch comprises a noise contrastive estimation decoder that reduces popularity bias by increasing a likelihood that the user will interact with the set of candidate content items based on the set of implicit user feedback data.
However, Wu, in combination with Liu and Ameur, teaches the limitations:
wherein the additional output branch comprises a noise contrastive estimation decoder that reduces popularity bias by increasing a likelihood that the user will interact with the set of candidate content items based on the set of implicit user feedback data. (Wu, [P. 3, Section: 3.1] “Noise Contrastive Estimation in Recommendation. Noise-Contrastive Estimation (NCE) [5] learns to discriminate between the observed data and some artificially generated noise. …, In implicit feedback recommendation tasks, we only explicitly observe positive observations for each user, which makes NCE an ideal tool to estimate user preferences without explicitly assuming unobserved interactions are negative samples as done in most OC-CF methods. The simple idea driving NCE is that it adversarially trains to maximize prediction probability of the observed user preferences while minimizing the prediction probability of negative samples drawn from a (usually) popularity biased noise distribution.” [P. 4, Section: 3.3] “3.3 An NCE Item Embedding Hyperparameter. The optimal solution of NCE as shown in Equation (8) penalizes the influence of popular items on the user and item representation.” [P. 3, Col. 2, Section: 3.1] “In the multi-user collaborative filtering setting, the full objective function ℓ corresponds to a summation over each independent user, where the item embeddings are shared by all users: … (Equation (6))” [P. 4, Col. 2, Section 3.4] “Using optimal user U* and item V* embeddings from Equation (10), we can predict unobserved interactions with a simple dot product U*V*^T. Hence, the simple method of NCE-SVD can be used as a recommendation algorithm by itself. …, Then, we maximize a user-item reweighted version of the PLRec objective as follows: (See Equation (13)).”)
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, having the combination of Liu and Ameur before them, to incorporate the Noise Contrastive Estimation for One-Class Collaborative Filtering as taught by Wu. One would have been motivated to make such a combination in order to enable a scalable and efficient recommendation system that reduces bias, handles cold-start scenarios without relying on side information, and maintains robust performance in almost all metrics (Wu [Conclusion]).
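The NCE idea mapped above (discriminating observed positives from negatives drawn from a popularity-biased noise distribution) can be sketched as follows; the embeddings, sampling scheme, and loss form are illustrative assumptions, not Wu's exact objective:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative NCE-style sketch: maximize the score of an observed positive
# item while pushing down items sampled from a popularity-biased noise
# distribution, so popular items are penalized as negatives more often.
rng = np.random.default_rng(5)
n_items, k = 20, 4
U = rng.normal(size=k)                    # one user's embedding
V = rng.normal(size=(n_items, k))         # shared item embeddings
popularity = rng.integers(1, 100, size=n_items).astype(float)
noise = popularity / popularity.sum()     # popularity-biased noise distribution

pos = 3                                      # an observed positive item
negs = rng.choice(n_items, size=5, p=noise)  # popular items drawn more often

# NCE-style loss: discriminate the observed item from the sampled noise items.
loss = -np.log(sigmoid(U @ V[pos])) - np.sum(np.log(sigmoid(-V[negs] @ U)))
assert np.isfinite(loss)
```

Because negatives are drawn proportionally to popularity, frequently interacted items absorb more negative gradient, which is the mechanism by which this branch counteracts popularity bias.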
Claim(s) 20 is rejected under 35 U.S.C. 103 as being unpatentable over Liu in view of Ameur as outlined above, and further in view of Ni et al., (Pub. No.: US 20210065278 A1).
Regarding Original Claim 20, Liu in view of Ameur teaches the elements of claim 16 as outlined above, and further teaches:
As described above, Liu teaches a network framework which includes encoding attention layers for learning the features of review information based on hierarchical attention (i.e., a review encoder). Liu in view of Ameur does not appear to explicitly teach:
wherein the review encoder comprises one or more bi-directional LSTM (long short-term memory) neural networks.
However, Ni, in combination with Liu and Ameur, teaches the limitation:
wherein the review encoder comprises one or more bi-directional LSTM (long short-term) neural networks. (Ni, [0057] “Any suitable neural network technique can be used to perform the encoding at block 220 in accordance with the embodiments described herein. Examples of suitable neural network techniques include, but are not limited to, Bidirectional Long Short-term Memory (BiLSTM), Convolutional Neural Network (CNN), Bidirectional Encoder Representations from Transformers (BERT), etc. Further details regarding block 220 are described above with reference to FIG. 1 and will now be described below with reference to FIG. 3.” Further see [0027].)
Accordingly, it would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention, having the combination of Liu and Ameur before them, to incorporate the system/method for implementing a recommendation system using an asymmetrically hierarchical network (AHN) as taught by Ni. One would have been motivated to make such a combination in order to incorporate a gating mechanism within multiple attention weight modules to improve model performance. Doing so would enable the dynamic and hierarchical construction of effective user and item embeddings based on the most relevant information, thereby improving personalized recommendation accuracy and model interpretability (Ni [0016]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
NPL: Tay, Yi, et al. "Multi-Pointer Co-Attention Networks for Recommendation." (2018).
NPL: Chen, Chong, et al. "Neural attentional rating regression with review-level explanations." (2018).
NPL: Wu, Libing, et al. "A context-aware user-item representation learning for item recommendation." (2019).
NPL: Gadzicki, Konrad. "Early vs late fusion in multimodal convolutional neural networks." (2020).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SADIK ALSHAHARI whose telephone number is (703)756-4749. The examiner can normally be reached Monday - Friday, 9 a.m. - 6 p.m. ET.
Examiner interviews are available via telephone, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li Zhen can be reached on (571) 272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/S.A.A./Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121