End-to-End Image-Based Fashion
Recommendation
Shereen Elsayed, Lukas Brinkmeyer and Lars Schmidt-Thieme
Abstract In fashion-based recommendation settings, incorporating the item image
features is considered a crucial factor, and it has shown significant improvements
to many traditional models, including but not limited to matrix factorization, auto-
encoders, and nearest neighbor models. While there are numerous image-based
recommender approaches that utilize dedicated deep neural networks, comparisons
to attribute-aware models are often disregarded despite their ability to be easily
extended to leverage items’ image features. In this paper, we propose a simple yet
effective attribute-aware model that incorporates image features for better item rep-
resentation learning in item recommendation tasks. The proposed model utilizes
items’ image features extracted by a calibrated ResNet50 component. We present
an ablation study to compare incorporating the image features using three different
techniques into the recommender system component that can seamlessly leverage
any available items’ attributes. Experiments on two image-based real-world recom-
mender systems datasets show that the proposed model significantly outperforms all
state-of-the-art image-based models.
Shereen Elsayed
Information Systems and Machine Learning Lab, University of Hildesheim, Germany e-mail:
[email protected]
Lukas Brinkmeyer
Information Systems and Machine Learning Lab, University of Hildesheim, Germany e-mail:
[email protected]
Lars Schmidt-Thieme
Information Systems and Machine Learning Lab, University of Hildesheim, Germany e-mail:
[email protected]
1 Introduction
In recent years, recommender systems have become one of the essential areas in
machine learning. They affect our daily lives in one way or another, as they are
present on almost all current websites, including social media, shopping, and
entertainment platforms. Buying clothes online has become a widespread trend, and
websites such as Zalando and Amazon are growing every day. In this regard, item
images play a crucial role: no one wants to buy a new shirt without seeing how it
looks. Many users also share similar tastes when selecting what to buy; for example,
some prefer dark colors most of the time, while others like sportswear. In fashion
recommendation, adding item images to the model has been shown to yield a significant
lift in recommendation performance: when the model is trained not only on the
relational data describing the interactions between users and items but also on how
the items look, it can much more easily recommend other items that look similar or
compatible. Several recent works have tackled this area in the last few years,
adding item images to the model with different techniques.
Following this research direction, we propose a hybrid attribute-aware model that
relies on adding the items’ image features into the recommendation model.
The contributions of this work can be summarized as follows:
• We propose a simple image-aware model for item recommendation that can
leverage items’ image features extracted by a fine-tuned ResNet50 component
[3].
• We conduct extensive experiments on two benchmark datasets; the results show
that the proposed model outperforms more complex state-of-the-art methods.
• We conduct an ablation study to analyze and compare three different approaches
for including the items’ image features extracted by the ResNet50 into the
recommender model.
2 Related work
Many image-based recommender systems have been proposed in the last few years;
they are becoming increasingly popular, and their applications are getting wider,
especially on fashion e-commerce websites. Many of the models proposed in the
literature rely on pre-trained networks for extracting item image features. In
2016, the influential VBPR model [4] was proposed, which uses the BPR ranking
model [13] for prediction and takes visual item features into account: a
pre-trained CNN extracts the item features, which then pass through a fully
connected layer to obtain the latent embedding. Another model that uses more than
one type of external information is JRL [15], which incorporates three different
information sources (user reviews, item images, and ratings) using
a pre-trained CaffeNet and a PV-DBOW model [8]. In 2017, Liu et al. [9] proposed
the DeepStyle model, whose primary assumption is that the item representation
consists of a style representation and an item category representation; the item
image features are extracted with a CaffeNet model, and the style features are
obtained by subtracting latent factors representing the category. Additionally, a
region-guided approach (SAERS) [5] introduced items’ visual features by using
AlexNet to extract global features and a ResNet-50 architecture to extract
semantic features representing regions of interest in the item images. Before the
semantic features are added to the global features, an attention mechanism based
on the users’ preferences is applied; the final item embedding combines the
semantic and global features.
The image networks used for item image feature extraction can also be trained
end-to-end with the recommender model. The most popular model applying this
technique is DVBPR [6], a powerful model proposed in 2017 that incorporates visual
item features. It performs two tasks: the first trains a BPR recommender model
jointly with a CNN that extracts the items’ pixel-level image features; the second
uses Generative Adversarial Networks (GANs) to generate new item images based on
user preferences.
Attribute-aware recommender system models are a family of hybrid models that can
incorporate external user and item attributes. In principle, some of these models
can be extended to image-based settings by converting the raw image features into
real-valued latent features that serve as the item’s attributes. Recently, the
CoNCARS model [1] was proposed, which takes the user and item one-hot encoded
vectors as well as the user/item timestamps; the model utilizes a convolutional
neural network (CNN) on top of the interaction matrix to generate the latent
embeddings. In parallel work, Rashed et al. proposed an attribute-aware model
(GraphRec) [11] that appends all user and item attributes to the one-hot encoded
vectors and extracts the embeddings through neural network layers that can capture
the non-linearity in the user-item relation.
In the literature, attribute-aware models have largely been set aside in
image-based item ranking problems. Hence, in this paper, we propose a simple
image-aware model that utilizes the latent image features as item attributes. The
experimental results show that the proposed model outperforms current, more
complex image-based state-of-the-art models.
3 Methodology
3.1 Problem Definition
In image-based item recommendation tasks, there exists a set of $M$ users
$\mathcal{U} := \{u_1, \dots, u_M\}$, a set of $N$ items
$\mathcal{I} := \{i_1, \dots, i_N\}$ with their images
$X \in \mathbb{R}^{N \times (L \times H \times C)}$ of dimensions $L \times H$
with $C$ channels, and a sparse binary interaction matrix
$R \in \mathbb{R}^{M \times N}$ that indicates users’ implicit preferences on
items based on historical interactions.
The primary goal of the recommendation task is to generate a ranked, personalized
shortlist of items for each user by estimating the missing likelihood scores in
$R$ while considering the visual information in the item images.
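For concreteness, the following is a minimal sketch of this data layout in
Python, with hypothetical toy sizes (the real dataset statistics are listed in
Table 1):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy sizes for illustration only.
M, N = 4, 6               # number of users and items
L, H, C = 224, 224, 3     # image dimensions expected by ResNet50

# Sparse binary interaction matrix R in {0,1}^(M x N).
rows = [0, 0, 1, 2, 3]
cols = [1, 4, 2, 2, 5]
R = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(M, N))

# One raw RGB image per item; X[i] is the image x_i of item i.
X = np.random.randint(0, 256, size=(N, L, H, C), dtype=np.uint8)
```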
3.2 Proposed model
The proposed model consists of an image features extraction component and an
attribute-aware recommender system component that are jointly optimized.
3.2.1 Recommender System Component
Inspired by the GraphRec model [11], the recommender system component utilizes
the user’s one-hot encoded input vector and concatenates the external features of
the items directly to the items’ one-hot input vectors. These vectors are then fed to
their independent embedding functions 𝜓𝑢 : R 𝑀 → R𝐾 and 𝜓𝑖 : R ( 𝑁 +𝐹 ) → R𝐾 as
follows:
𝑧 𝑢 = 𝜓𝑢 (𝑣𝑢 ) = 𝑣𝑢 𝑊 𝜓𝑢 + 𝑏 𝜓𝑢 (1)
𝑧𝑖 = 𝜓𝑖 (𝑣𝑖 ) = 𝑐𝑜𝑛𝑐𝑎𝑡 (𝑣𝑖 , 𝜙(𝑥𝑖 ))𝑊 𝜓𝑖 + 𝑏 𝜓𝑖 (2)
where 𝑊 𝜓𝑢 , 𝑊 𝜓𝑖 are the weight matrices of the embedding functions, and 𝑏 𝜓𝑢 , 𝑏 𝜓𝑖
are the bias vectors. 𝑣𝑢 , 𝑣𝑖 represents the user and item one-hot encoded vectors.
Additionally, 𝜙(𝑥𝑖 ) represents the features extraction component that embeds an
item’s raw image 𝑥𝑖 to a latent feature vector of size 𝐹.
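A minimal Keras sketch of Equations (1)–(2) and the scoring step follows; the
dimensions are illustrative, and the image features $\phi(x_i)$ are assumed here
to be supplied by the extraction component of Section 3.2.2:

```python
import tensorflow as tf

M, N = 4, 6     # toy user/item counts, as in the sketch above
K, F = 20, 150  # embedding and image-feature sizes (values from Sect. 4.6)

v_u = tf.keras.Input(shape=(M,), name="user_onehot")
v_i = tf.keras.Input(shape=(N,), name="item_onehot")
phi_x = tf.keras.Input(shape=(F,), name="image_features")  # phi(x_i)

# Eq. (1): a linear layer on the one-hot user vector, z_u = v_u W + b.
z_u = tf.keras.layers.Dense(K, name="psi_u")(v_u)

# Eq. (2): concatenate item one-hot with image features, then a linear layer.
item_in = tf.keras.layers.Concatenate()([v_i, phi_x])
z_i = tf.keras.layers.Dense(K, name="psi_i")(item_in)

# Dot-product score squashed to (0, 1) by a sigmoid (Sect. 3.2.1).
score = tf.keras.layers.Dot(axes=1)([z_u, z_i])
y_hat = tf.keras.layers.Activation("sigmoid")(score)

model = tf.keras.Model([v_u, v_i, phi_x], y_hat)
```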
After obtaining the user and item embeddings, the matching score is calculated as
the dot product of the two embedding vectors, $\hat{y}_{ui} = z_u \cdot z_i$,
indicating how much user $u$ will tend to like item $i$. The final score is
computed via a sigmoid function, $\sigma(\hat{y}_{ui}) = 1/(1 + e^{-\hat{y}_{ui}})$,
to limit the score to the range $[0, 1]$. The model target $y_{ui}$ is the implicit
feedback, which is either 0 or 1:

$$y_{ui} = \begin{cases} 1, & \text{observed item;} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
Given the positive and negative user-item interactions $D_s^+$ and $D_s^-$, the
output score $\hat{y}_{ui}$ of the model, and the original target $y_{ui}$, we use
negative sampling to generate unobserved instances and optimize the negative
log-likelihood objective function $\ell(\hat{y}; D_s)$ using the ADAM optimizer,
which can be defined as:

$$\ell(\hat{y}; D_s) = - \sum_{(u,i) \in D_s^+ \cup D_s^-} \big[\, y_{ui} \log(\hat{y}_{ui}) + (1 - y_{ui}) \log(1 - \hat{y}_{ui}) \,\big] \qquad (4)$$
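The sketch below illustrates Equation (4) together with uniform negative
sampling; the helper names are ours, and in practice the same objective is
available as a standard binary cross-entropy loss in most deep learning
frameworks:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_negatives(u, R, n_neg):
    """Uniformly draw items user u has not interacted with (D_s^- instances)."""
    neg = set()
    while len(neg) < n_neg:
        j = int(rng.integers(R.shape[1]))
        if R[u, j] == 0:
            neg.add(j)
    return sorted(neg)

def nll(y_true, y_pred, eps=1e-8):
    """Negative log-likelihood of Eq. (4) over D_s^+ union D_s^-."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.sum(y_true * np.log(y_pred)
                         + (1.0 - y_true) * np.log(1.0 - y_pred)))
```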
3.2.2 Extraction of Image Features
To extract the items’ latent image features, we employ a ResNet50 component. To
refine the image features further and obtain a better representation, the whole
image network could be trained jointly with the recommender model; however, this
would require a considerable amount of memory and computational power to load and
update all parameters, as ResNet50 consists of 176 layers with around 23 million
parameters. To mitigate this problem, we propose ImgRec End-to-End (ImgRec-EtE),
where we utilize a ResNet50 [3] pre-trained on the ImageNet dataset [2] and
jointly train only part of the image network with the recommender model, while
still benefiting from the pre-trained weights. As shown in Figure 1, we update
the last 50 layers and fix the first 126 layers. Furthermore, we add a separate
fully connected layer to fine-tune the image features extracted by the ResNet50;
this layer is trained simultaneously with the recommender model. It also makes
the image features more compact, decreasing their dimensionality to match the
user latent embedding. Thus, the feature extraction function $\phi(x_i)$ for
ImgRec-EtE is defined as follows:
$$\phi(x_i) := \mathrm{ReLU}(\mathrm{ResNet50}(x_i)\, W_\phi + b_\phi) \qquad (5)$$
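A possible Keras realization of Equation (5) is sketched below; note that the
exact layer count of ResNet50 differs slightly between implementations, so the
126/50 split is indicative rather than exact:

```python
import tensorflow as tf

F = 150  # size of the fine-tuning layer (Sect. 4.6)

base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(224, 224, 3))

# Freeze the early layers; the paper fixes the first 126 of 176 layers.
for layer in base.layers[:126]:
    layer.trainable = False

# Eq. (5): phi(x_i) = ReLU(ResNet50(x_i) W_phi + b_phi).
x = tf.keras.Input(shape=(224, 224, 3))
h = base(x)                                           # pooled ResNet50 features
phi = tf.keras.layers.Dense(F, activation="relu")(h)  # fine-tuning layer
image_tower = tf.keras.Model(x, phi, name="phi")
```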
3.3 Training Strategy
To speed up the training process, we use a two-stage training protocol, sketched
below. First, we train the model with the pre-trained image network parameters
fixed, updating only the recommender model parameters until convergence. After
obtaining the best model performance in this first stage, we jointly fine-tune
the last 50 layers of the image network together with the recommender model. This
methodology allows us to fine-tune the model after reaching the best performance
attainable with fixed pre-trained image network weights; it also saves time and
computational power while achieving superior prediction performance compared to
using only the fixed pre-trained parameters of the image network.
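A sketch of this protocol, reusing `base` from the previous sketch; `model`,
`train_ds`, and `val_ds` are hypothetical placeholders for the assembled
recommender and the data pipelines:

```python
import tensorflow as tf

# Stage 1: keep all pre-trained ResNet50 weights fixed; train the recommender
# embeddings (and the fine-tuning layer) until convergence.
base.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy")
model.fit(train_ds, validation_data=val_ds, epochs=100,
          callbacks=[tf.keras.callbacks.EarlyStopping(
              monitor="val_loss", patience=5, restore_best_weights=True)])

# Stage 2: unfreeze only the last 50 ResNet50 layers and fine-tune jointly.
base.trainable = True
for layer in base.layers[:-50]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),  # recompile after change
              loss="binary_crossentropy")
model.fit(train_ds, validation_data=val_ds, epochs=20)
```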
4 Experiments
Through the experimental section, we aim to answer the following research questions:
RQ1 How does the proposed model fare against state-of-the-art image-based models?
RQ2 What is the best method for adding image features to the model?
[Figure: items’ images pass through a pre-trained ResNet50 (126 non-trainable
layers, 50 trainable layers) and an item fully connected layer; the resulting
image features are concatenated with the item one-hot vector, and the user and
item embeddings are combined via a dot product.]
Fig. 1 End-to-End ImgRec Architecture
Table 1 Datasets statistics

Dataset          Users    Items     Categories  Interactions
Amazon Fashion   45,184   166,270   6           267,635
Amazon Men       34,244   110,636   50          186,382
4.1 Datasets
We chose two widely used image-based recommendation datasets, Amazon Fashion and
Amazon Men, introduced by McAuley et al. [10]. Amazon Fashion was collected from
six different categories, "men’s/women’s tops, bottoms, and shoes," while Amazon
Men contains all subcategories (gloves, scarves, sunglasses, etc.). The user and
item counts we report for the Fashion dataset differ from those stated in the
original DVBPR paper; however, we contacted the authors¹ and the numbers in
Table 1 were confirmed to be the correct ones.
¹ https://github.com/kang205/DVBPR/issues/6
Table 2 Comparison of AUC scores with 100 negative samples per user. Bold marks
the best-performing model; the second-best result is underlined.

                 Interactions                      Interactions + image features
Dataset          PopRank  WARP    BPR-MF  FM       VisRank  VBPR    DVBPR   ImgRec-EtE
Amazon Fashion   0.5849   0.6065  0.6278  0.7093   0.6839   0.7479  0.7964  0.8250
Amazon Men       0.6060   0.6081  0.6450  0.6654   0.6589   0.7089  0.7410  0.7899
Table 3 Comparison of AUC scores with 500 negative samples per user.

                 Interactions        Interactions + image features
Dataset          PopRank  BPR-MF     VBPR    DeepStyle  JRL     SAERS   ImgRec-EtE
Amazon Fashion   0.5910   0.6300     0.7710  0.7600     0.7710  0.8161  0.8250
4.2 Evaluation Protocol
To evaluate our proposed models, the data is split into training, validation, and
test sets using the leave-one-out evaluation protocol, as in [6, 5, 4]. However,
we use two different numbers of negative samples in the evaluation protocol for
direct comparison against the published results of the state-of-the-art models
DVBPR [6] and SAERS [5], because the two models originally used different numbers
of samples and the source code of SAERS is not available.
For the direct comparison against DVBPR, we sample 100 negative items
($\mathcal{I}^t$) and one positive item $i$ for each user, while for the direct
comparison against our second baseline, SAERS [5], we sample 500 negative items.
To ensure the consistency of our results, we report the mean over five different
trials for each experiment.
For evaluation, we report the Area Under the Curve (AUC), as it is the primary
metric in all of our baselines’ papers; furthermore, it is consistent with
respect to the number of negative samples used [7]:

$$AUC = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{|\mathcal{I}^t|} \sum_{j \in \mathcal{I}^t} \delta(\hat{y}_{ui} > \hat{y}_{uj}) \qquad (6)$$

where $i$ denotes the user’s held-out positive item and $\delta(\cdot)$ is the
indicator function.
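A small sketch of how this sampled AUC can be computed, with hypothetical scores:

```python
import numpy as np

def user_auc(pos_score, neg_scores):
    """Eq. (6) for one user: fraction of sampled negatives ranked below
    the held-out positive item."""
    return float(np.mean(pos_score > np.asarray(neg_scores)))

# Hypothetical toy example: one positive and 100 sampled negatives per user.
rng = np.random.default_rng(0)
per_user = [user_auc(rng.uniform(0.4, 1.0), rng.uniform(0.0, 1.0, size=100))
            for _ in range(1000)]
auc = float(np.mean(per_user))  # average over all users, as in Eq. (6)
```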
4.3 Baselines
We compared our proposed methods to the published results of the state-of-the-art
image-based models DVBPR and SAERS. We also compared our results against a
set of well-known item recommendation models that were used in [6, 5].
• PopRank: A naive popularity-based ranking model.
• WARP [14]: A matrix factorization model that uses Weighted Approximate-Rank
Pairwise (WARP) loss.
• BPR-MF [13]: A matrix factorization model that uses the BPR loss to get the
ranking of the items.
• VisRank: A content-based model that utilizes the similarity between CNN fea-
tures of the items bought by the user.
• Factorization Machines (FM) [12]: A generic method that combines the benefits
of both SVM and factorization techniques using pair-wise BPR loss.
• VBPR [4]: A model that utilizes items’ visual features, using a pre-trained
CaffeNet and a BPR ranking model.
• DeepStyle [9]: A model that uses the BPR framework and incorporates style fea-
tures extracted by subtracting category information from CaffeNet visual features
of the items.
• JRL [15]: A model that incorporates three different types of item attributes.
For comparison purposes, we considered only the visual features.
• DVBPR [6]: State-of-the-art image-based model that adds the visual features
extracted from a dedicated CNN network trained along with a BPR recommender
model.
• SAERS [5]: State-of-the-art image-based model that utilizes regions of interest
in the items’ images while also considering global features extracted by a
dedicated CNN to obtain the final item representations.
4.4 Comparative study against state-of-the-art image-based models
(RQ1)
Since the DVBPR baseline reported results on the Amazon Men and Amazon Fashion
datasets, we compare directly against its published results on both. The SAERS
model used only the Amazon Fashion dataset, so we report results with 500 negative
samples per user only on this dataset. Table 2 shows the results of ImgRec-EtE
against VBPR and DVBPR: the proposed model achieves the best performance on both
the Men and Fashion datasets, with a 2.5% improvement over the reported DVBPR
performance on the Fashion dataset and a 4.8% improvement on the Men dataset. The
results also show consistent AUC values regardless of the number of negative
samples, in line with the recent study by Krichene et al. [7]. Table 3 shows the
comparison against the DeepStyle, JRL, and SAERS models.
Table 4 Comparison of AUC scores with 100 negative samples per user for the three
ways of incorporating the image features.

Dataset          ImgRec-Dir  ImgRec-FT  ImgRec-EtE
Amazon Fashion   0.7770      0.8090     0.8250
Amazon Men       0.7363      0.7735     0.7899
The proposed ImgRec-EtE again achieves the best performance on the Fashion
dataset: despite its simplicity, it reaches an AUC of 0.825, a 0.9% improvement
over the complex state-of-the-art SAERS model.
4.5 Ablation Study (RQ2)
Besides obtaining the item features in an end-to-end fashion, we also evaluated
two other ways of incorporating the image features; all three variants are
sketched below. First, ImgRec-Dir directly concatenates the image features,
extracted from the output of the next-to-last fully connected layer of a
pre-trained ResNet50, to the one-hot encoded vector representing the item.
ImgRec-FT instead passes the features extracted by the pre-trained network
through a fine-tuning layer that is trained with the recommender model to obtain
a better item representation; the resulting latent image features are then
concatenated to the item one-hot encoded vector to form a single input vector
representing the item. As shown in Table 4, the image features had a varying
effect depending on how they were added to the model. ImgRec-Dir achieves an AUC
of 0.777 on the Amazon Fashion dataset and 0.736 on the Amazon Men dataset.
Adding the fine-tuning layer in ImgRec-FT improves the AUC by 3.2% on Amazon
Fashion and 3.7% on Amazon Men (cf. Table 4), which makes it highly competitive
with the state-of-the-art models at a much lower computational cost. Finally,
ImgRec-EtE, which jointly trains part of the ResNet50 with the model, further
improves over ImgRec-FT by 1.6% on both datasets.
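For illustration, the three feature pipelines can be sketched as follows (Keras;
layer sizes are taken from Sect. 4.6, and the builder helper is ours):

```python
import tensorflow as tf

def make_base():
    """Fresh ImageNet-pretrained ResNet50 feature extractor (pooling='avg')."""
    return tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", pooling="avg",
        input_shape=(224, 224, 3))

x = tf.keras.Input(shape=(224, 224, 3))

# ImgRec-Dir: frozen pre-trained features, concatenated to the item one-hot
# vector as-is.
dir_base = make_base()
dir_base.trainable = False
phi_dir = dir_base(x)

# ImgRec-FT: the same frozen features, refined by a trainable fine-tuning layer.
ft_base = make_base()
ft_base.trainable = False
phi_ft = tf.keras.layers.Dense(150, activation="relu")(ft_base(x))

# ImgRec-EtE: as ImgRec-FT, but with the last 50 ResNet50 layers unfrozen so the
# tail of the image network trains jointly with the recommender (Sect. 3.2.2).
ete_base = make_base()
for layer in ete_base.layers[:-50]:
    layer.trainable = False
phi_ete = tf.keras.layers.Dense(150, activation="relu")(ete_base(x))
```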
4.6 Hyperparameters
We ran our experiments on an RTX 2070 Super GPU and a Xeon Gold 6230 CPU with
256 GB of RAM. We used user and item embedding sizes of 10 and 20 with a linear
activation function for both datasets. We applied a grid search on the learning
rate in [0.00005, 0.0003] and the L2-regularization lambda in [0.000001, 0.2].
The best values are a learning rate of 0.0001 and a lambda of 0.1 for ImgRec-Dir
and ImgRec-FT, while for ImgRec-EtE the best L2-regularization lambda is 0.000001
for phase 1 (fixed weights) and 0.00005 for phase 2 (joint training). For the
feature fine-tuning layer, the best embedding size is 150 with a ReLU activation
function. The ImgRec code and datasets are available at
https://github.com/Shereen-Elsayed/ImgRec.
5 Conclusion
In this work, we propose an image-based attribute-aware model for personalized
item ranking that jointly trains a ResNet50 component with the recommender model
to incorporate image features. Adding the image features yields a significant
improvement in the model’s performance, and ImgRec-EtE shows superior performance
to all compared image-based recommendation approaches. Furthermore, we conducted
an ablation study comparing different ways of adding the features to the model:
direct feature concatenation, adding a fine-tuning fully connected layer, and
jointly training part of the image network.
6 Acknowledgements
This work is co-funded by the industry Project “IIP-Ecosphere: Next Level Ecosphere
for Intelligent Industrial Production”.
References
1. Costa, F. S. d., and Dolog, P. Collective embedding for neural context-aware recommender
systems. In Proceedings of the 13th ACM Conference on Recommender Systems (2019),
pp. 201–209.
2. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale
Hierarchical Image Database. In CVPR09 (2009).
3. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (2016),
pp. 770–778.
4. He, R., and McAuley, J. VBPR: Visual Bayesian personalized ranking from implicit
feedback. arXiv preprint arXiv:1510.01784 (2015).
5. Hou, M., Wu, L., Chen, E., Li, Z., Zheng, V. W., and Liu, Q. Explainable fashion recom-
mendation: A semantic attribute region guided approach. arXiv preprint arXiv:1905.12862
(2019).
6. Kang, W.-C., Fang, C., Wang, Z., and McAuley, J. Visually-aware fashion recommendation
and design with generative image models. In 2017 IEEE International Conference on Data
Mining (ICDM) (2017), IEEE, pp. 207–216.
7. Krichene, W., and Rendle, S. On sampled metrics for item recommendation. In Proceedings
of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
(2020), pp. 1748–1757.
8. Le, Q., and Mikolov, T. Distributed representations of sentences and documents. In Interna-
tional conference on machine learning (2014), pp. 1188–1196.
9. Liu, Q., Wu, S., and Wang, L. Deepstyle: Learning user preferences for visual recommen-
dation. In Proceedings of the 40th International ACM SIGIR Conference on Research and
Development in Information Retrieval (2017), pp. 841–844.
10. McAuley, J., Targett, C., Shi, Q., and Van Den Hengel, A. Image-based recommendations
on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on
research and development in information retrieval (2015), pp. 43–52.
11. Rashed, A., Grabocka, J., and Schmidt-Thieme, L. Attribute-aware non-linear co-
embeddings of graph features. In Proceedings of the 13th ACM Conference on Recommender
Systems (2019), pp. 314–321.
12. Rendle, S. Factorization machines. In 2010 IEEE International Conference on
Data Mining (ICDM) (2010), IEEE, pp. 995–1000.
13. Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. BPR:
Bayesian personalized ranking from implicit feedback. arXiv preprint
arXiv:1205.2618 (2012).
14. Weston, J., Bengio, S., and Usunier, N. WSABIE: Scaling up to large vocabulary
image annotation. In Proceedings of the Twenty-Second International Joint
Conference on Artificial Intelligence (IJCAI) (2011).
15. Zhang, Y., Ai, Q., Chen, X., and Croft, W. B. Joint representation learning for top-n
recommendation with heterogeneous information sources. In Proceedings of the 2017 ACM
on Conference on Information and Knowledge Management (2017), pp. 1449–1458.