Using a Language Model in a Kiosk Recommender System at
Fast-Food Restaurants

Eduard Zubchuk
ezubchuk@[Link]
Higher IT School of Tomsk State University & NTR Labs
Tomsk, Russia

Dmitry Menshikov
[Link]@[Link]
Higher IT School of Tomsk State University & NTR Labs
Moscow, Russia

Nikolay Mikhaylovskiy
nickm@[Link]
Higher IT School of Tomsk State University & NTR Labs
Moscow, Russia

arXiv:2202.04145v1 [[Link]] 8 Feb 2022

ABSTRACT
Kiosks are a popular self-service option in many fast-food restaurants:
they save time for the visitors and save labor for the fast-food chains.
In this paper, we propose an effective design of a kiosk shopping cart
recommender system that combines a language model as a vectorizer and a
neural network-based classifier. The model performs better than other
models in offline tests and exhibits performance comparable to the best
models in A/B/C tests.

CCS CONCEPTS
• Information systems → Recommender systems; Clustering and
classification; • Computing methodologies → Information extraction.

KEYWORDS
natural language processing, short text classification, neural network

1 INTRODUCTION
Kiosks are a popular self-service option in many fast-food restaurants:
they save time for the customers and save labor for the fast-food chains.
With the advent of the COVID-19 pandemic, minimizing in-person interaction
has driven faster adoption of kiosks by fast-food chains. A recommender
system for kiosks should increase revenue per visitor by creating a unique
user experience whereby:
• the fast-food restaurant visitor is regularly exposed to the
  recommendations;
• the recommendations stimulate the purchase;
• visitors’ loyalty does not degrade due to intrusiveness.

In this work, we describe the design of one of the pilot recommender
systems for kiosks of a fast-food chain, a shopping cart recommender
system. The goal of this recommender system is to recommend items based on
the interactions of the visitor with the kiosk during the session,
specifically just before checkout. The systems piloted were compared using
A/B/C tests measuring the gross margin gained from the items sold by
recommendation. Thus the navigational function of a recommender system was
out of consideration in this test. The validity of this measurement
approach was supported by the fact that virtually no visitors returned
from the shopping cart to general item selection. The system was piloted
in 100+ fast-food restaurants for a prolonged period of time.

1.1 Our contribution
The contribution of our work is twofold. First, we propose a novel
architecture for a recommender system. The architecture consists of a
vectorizer that turns a shopping cart into a vector, and a classifier of
such vectors, trained separately. Second, we show that using the FastText
model [2] as a vectorizer and a fully connected neural network (Multi-Layer
Perceptron) as a classifier delivers competitive results for a fast-food
kiosk shopping cart recommender system.

2 PROBLEM STATEMENT
The order placement process in a self-service kiosk in a fast-food
restaurant usually includes at least menu browsing and checkout. During
the browsing phase (see Figure 1), the visitor adds items of interest to
the shopping cart, while the checkout serves to validate the order and
take payment. There are a few options for placing the recommendations
during the purchase process. Our effort was focused on recommendations at
the checkout phase. Figure 2 represents the shopping cart screen. The
green (best viewed on screen) section under the line “Add to your order”
is a recommendation section that includes four items from the main menu.
Users can add these items before proceeding to payment. It is important to
note that in this layout the order of the recommended items is not
important; separate A/B/C tests have shown that the order of the
recommended items induces no statistically significant difference.

Our task was to recommend four items from the menu based on the behavior
of the visitors and show the recommendations to the visitor at the bottom
of the kiosk screen in the shopping cart, so that the visitor could add
one or more of the recommended items to the order in a single tap. The key
metric selected by the customer was the gross margin percentage gained
from items sold by recommendation, $X$, i.e. the gross margin of the items
added from the recommender block, $G_{rec}$, divided by the total gross
margin $G_{total}$ of the test segment during the selected timeframe (say,
1 day or 1 week):

$$X = \frac{G_{rec}}{G_{total}} \cdot 100,$$

where gross margin is calculated as

$$G = R - C - T,$$

and $R$ is the revenue, $C$ is the cost of the goods sold, and $T$ is the
tax.

The customer organized an online competition for a few teams developing
recommender systems. The pool of models also included the customer’s
simple baseline model that implements several straightforward business
rules such as "If the order contains a burger, then recommend a drink",
"If the order contains a burger and a drink, then recommend french fries",
etc. Thus we could freely compare, analyze, and utilize in model training
not only our historical data but the data of our competitors as well (the
same was true for the other competing teams). We had access to
the PostgreSQL database containing historical order data. The fields at
our disposal were the order ID, session ID, restaurant ID, timestamp, and
the set of purchased dishes.
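
The gross margin percentage $X$ defined in this section can be illustrated with a minimal sketch. The `SoldItem` record, its field names (including `from_recommendation` for the "purchased by recommendation" flag), and the sample numbers are assumptions made for illustration only; they are not the customer's schema or data.

```python
from dataclasses import dataclass


@dataclass
class SoldItem:
    name: str
    revenue: float             # R
    cost: float                # C, cost of the goods sold
    tax: float                 # T
    from_recommendation: bool  # hypothetical flag: item added from the recommender block


def gross_margin(item: SoldItem) -> float:
    """G = R - C - T for a single sold item."""
    return item.revenue - item.cost - item.tax


def gross_margin_percent(items: list[SoldItem]) -> float:
    """X = G_rec / G_total * 100 over all items of a test segment and timeframe."""
    g_total = sum(gross_margin(i) for i in items)
    g_rec = sum(gross_margin(i) for i in items if i.from_recommendation)
    if g_total == 0:
        raise ValueError("total gross margin is zero, X is undefined")
    return g_rec / g_total * 100


# Made-up sample: three items sold in the segment, one of them via recommendation.
sample = [
    SoldItem("burger", revenue=6.0, cost=3.0, tax=1.0, from_recommendation=False),
    SoldItem("drink", revenue=2.5, cost=1.0, tax=0.5, from_recommendation=False),
    SoldItem("fries", revenue=2.0, cost=0.5, tax=0.5, from_recommendation=True),
]
x = gross_margin_percent(sample)
print(f"X = {x:.1f}")
```

In this toy segment the total gross margin is 4.0 and the recommended item contributes 1.0, so a quarter of the margin is attributed to the recommender block.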
3 RELATED WORK
The most common approach to building recommender systems is collaborative
filtering (CF), likely first proposed by Goldberg et al. [5]. It is based
on discovering the similarities between items or users from the user-item
interaction matrix. See, for example, Su and Khoshgoftaar [10] for a
survey of older works. CF is often accompanied by the integration of item
and/or user features and by matrix factorization techniques such as SVD,
PCA, and others. See, for example, Polat and Du [9] or Vozalis et al. [14].
Various works have previously proposed CF for personal recommendations and
for reducing order time in the fast-food industry. For example, Azevedo
Filho and Wörndl [1] suggested a CF-inspired adaptive electronic menu for
cafes and restaurants aimed at increasing visitors’ satisfaction and
collecting feedback. Chao et al. [4] have also used the skip-gram
technique to retrieve dish information from restaurant reviews.
Maia and Ferreira [8] enriched the CF-based food recommendation system by
adding ingredients as features and contextual information such as
location. The idea behind this work is to explore users’ preferences in
conjunction with cultural and national features derived from the users’
location. Recent work by Gupta et al. [6] suggested an integrated solution
for cafes and restaurants. The user must enter their basic personal
details so that the system can estimate their mood and make a personal
recommendation based on their current mental condition.

Figure 1: General layout of the kiosk interface.

The work by Wang et al. [15] is likely the closest to ours in terms of
setting. In their drive-thru recommendation service for fast-food
restaurants, the authors deal with session-based data and model it as a
sequence of dishes added to the shopping cart. They utilize a transformer
neural network to model dependencies related to the order of the dishes.
They also note the significance of contextual data such as the time, the
day of the week, and the location of the restaurant, so the paper
describes an extra transformer fully dedicated to the context features.
Bonnin, Brun, and Boyer [3] were probably the first to suggest using a
language model in a recommender system, although they did not go beyond
working with artificial Internet navigation corpora. Valcarce et al.
[11, 12] have also explored statistical language models for recommender
systems, although the application area of these studies differs from the
one described in this paper. Using a vector space with language models was
first suggested by Valcarce et al. [13] in a neighborhood-based
recommender setting. Most recently, Zhang et al. [17] suggested the use of
pretrained language models, such as BERT, in recommender systems, with
limited success.

Figure 2: General layout of the shopping cart.

4 THE DATASET AND LIMITATIONS
There are several peculiarities in the data we used that stem from the
specific usage patterns of kiosks in a fast-food chain. They result in a
set of differences from the canonical recommender system datasets, which
often assume historical personalized user-to-item interactions:
• no dish ratings are available, and all the feedback is completely
  implicit;
• all orders are fully anonymous, thus personalized item-to-user
  recommendations are not feasible;
• the number of items in the menu does not exceed 300, and there were a
  few tens of thousands of orders per day, which results in a dense
  item-to-item matrix;
• a flag indicating that the item was purchased from a recommendation was
  available.

Another aspect of the data is a delay of one or two days between the
moment a new dish goes live in the restaurants and the moment it becomes
available in the database we work with. Thus we had to deal with "Out of
vocabulary" (OOV) dishes during inference. Initially, we replaced the OOV
item with the closest known one according to the Normalized Levenshtein
Distance Metric [16]. Later in this paper, we describe the final approach
we used in production.

We now provide some exploratory analysis of the dataset. The customer’s
database categorized all the items into three levels of categories. The
order history available spanned several months. Figure 3 shows the
distribution of all the purchases into categories.

Figure 3: The distribution of all the purchases into categories.

It is important to note that the top 20 items of the menu account for
almost 40% of the purchases by number (Figure 4) and almost 50% of the
purchases by revenue (Figure 5). The top 20 items account for over 92% of
the items purchased from recommendations of the previous recommender
systems. Figure 6 and Figure 7 show the distribution of the items
purchased by recommendation in terms of their number and the associated
revenue.

Figure 4: The distribution of dishes sold by the number of items.

Figure 5: The distribution of the dishes sold by revenue.

Figure 6: The distribution of recommended purchases by the number of
items sold.

Unlike in the majority of recommendation datasets, the item-item matrix
here is dense: on the one hand this provides us with a sufficient amount
of data, but on the other hand it suffers from redundancy and noise. A
peculiarity of fast-food restaurants is the significant skewness of the
distribution of dish purchases. The major driver items are burgers, cold
drinks, and side dishes like french fries, so the absolute majority of
orders contain items from these three groups. Hence, classic collaborative
filtering approaches heavily bias recommendations toward those items.

5 SUGGESTED APPROACH
We focused directly on increasing the gross margin percent and predicted
the items relevant for the current cart. Because of the skewness of the
distribution of fast-food chain visitors’ preferences, and based on the
data provided above from the previous recommender systems, recommending
just the top 8% of the menu satisfies the needs of 90% of visitors and
brings 90% of extra income. Considering that fact, we trained a model to
perform the classification of the
shopping cart into roughly 20 classes, each class representing an item to
recommend, and recommended the top 4 classes predicted for the shopping
cart.

Figure 7: The distribution of recommended purchases by revenue.

To keep the models up to date, we performed nightly training using a
sliding fortnight data window. At inference time, each dish already in the
order comes with its dish ID, dish name, and quantity, so there are two
options to deal with it: look up the dish metadata in the database by ID,
or use the dish name directly to "understand" the item. Even though the
database lookup seems to be a straightforward solution, it does not solve
the OOV problem and does not model inter-dish relationships, which might
be useful for understanding the structure of the order. To tackle those
issues we need a transformation of the dish name into a vector space of
fixed dimension, so that:
• semantically similar dishes like “hamburger” and “cheeseburger” are
  located close together, while semantically unrelated ones are far away
  from each other;
• the same holds for behaviorally similar items, so that one could
  cluster together main courses, drinks, snacks, etc. even though their
  names could be different, like “brownie” and “cherry pie”;
• the transformation is able to accurately estimate the "meaning" of
  previously unseen dishes (OOV) and properly locate them in the vector
  space.

FastText [2] fits the task well because it solves the OOV problem by
splitting previously unseen words into a set of character n-grams and can
be trained on a large amount of data in an unsupervised manner. FastText
has also shown high efficiency in the classification of short texts
comparable to the shopping cart dish list [19]. Thus, the model we suggest
contains two parts: a vectorizer that transforms the shopping cart into a
vector in the vector space, and a classifier operating on the vectors from
the previous step. Each part is trained separately. First, we train the
vectorizer in an unsupervised manner using all the available orders in the
desired timeframe. Considering the relative consistency of the menu, we
used a 3-month timeframe. Second, we train a three-layer fully connected
neural network classifier. The classifier was trained with categorical
cross-entropy loss [18] for 10 epochs. Model train and test losses during
the training process are presented in Figure 8. The model structure is
presented in Figure 9.

Figure 8: Model train and test losses.

6 RESULTS
As the project was organized as a live A/B/C test, where several developer
teams could compete in maximizing the percent of the added gross margin,
we had an opportunity to compare our results with the results of other
participants. Before going live we evaluated the models in an offline
test. We measured the ability to predict the ground-truth recommended item
purchased by the user using Mean Average Precision at k (MAP@1 to MAP@4)
[7]. We used a dataset of orders with successful recommendations collected
during a fortnight timeframe, i.e. all the orders contained an item that
was recommended by any of the recommender systems. We also used the
recommend percent metric, which is calculated as

$$rec = \frac{O_g}{O_a},$$

where $O_g$ is the number of orders in which the model guessed the next
item within its top-4 predictions and $O_a$ is the number of all orders.
Model metrics are presented in Table 1. Figure 10 demonstrates the
behavior of the models in the live A/B/C test during a twelve-day
timeframe. As the number of models evaluated simultaneously was limited,
we replaced the models in each slot from time to time. Thus, for the sake
of consistency, we only provide comparative data for a limited timeframe.

The evaluation above shows that while the fastText model excels in the
offline metrics and significantly outperforms the other models measured,
in the online test it only outperformed the models of the other
competitors. The other models we have developed performed on par with the
model described and even slightly better on average, although the
difference is not statistically significant. Still, a different model with
a more traditional architecture has been chosen for the production run.

The model described in this paper is represented in the diagram by the
label ‘NTR fasttext + NN model’. ‘NTR other model 1’ and ‘NTR other model
2’ are our models built on different principles not described here. As can
be seen, all three demonstrate similar performance in terms of extra gross
margin percent, outperforming the competitors. The mean and standard
deviation for the NTR and baseline models are presented in Table 2.
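
The subword idea that Section 5 relies on for OOV dishes can be illustrated with a toy, untrained sketch. It is not the fastText model itself: instead of learned dense embeddings it uses a sparse bag of character n-grams, and the n-gram range (3 to 5) and all function names are assumptions made for illustration. The point it shows is that a previously unseen dish name still gets a representation close to known names that share subwords with it.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    token = f"<{word}>"
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def dish_vector(name):
    """Sparse bag of character n-grams over all words of a dish name."""
    counts = Counter()
    for word in name.lower().split():
        counts.update(char_ngrams(word))
    return counts


def cart_vector(dish_names):
    """A cart is represented as the sum of its dish vectors."""
    total = Counter()
    for name in dish_names:
        total.update(dish_vector(name))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


known = dish_vector("hamburger")
unseen = dish_vector("cheeseburger")  # pretend this name was never seen in training
unrelated = dish_vector("cola")

print("hamburger vs cheeseburger:", round(cosine(known, unseen), 3))
print("cola vs cheeseburger:", cosine(unrelated, unseen))
```

The two burger names share the n-grams of the common suffix, so their similarity is clearly positive, while the unrelated name shares none. A trained fastText model additionally learns behavioral similarity from co-occurrence in orders, which this sketch does not attempt to reproduce.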
Figure 9: Model architecture.
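
The output stage of the classifier described in Section 5 can be sketched as follows: raw class scores are turned into probabilities with softmax, the four most probable classes become the recommendations, and categorical cross-entropy is the training loss. The class names and the scores below are made up for illustration, and the network that would produce the scores (the three-layer fully connected classifier) is not reproduced here.

```python
from math import exp, log


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    shift = max(logits)
    exps = [exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, target_index):
    """Loss for one example whose true class is `target_index`."""
    return -log(probs[target_index])


def top_k_recommendations(logits, class_items, k=4):
    """Return the k (item, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(class_items, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical classes (items to recommend) and hypothetical classifier outputs.
class_items = ["fries", "cola", "nuggets", "salad", "ice cream", "coffee"]
logits = [2.0, 0.5, 1.0, -1.0, 3.0, 0.0]

recommended = top_k_recommendations(logits, class_items, k=4)
print([item for item, _ in recommended])

probs = softmax(logits)
print("loss if the visitor bought ice cream:", categorical_cross_entropy(probs, 4))
print("loss if the visitor bought salad:", categorical_cross_entropy(probs, 3))
```

The loss is small when the purchased item was given a high probability and large otherwise, which is what pushes the classifier toward ranking the actually purchased items first.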
Table 1: Offline metrics of recommender models

Name                      MAP@1  MAP@2  MAP@3  MAP@4  rec percent
NTR FastText + NN Model   0.405  0.504  0.544  0.563  0.174
NTR Other Model 1         0.285  0.365  0.399  0.418  0.128
NTR Other Model 2         0.340  0.423  0.459  0.477  0.149
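
The offline metrics in Table 1 can be sketched as below. The sketch assumes exactly one ground-truth item per order, in which case average precision at k reduces to 1/rank when the item appears in the top k and 0 otherwise; the function names and the sample predictions are illustrative. Per the text, MAP@k was computed on orders with successful recommendations while the recommend percent uses all orders, so the two columns are not computed on the same set.

```python
def average_precision_at_k(ranked_items, true_item, k):
    """AP@k for one order with a single relevant item: 1/rank if in top k, else 0."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == true_item:
            return 1.0 / rank
    return 0.0


def map_at_k(all_ranked, all_true, k):
    """Mean of AP@k over all orders."""
    scores = [average_precision_at_k(r, t, k) for r, t in zip(all_ranked, all_true)]
    return sum(scores) / len(scores)


def rec_percent(all_ranked, all_true, k=4):
    """rec = O_g / O_a: share of orders whose next item is within the top-k predictions."""
    guessed = sum(1 for r, t in zip(all_ranked, all_true) if t in r[:k])
    return guessed / len(all_true)


# Made-up predictions for three orders; the visitor added "fries" in each of them.
predictions = [
    ["fries", "cola", "pie", "nuggets"],   # hit at rank 1
    ["cola", "fries", "pie", "nuggets"],   # hit at rank 2
    ["pie", "nuggets", "cola", "coffee"],  # miss
]
truth = ["fries", "fries", "fries"]

for k in (1, 2, 3, 4):
    print(f"MAP@{k} = {map_at_k(predictions, truth, k):.3f}")
print(f"rec = {rec_percent(predictions, truth):.3f}")
```

Under the single-item assumption MAP@k can only grow with k, which matches the pattern of the MAP@1 to MAP@4 columns in Table 1.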
Figure 10: Timeline of percent of the added gross margin per day.
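
One way to sanity-check the "not statistically significant" reading of the online results is to compare the per-model means and standard deviations reported in Table 2. The paper does not state which test was used; Welch's t statistic is swapped in here purely as an illustration, and the sample size of 12 daily observations per model is an assumption taken from the twelve-day timeframe mentioned in Section 6.

```python
from math import sqrt


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic computed from summary statistics of two samples."""
    return (mean_a - mean_b) / sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)


DAYS = 12  # assumption: one observation per day over the twelve-day timeframe

# Means and standard deviations as reported in Table 2.
fasttext_nn = (0.883, 0.134)
other_model_2 = (0.904, 0.112)
baseline = (0.598, 0.048)

t_between_ntr = welch_t(*other_model_2, DAYS, *fasttext_nn, DAYS)
t_vs_baseline = welch_t(*fasttext_nn, DAYS, *baseline, DAYS)

print(f"Other Model 2 vs FastText + NN: t = {t_between_ntr:.2f}")
print(f"FastText + NN vs baseline:      t = {t_vs_baseline:.2f}")
```

Under these assumptions the statistic between the two NTR models stays well below 1, while the one against the baseline is above 6, which is in line with the qualitative statements in the text.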
Table 2: Mean and standard deviation of the NTR, baseline, and competitor
models over the test timeframe

Name                      Mean   Standard Deviation
NTR FastText + NN Model   0.883  0.134
NTR Other Model 1         0.895  0.081
NTR Other Model 2         0.904  0.112
Baseline Model            0.598  0.048
Competitor 1 model        0.640  0.073
Competitor 2 model        0.832  0.033
Competitor 3 model        0.815  0.075

7 CONCLUSIONS AND FUTURE WORK
The model suggested in this paper has exhibited good performance in
real-life A/B/C tests, beating the models of all other competitors. On the
other hand, while the model presented here is significantly better than
the other models we have studied in terms of offline metrics, the
difference in online metrics is not statistically significant.

Despite the fact that the model beats the competitors and demonstrates
good performance, there is still room for improvement. Some user
preferences may depend heavily on the context, which is not considered in
the model. For instance, the majority of visitors prefer drinking coffee
in the morning rather than in the evening. However, some restaurants are
located on the highway, so coffee might be a good source of energy for
drivers at night. The popularity of cold desserts such as ice cream or
milkshakes decreases dramatically during cold periods, while the
consumption of alcoholic beverages depends on the day of the week and,
again, on the location of the restaurant. Figure 11 shows the demand
distribution for coffee (upper) and alcoholic beverages (lower) over the
time of day and the day of the week, where 0 is Monday and 6 is Sunday.
Red stands for high demand and blue for low.

Figure 11: Distribution of coffee (upper) and alcoholic beverage (lower)
consumption over the time of day and the day of week.

There are many more dependencies that are not obvious but could be
discovered in a latent manner in the process of machine learning. The
model successfully discovers dish features such as ‘main course’, ‘drink’,
‘side dish’, etc. in a latent way during unsupervised training. However,
feeding the item features to the model explicitly may also improve the
quality of the predictions. Basic data exploration shows that a regular
user follows a particular pattern while adding dishes to the cart: they
add the main course, such as a burger or a roll, first, while desserts and
snacks usually come at the end. We do not exploit this pattern in our
model so far, although it is potentially beneficial. All of the above are
directions for further research and development.

REFERENCES
[1] Paulo Henrique Azevedo Filho and Wolfgang Wörndl. 2015. An Adaptive
    Electronic Menu System for Restaurants. In IntRS@ RecSys. 41–44.
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017.
    Enriching word vectors with subword information. Transactions of the
    Association for Computational Linguistics 5 (2017), 135–146.
[3] Geoffray Bonnin, Armelle Brun, and Anne Boyer. 2008. Collaborative
    filtering inspired from language modeling. In 2008 First International
    Conference on the Applications of Digital Information and Web
    Technologies (ICADIWT). IEEE, 192–197.
[4] Chih-Yu Chao, Yi-Fan Chu, Yi Ho, Chuan-Ju Wang, and Ming-Feng Tsai. 2016.
    Dish Discovery via Word Embeddings on Restaurant Reviews. In RecSys
    Posters. Citeseer.
[5] David Goldberg, David Nichols, Brian M Oki, and Douglas Terry. 1992.
    Using collaborative filtering to weave an information tapestry. Commun.
    ACM 35, 12 (1992), 61–70.
[6] Manu Gupta, Sriniha Mourila, Sreehasa Kotte, and K Bhuvana Chandra. 2021.
    Mood Based Food Recommendation System. In 2021 Asian Conference on
    Innovation in Technology (ASIANCON). IEEE, 1–6.
[7] Kun He, Yan Lu, and Stan Sclaroff. 2018. Local descriptors optimized for
    average precision. In Proceedings of the IEEE conference on computer
    vision and pattern recognition. 596–605.
[8] Rui Maia and Joao C Ferreira. 2018. Context-aware food recommendation
    system. Context-aware food recommendation system (2018), 349–356.
[9] Huseyin Polat and Wenliang Du. 2005. SVD-based collaborative filtering
    with privacy. In Proceedings of the 2005 ACM symposium on Applied
    computing. 791–795.
[10] Xiaoyuan Su and Taghi M Khoshgoftaar. 2009. A survey of collaborative
    filtering techniques. Advances in artificial intelligence 2009 (2009).
[11] Daniel Valcarce. 2015. Exploring statistical language models for
    recommender systems. In Proceedings of the 9th ACM Conference on
    Recommender Systems. 375–378.
[12] Daniel Valcarce, Javier Parapar, and Álvaro Barreiro. 2016. Language
    models for collaborative filtering neighbourhoods. In European
    Conference on Information Retrieval. Springer, 614–625.
[13] Daniel Valcarce, Javier Parapar, and Álvaro Barreiro. 2017. Axiomatic
    analysis of language modelling of recommender systems. International
    Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 25,
    Suppl. 2 (2017), 113–127.
[14] Manolis G Vozalis, Angelos Markos, and Konstantinos G Margaritis. 2010.
    Collaborative filtering through SVD-based and hierarchical nonlinear
    PCA. In International Conference on Artificial Neural Networks.
    Springer, 395–400.
[15] Luyang Wang, Kai Huang, Jiao Wang, Shengsheng Huang, Jason Dai, and Yue
    Zhuang. 2020. Context-Aware Drive-thru Recommendation Service at Fast
    Food Restaurants. arXiv preprint arXiv:2010.06197 (2020).
[16] Li Yujian and Liu Bo. 2007. A normalized Levenshtein distance metric.
    IEEE transactions on pattern analysis and machine intelligence 29, 6
    (2007), 1091–1095.
[17] Yuhui Zhang, Hao Ding, Zeren Shui, Yifei Ma, James Zou, Anoop Deoras,
    and Hao Wang. 2021. Language Models as Recommender Systems: Evaluations
    and Limitations. In I (Still) Can’t Believe It’s Not Better! NeurIPS
    2021 Workshop.
[18] Zhilu Zhang and Mert R Sabuncu. 2018. Generalized cross entropy loss for
    training deep neural networks with noisy labels. In 32nd Conference on
    Neural Information Processing Systems (NeurIPS).
[19] Eduard Zubchuk, Dmitry Menshikov, and Nikolay Mikhaylovsky. 2021.
    Efficiency of short text classifiers for payment classification. In
    2021 International Conference on Information Technology and
    Nanotechnology (ITNT). IEEE, 1–4.