Problem Statement
An e-commerce company wants to recommend products to its users.
So far, the company has collected only transaction data. The
training dataset has just 3 columns: user_id, the category of the
product bought (category) and the order value of the product (aov).
Using this dataset, predict, for every user in the training dataset,
the top 3 categories that the user might buy from.
Training dataset sample
aov = Order Value of the product
category = Product Category where the purchase was made
What do you need to predict?
For each user, predict the top 3 probable product categories that they
may purchase from, in the future.
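As a point of reference only, the task above can be sketched with a naive frequency baseline: for each user, return the categories they have bought from most often. This is not the organizers' method or a recommended model. The function name `top3_by_frequency` and the (user_id, category, aov) tuple layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def top3_by_frequency(transactions):
    """Return {user_id: up to 3 most frequently purchased categories}.

    `transactions` is assumed to be an iterable of
    (user_id, category, aov) tuples; aov is ignored by this baseline.
    """
    per_user = defaultdict(Counter)
    for user_id, category, _aov in transactions:
        per_user[user_id][category] += 1
    # most_common(3) orders categories by purchase count, highest first.
    return {
        user_id: [category for category, _count in counts.most_common(3)]
        for user_id, counts in per_user.items()
    }


# Made-up transactions, for illustration only.
sample = [
    (1, "Phones", 250.0),
    (1, "Phones", 300.0),
    (1, "Phones", 120.0),
    (1, "Fruits", 5.0),
    (1, "Fruits", 7.5),
    (1, "Comics", 3.0),
    (2, "Groceries", 12.0),
]
baseline = top3_by_frequency(sample)
print(baseline)
```

A user with fewer than 3 distinct categories in their history gets a shorter list, so a real submission would need some fallback (for example, globally popular categories) to fill the remaining slots.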
Timeline
DEADLINE EXTENDED
SUBMISSIONS OPEN SAT JUN 26
LAST DATE SAT JUL 17
Training Data
This file contains the detailed purchasing history for every user. Each
record has the order value and the category of the product.
Training Data Target
This file contains, for some users, the categories of the items they
bought in the future.
Test Data
This file contains the detailed purchasing history for some users. It has
the order value and the category of the product. You have to predict
the top 3 categories that the users with these user_ids will purchase
from in the future.
Evaluation
Submissions will be scored on mean reciprocal rank
(MRR) and precision. Both measurements are explained below.
Mean Reciprocal Rank
User id | Products in the order shown                   | Product bought                                 | Reciprocal Rank
1       | E-readers, Kitchen Supplies, Technology books | Phones, Comics, Technology Books               | 1/3
2       | Phones, Comics, Fruits                        | None                                           | 0
3       | Groceries, Fruits, Phones                     | None                                           | 0
4       | Phones, Home Decor, readers                   | Fruits, Home Decor, readers                    | 1/2
5       | Phones, Books, Fruits                         | Home Decor, Home Furnishings, Kitchen Supplies | 0
Technology Books is the 3rd top prediction for user_id 1 and the user
bought it, so the reciprocal rank is 1/rank of the right prediction,
which is 1/3. If more than one product matches, the reciprocal rank
still takes only the first matching product. For instance, for
user_id 4, both Home Decor and readers match, but the first matching
product is at position 2, so the reciprocal rank is 1/2. Once we have
the reciprocal ranks, we average them to get the mean reciprocal rank.
In this example, the average is taken over the two users with a match
(user_ids 1 and 4).
Final MRR
= (1/3 + 1/2) / 2 ≈ 0.4167
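The MRR arithmetic above can be reproduced with a short sketch. Two points are assumptions inferred from the worked example rather than stated rules: category names are compared case-insensitively (the table writes both "Technology books" and "Technology Books"), and the mean is taken only over users with at least one match, which is what yields (1/3 + 1/2) / 2. The official scorer may differ on both. The function names are hypothetical.

```python
from fractions import Fraction


def reciprocal_rank(predicted, bought):
    """Return 1/rank of the first predicted category the user bought, else 0."""
    # Assumption: matching ignores case and surrounding whitespace.
    bought_norm = {b.strip().lower() for b in bought}
    for rank, category in enumerate(predicted, start=1):
        if category.strip().lower() in bought_norm:
            return Fraction(1, rank)
    return Fraction(0)


def mean_reciprocal_rank(rows):
    """Average the non-zero reciprocal ranks, mirroring the worked example."""
    ranks = [reciprocal_rank(predicted, bought) for predicted, bought in rows]
    matched = [r for r in ranks if r > 0]
    return sum(matched) / len(matched) if matched else Fraction(0)


# (predicted categories in order, categories actually bought) per user_id 1..5
example = [
    (["E-readers", "Kitchen Supplies", "Technology books"],
     ["Phones", "Comics", "Technology Books"]),
    (["Phones", "Comics", "Fruits"], []),
    (["Groceries", "Fruits", "Phones"], []),
    (["Phones", "Home Decor", "readers"],
     ["Fruits", "Home Decor", "readers"]),
    (["Phones", "Books", "Fruits"],
     ["Home Decor", "Home Furnishings", "Kitchen Supplies"]),
]
mrr = mean_reciprocal_rank(example)
print(mrr, float(mrr))
```

Using `Fraction` keeps the intermediate values exact, so the result can be compared directly with the hand calculation.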
Precision or Accuracy
We first count, in each row, the number of products in the prediction
that match the products bought by that user_id. We then average
this number across all valid predictions. For the above table, precision
would look like this:
User id | Products in the order shown                   | Product bought                                 | Precision
1       | E-readers, Kitchen Supplies, Technology books | Phones, Comics, Technology Books               | 1
2       | Phones, Comics, Fruits                        | None                                           | NA
3       | Groceries, Fruits, Phones                     | None                                           | NA
4       | Phones, Home Decor, readers                   | Fruits, Home Decor, readers                    | 2
5       | Phones, Books, Fruits                         | Home Decor, Home Furnishings, Kitchen Supplies | NA
For user_id 1, one product matched and for user_id 4, two products
matched. So, accuracy is
number of items that matched / number of unique users with a valid
prediction (rows not marked NA).
Here it will be 3/2 = 1.5
Recall, in this case, is the share of users for whom there is a valid
prediction = 2/5 = 0.4
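The precision and recall figures above can be checked with a small sketch. It assumes, based on the NA rows in the table, that users with zero matching products are excluded from the precision average, and that recall is the share of users with at least one match. Case-insensitive matching is also an assumption, and the function names are hypothetical.

```python
def match_count(predicted, bought):
    """Count predicted categories that appear among the bought categories."""
    # Assumption: matching ignores case and surrounding whitespace.
    bought_norm = {b.strip().lower() for b in bought}
    return sum(1 for c in predicted if c.strip().lower() in bought_norm)


def precision_and_recall(rows):
    """Return (precision, recall) following the worked example's convention."""
    counts = [match_count(predicted, bought) for predicted, bought in rows]
    valid = [c for c in counts if c > 0]  # rows shown as NA are dropped
    precision = sum(valid) / len(valid) if valid else 0.0
    recall = len(valid) / len(rows) if rows else 0.0
    return precision, recall


# (predicted categories in order, categories actually bought) per user_id 1..5
example_rows = [
    (["E-readers", "Kitchen Supplies", "Technology books"],
     ["Phones", "Comics", "Technology Books"]),
    (["Phones", "Comics", "Fruits"], []),
    (["Groceries", "Fruits", "Phones"], []),
    (["Phones", "Home Decor", "readers"],
     ["Fruits", "Home Decor", "readers"]),
    (["Phones", "Books", "Fruits"],
     ["Home Decor", "Home Furnishings", "Kitchen Supplies"]),
]
precision, recall = precision_and_recall(example_rows)
print(precision, recall)
```

Note that "precision" as defined here is an average match count per valid user, so it can exceed 1 (up to 3 for a top-3 prediction).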
Ready to submit?
Submissions should be made in the same format as the sample
submission provided.
Sample Prediction Dataset
The prediction dataset should be a .csv file with 19,981 rows (plus one
header row) and the columns user_id and pred3, in the same format as the
file below.
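Before uploading, the shape requirements stated above (a header row, the columns user_id and pred3, and 19,981 data rows) can be sanity-checked with a minimal sketch like the one below. The function name `check_submission` is hypothetical, the column order is assumed to be user_id followed by pred3, and the sketch does not validate the contents of pred3, since its exact format is defined by the sample file.

```python
import csv
import io

EXPECTED_COLUMNS = ["user_id", "pred3"]  # assumed column order
EXPECTED_ROWS = 19_981


def check_submission(handle, expected_rows=EXPECTED_ROWS):
    """Return a list of problems; an empty list means the shape looks right."""
    reader = csv.reader(handle)
    header = next(reader, None)
    problems = []
    if header != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {header}")
    # Count non-empty data rows after the header.
    n_rows = sum(1 for row in reader if row)
    if n_rows != expected_rows:
        problems.append(f"expected {expected_rows} data rows, found {n_rows}")
    return problems


# Tiny in-memory demo with placeholder values, not real predictions.
demo = io.StringIO("user_id,pred3\n1,placeholder\n2,placeholder\n")
demo_problems = check_submission(demo, expected_rows=2)
print(demo_problems)
```

For a real file, the same function can be called as `check_submission(open(path, newline=""))`, where `path` points to the submission .csv.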