Here’s a beginner-friendly explanation of key evaluation metrics used in
Recommendation Systems — perfect for your PG-DBDA studies.
📊 Evaluation Metrics in Recommendation Systems
These metrics help us measure how good or bad our recommendations are.
Let’s divide them into 2 main categories:
🔹 1. Classification-Based Metrics
Used when you're recommending top-N items (like top 5 movies).
✅ Precision@K
What % of the recommended items were actually relevant?
📌 Formula (simplified):
\text{Precision@K} = \frac{\text{Relevant items in top K}}{K}
🧠 Example:
If system recommends 5 movies and 3 are actually liked →
Precision@5 = 3/5 = 0.6
✅ Recall@K
What % of all relevant items were recommended?
📌 Formula (simplified):
\text{Recall@K} = \frac{\text{Relevant items in top K}}{\text{Total relevant items for user}}
🧠 Example:
If user liked 4 movies and 2 of them were in the top 5 →
Recall@5 = 2/4 = 0.5
✅ Mean Reciprocal Rank (MRR)
Looks at how early the first correct recommendation appears.
📌 Formula:
\text{MRR} = \frac{1}{\text{Rank of first relevant item}}
🧠 Example:
If first relevant movie is at position 3 →
MRR = 1/3 = 0.33
✅ MAP@K (Mean Average Precision)
Average of precision scores at the rank positions where
relevant items appear.
It rewards systems that place relevant items early in the list.
🧠 Easy Idea:
Higher MAP@K = more useful and ordered recommendations.
✅ NDCG@K (Normalized Discounted Cumulative Gain)
Focuses on the ranking quality of relevant items — relevant ones
should appear higher.
📌 Key Points:
Gives more weight to relevant items that appear early.
NDCG is normalized, so it's always between 0 and 1.
🧠 If your relevant items are at top ranks → NDCG is close to 1.
🔹 2. Rating Prediction Metrics
Used when you're predicting how much a user would like an item (e.g., star
rating).
✅ RMSE (Root Mean Squared Error)
Measures the average difference between predicted and actual
ratings.
\text{RMSE} = \sqrt{ \frac{1}{n} \sum (\text{predicted} - \text{actual})^2 }
🧠 Smaller RMSE = Better predictions
✅ MAE (Mean Absolute Error)
Average of the absolute differences between predicted and actual
ratings.
\text{MAE} = \frac{1}{n} \sum |\text{predicted} - \text{actual}|
🧠 Easier to understand, but RMSE penalizes big errors more.
🎯 Summary Table:
Metric      | Best for          | What it tells you
Precision@K | Top-N             | % of recommendations that were useful
Recall@K    | Top-N             | % of all relevant items you recommended
MRR         | Ranking           | Position of the first correct recommendation
MAP@K       | Ranking           | Average precision across all relevant items
NDCG@K      | Ranking           | Ranking quality (relevant items early)
RMSE        | Rating prediction | How far predictions are from actual ratings
MAE         | Rating prediction | Simple error measurement
Would you like a PDF cheat sheet, MCQs, or diagram for these?
Absolutely! Let’s break down Precision@K in very simple terms using a
relatable real-life analogy 🌟
🎯 What is Precision@K?
Precision@K measures how many of the top K recommendations
are actually relevant to the user.
"K" = the number of items you recommended (like top 5)
Precision = % of those that the user liked or found useful
🧵 Analogy: Your Friend Recommends You Restaurants
Imagine:
You ask your foodie friend: "Hey, suggest 5 good restaurants for my
weekend!"
They give you a list of 5 restaurants — these are your Top-K
recommendations (here, K = 5).
After trying them, you find that 3 were actually good, and 2 were bad.
📌 Precision@5 =
\frac{\text{Relevant items in top 5}}{\text{Total recommended (K)}} = \frac{3}{5} = 0.6 = 60\%
So, your friend had a Precision@5 = 60%
📘 Real ML Example:
Say a recommendation system gives you Top 10 movies
You actually liked 7 of them
Precision@10 = 7 / 10 = 0.7
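💻 A minimal Python sketch (the function name and item IDs are made up just to mirror the Top-10 example above):
```python
# Illustrative Precision@K with plain Python (binary relevance given as a set of liked items).
def precision_at_k(recommended, relevant, k):
    """recommended: ranked list of item IDs; relevant: set of items the user liked."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Top 10 movies, the user liked 7 of them -> Precision@10 = 0.7
recommended = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"]
relevant = {"m1", "m2", "m3", "m5", "m6", "m8", "m9"}
print(precision_at_k(recommended, relevant, k=10))  # 0.7
```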
💡 Why is it Useful?
Precision@K tells us:
Out of the items we showed, how many were useful?
It’s great when you care about relevance in the shortlist — like
shopping or top search results.
⚠️Important Notes:
High precision ≠ complete coverage (You might miss some relevant
ones)
It doesn’t care about the order (ranking) — only whether the
recommended item was good
Often used with Recall@K for a more complete picture
✅ Summary:
Term        | Meaning
K           | Number of items recommended
Relevant    | Items the user actually likes
Precision@K | Relevant items in top-K / K
Would you like a visual example, a graph, or practice MCQs next?
Absolutely! Let’s explain Recall@K in a super simple and intuitive way — just
like we did for Precision@K — with a clear real-life analogy 🌟
🎯 What is Recall@K?
Recall@K tells us:
"Out of all the items the user actually liked, how many did we
successfully recommend in the top K?"
It measures coverage — are we catching most of what the user
wants?
It cares about finding as many relevant items as possible, not how
many we show.
🧵 Analogy: Favorite Chocolates in a Surprise Box
Imagine:
You have 10 favorite chocolates 🍫
A friend gives you a surprise box of 5 chocolates (K = 5)
You open the box and find that 3 of your 10 favorites are inside
📌 Recall@5 =
\frac{\text{Number of your favorites in the box}}{\text{Total number of your favorites}} = \frac{3}{10} = 0.3 = 30\%
So, your friend had a Recall@5 = 30%
📘 Real ML Example:
Let’s say:
A user has 20 songs in their "liked" list
The recommendation system shows them Top 10 songs
Out of those 10, 6 were from the user's liked list
📌 Recall@10 =
\frac{6}{20} = 0.3 = 30\%
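💻 A minimal Python sketch (illustrative song IDs, chosen so that 6 of the user's 20 liked songs appear in the top 10):
```python
# Illustrative Recall@K with plain Python (binary relevance given as a set of liked items).
def recall_at_k(recommended, relevant, k):
    """recommended: ranked list of item IDs; relevant: set of all items the user liked."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant) if relevant else 0.0

recommended = [f"s{i}" for i in range(1, 11)]                                # top 10 recommended songs
relevant = {f"s{i}" for i in range(1, 7)} | {f"x{i}" for i in range(1, 15)}  # 6 + 14 = 20 liked songs
print(recall_at_k(recommended, relevant, k=10))                              # 6 / 20 = 0.3
```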
💡 Why is it Useful?
Recall@K tells us: "How many of the user's favorites did we find?"
It’s especially important when we don’t want to miss relevant options —
like in medical diagnosis or news filtering
⚖️Precision@K vs Recall@K (Quick Comparison)
Metric      | Answers the question…                             | Focus
Precision@K | Of what I recommended, how much was relevant?     | Quality / Accuracy
Recall@K    | Of all relevant things, how many did I recommend? | Coverage / Completeness
⚠️Key Points:
High recall means you found most relevant items, even if you showed
a lot of irrelevant ones too.
Low recall = you missed many relevant items
✅ Summary:
Term     | Meaning
K        | Number of items recommended
Relevant | Items the user actually likes
Recall@K | Relevant items in top-K / Total relevant items
Would you like a combined Precision vs Recall example, a graph, or MCQs
to practice?
Sure! Let’s understand MRR (Mean Reciprocal Rank) in the same easy and
analogy-based way as Precision@K and Recall@K.
🎯 What is MRR (Mean Reciprocal Rank)?
MRR tells us how early the first correct (relevant) item appears
in the list of recommendations.
📦 Real-Life Analogy: Finding Your Friend’s House in a List
Imagine:
You’re visiting a new city and searching for your friend’s house on Google
Maps.
The search result gives you a ranked list of 5 houses with similar
names.
Your friend’s real house is at position 2 in the list.
📌 Reciprocal Rank = 1 / position = 1 / 2 = 0.5
Now imagine:
You do this for 5 different friends.
You calculate the reciprocal rank for each.
MRR = Average of those 5 reciprocal ranks.
📘 Real ML Example:
Let’s say a recommender system gives you:
User | First relevant item rank | Reciprocal Rank
A    | 1                        | 1.00
B    | 3                        | 1/3 ≈ 0.33
C    | 2                        | 0.50
📌 MRR =
\frac{1 + 0.33 + 0.5}{3} \approx 0.61
That means, on average, users are finding their first useful recommendation
fairly early in the list.
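💻 A minimal Python sketch (the ranked lists are invented so that users A, B and C match the table above):
```python
# Illustrative MRR over several users.
def reciprocal_rank(recommended, relevant):
    """Return 1/rank of the first relevant item, or 0.0 if none appears in the list."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# First relevant item at rank 1, 3 and 2 for users A, B, C
ranked_lists = [["a1", "x1", "x2"], ["y1", "y2", "b1"], ["z1", "c1", "z2"]]
relevant_sets = [{"a1"}, {"b1"}, {"c1"}]
rr = [reciprocal_rank(r, s) for r, s in zip(ranked_lists, relevant_sets)]
print(sum(rr) / len(rr))  # (1 + 0.333 + 0.5) / 3 ≈ 0.61
```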
🧠 Why Is MRR Useful?
It's a ranking-focused metric.
Perfect when you're only interested in the first hit (like Google search,
FAQs, chatbot answers).
It rewards systems that put the best result at the top.
🟩 Summary Table:
Term            | Meaning
Rank            | Position of the first relevant item in the list
Reciprocal Rank | 1 / Rank
MRR             | Average of reciprocal ranks over all users or queries
🆚 MRR vs Precision vs Recall:
Metric      | Focus                                 | Best When…
Precision@K | How many of the top K are good        | You care about quality in the shortlist
Recall@K    | How many relevant items were found    | You care about not missing anything
MRR         | How early the first good item appears | You care about the rank of the first hit
Would you like an MRR coding example, graphical illustration, or MCQs
next?
Absolutely! Let’s explain MAP@K (Mean Average Precision at K) in a
beginner-friendly, analogy-based way — just like we did for Precision@K,
Recall@K, and MRR. 🌟
🎯 What is MAP@K?
MAP@K stands for Mean Average Precision at K.
It measures how well the relevant items are ranked within the
top K recommendations, across multiple users.
So unlike MRR (which looks at only the first correct hit), MAP@K:
Considers multiple relevant items
Rewards the system if relevant items are ranked higher
Averages performance across users
🧵 Real-Life Analogy: Shopping Suggestions
Imagine an online store shows you Top 5 suggested products.
Out of those 5, you actually like 3 of them, and they’re at positions 1, 3, and
4.
Let's break it down:
📍 Precision at Each Hit:
Position | Is it relevant? | Precision@position
1        | ✅ Yes          | 1/1 = 1.0
2        | ❌ No           | –
3        | ✅ Yes          | 2/3 ≈ 0.67
4        | ✅ Yes          | 3/4 = 0.75
5        | ❌ No           | –
🧮 Average Precision (AP) =
\frac{1.0 + 0.67 + 0.75}{3} \approx 0.806
That’s your AP@5 (Average Precision at K = 5).
Now do this for multiple users, and average the results:
📌 MAP@K = Mean of all AP@K scores
📘 Real ML Example:
User | Relevant hits (out of top 5)    | AP@5
A    | 3 relevant at positions 1, 3, 4 | ≈ 0.806
B    | 2 relevant at positions 2, 5    | (1/2 + 2/5) / 2 = 0.45
C    | 1 relevant at position 1        | 1.0
✅ MAP@5 = (0.806 + 0.45 + 1.0) / 3 ≈ 0.75
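💻 A minimal Python sketch (illustrative item IDs; it follows the same convention as the worked example, averaging precision only over the relevant hits found in the top K):
```python
# Illustrative AP@K and MAP@K with binary relevance.
def average_precision_at_k(recommended, relevant, k):
    hits, precisions = 0, []
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)        # precision at each relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

users = [  # (top-5 list, set of relevant items): hits at positions (1,3,4), (2,5), (1)
    (["a1", "n1", "a2", "a3", "n2"], {"a1", "a2", "a3"}),
    (["n3", "b1", "n4", "n5", "b2"], {"b1", "b2"}),
    (["c1", "n6", "n7", "n8", "n9"], {"c1"}),
]
aps = [average_precision_at_k(rec, rel, k=5) for rec, rel in users]
print(aps)                  # ≈ [0.806, 0.45, 1.0]
print(sum(aps) / len(aps))  # MAP@5 ≈ 0.75
```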
🧠 Why Use MAP@K?
It’s a balanced metric: it checks how many relevant items we return
and how well they are ranked
It rewards systems that put more relevant results at the top
Great for ranking problems like recommendations, search engines, etc.
🆚 MAP@K vs Others:
Metric      | Focus                                    | Best For
Precision@K | How many recommended items were good     | Shortlists
Recall@K    | How much of the relevant set was covered | Coverage
MRR         | First correct item's position            | FAQs, search
MAP@K       | All relevant items + their ranks         | Recommender systems, search
✅ Summary:
Term               | Meaning
K                  | Top K recommendations considered
Precision@position | Precision at each position where a relevant item is found
AP@K               | Average of precision@position over the relevant hits
MAP@K              | Mean of AP@K across all users/queries
Would you like:
A visual diagram,
A Python code example, or
MCQs based on MAP@K?
Let’s now explore NDCG@K (Normalized Discounted Cumulative Gain) in a
simple, beginner-friendly way — using an analogy just like we did for
Precision@K, Recall@K, MRR, and MAP@K. 🌟
🎯 What is NDCG@K?
NDCG@K measures the ranking quality of your recommendations.
It rewards placing more relevant items higher in the list.
It’s used when relevance has levels (like 0 = not relevant, 1 = relevant, 2 =
highly relevant, etc.).
🧵 Real-Life Analogy: Movie Night Picks 🎬
Imagine your movie app recommends 5 movies to you.
You rate their relevance (how much you liked them):
Position | Movie | Your Rating (Relevance)
1        | A     | 2 (Highly relevant)
2        | B     | 0 (Not relevant)
3        | C     | 1 (Somewhat relevant)
4        | D     | 2 (Highly relevant)
5        | E     | 0 (Not relevant)
You liked movies A, C, and D, but Movie A is the most relevant and it’s at
the top, which is great!
Now we compute how good this ranking is using DCG and NDCG.
📌 Step 1: DCG@K (Discounted Cumulative Gain)
\text{DCG@K} = \text{relevance}_1 + \frac{\text{relevance}_2}{\log_2(2)} + \frac{\text{relevance}_3}{\log_2(3)} + \dots
📍 For our list (Top 5):
\text{DCG@5} = 2 + \frac{0}{\log_2(2)} + \frac{1}{\log_2(3)} + \frac{2}{\log_2(4)} + \frac{0}{\log_2(5)} \approx 2 + 0 + 0.63 + 1 + 0 = 3.63
📌 Step 2: IDCG@K (Ideal DCG)
This is the best possible order — highest relevance first.
Let’s rearrange the same movies based on best-case ranking:
Rank | Movie | Relevance
1    | A     | 2
2    | D     | 2
3    | C     | 1
4    | B     | 0
5    | E     | 0
Now calculate ideal DCG:
\text{IDCG@5} = 2 + \frac{2}{\log_2(2)} + \frac{1}{\log_2(3)} + \dots \approx 2 + 2 + 0.63 = 4.63
📌 Step 3: NDCG@K = DCG@K / IDCG@K
\text{NDCG@5} = \frac{3.63}{4.63} \approx 0.78
✅ So your ranking was 78% as good as the perfect ranking.
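💻 A minimal Python sketch (it uses the same discount convention as the formula above, where only positions 2 onward are discounted; note that some libraries discount every position by log2(i + 1) instead):
```python
import math

# Illustrative DCG/NDCG@K for the movie example above (graded relevance).
def dcg_at_k(relevances, k):
    """relevances: graded relevance scores in recommended order; the rank-1 item is undiscounted."""
    dcg = 0.0
    for i, rel in enumerate(relevances[:k], start=1):
        dcg += rel if i == 1 else rel / math.log2(i)
    return dcg

def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)   # best possible ordering of the same items
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

relevances = [2, 0, 1, 2, 0]                   # movies A–E in recommended order
print(round(dcg_at_k(relevances, 5), 2))       # ≈ 3.63
print(round(ndcg_at_k(relevances, 5), 2))      # ≈ 0.78
```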
💡 Why Use NDCG@K?
It accounts for graded relevance (not just yes/no)
It gives higher weight to relevant items at the top
It’s normalized, so the score is always between 0 (bad) and 1 (perfect)
✅ Summary:
Term   | Meaning
DCG@K  | Total relevance score, discounted by position
IDCG@K | Best possible DCG (perfect ranking)
NDCG@K | How close your ranking is to the ideal (0 to 1)
🆚 NDCG vs Other Metrics:
Metric      | Handles ranking? | Handles graded relevance?     | Normalized?
Precision@K | ❌ No            | ❌ No                          | ❌ No
Recall@K    | ❌ No            | ❌ No                          | ❌ No
MAP@K       | ✅ Yes           | ❌ No (binary relevance only)  | ❌ No
NDCG@K      | ✅ Yes           | ✅ Yes (graded)                | ✅ Yes
Would you like:
A Python code example of NDCG?
Practice MCQs based on this?
Or a PDF cheatsheet of all ranking metrics?
Great! Let's explain RMSE (Root Mean Squared Error) in the context of
Recommendation Systems — in a simple, beginner-friendly way, perfect for
PG-DBDA.
📊 What is RMSE in Recommendation?
RMSE measures how far off your predicted ratings are from the
actual ratings given by the users.
In simple terms:
🎯 “How accurate is your rating prediction?”
📦 Real-Life Analogy: Movie Rating Prediction
Imagine:
You predict your friend will rate "Inception" ⭐⭐⭐⭐ (4 stars).
But your friend actually rates it ⭐⭐⭐ (3 stars).
This is an error of 1.
Now do this for many predictions, calculate how far each is off, square those
errors (to remove negatives), average them, and take the square root.
📌 That's RMSE.
🧠 Formula for RMSE:
\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (\hat{r}_i - r_i)^2 }
Where:
\hat{r}_i = predicted rating
r_i = actual rating
n = number of predictions
📘 Example:
Item | Predicted Rating | Actual Rating | Error | Squared Error
A    | 4.0              | 5.0           | -1.0  | 1.00
B    | 3.5              | 3.0           | 0.5   | 0.25
C    | 2.0              | 2.0           | 0.0   | 0.00
D    | 4.5              | 4.0           | 0.5   | 0.25
📌 RMSE =
\sqrt{ \frac{1 + 0.25 + 0 + 0.25}{4} } = \sqrt{\frac{1.5}{4}} = \sqrt{0.375} \approx 0.61
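💻 A minimal NumPy sketch reproducing the table above (scikit-learn's mean_squared_error can also be used if you take its square root):
```python
import numpy as np

# RMSE on the four predictions from the example table.
predicted = np.array([4.0, 3.5, 2.0, 4.5])
actual    = np.array([5.0, 3.0, 2.0, 4.0])

rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(round(rmse, 2))   # ≈ 0.61
```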
✅ Why RMSE is Used in Recommendation Systems:
Feature                 | Explanation
🎯 Measures Accuracy    | Tells how close predicted ratings are to the real ones
💥 Penalizes Big Errors | Larger mistakes are squared, so they count more
📈 Used for Evaluation  | Lower RMSE = better model performance
📌 When to Use RMSE?
Use RMSE when:
You're predicting continuous ratings (like 1–5 stars)
You want to penalize large errors more than small ones
🆚 RMSE vs MAE:
Metric | Description                           | Sensitive to big errors?
RMSE   | Square root of the mean squared error | ✅ Yes
MAE    | Mean of the absolute errors           | ❌ Less sensitive
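💻 A small made-up example showing how a single large mistake inflates RMSE much more than MAE:
```python
import numpy as np

# Three perfect predictions plus one large 4-star error (ratings invented for illustration).
actual    = np.array([3.0, 4.0, 2.0, 5.0])
predicted = np.array([3.0, 4.0, 2.0, 1.0])

mae  = np.mean(np.abs(predicted - actual))          # (0 + 0 + 0 + 4) / 4 = 1.0
rmse = np.sqrt(np.mean((predicted - actual) ** 2))  # sqrt(16 / 4) = 2.0
print(mae, rmse)  # 1.0 2.0 -> RMSE is twice as large, driven entirely by the one big error
```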
🎯 Key Points to Remember:
RMSE ∈ [0, ∞); lower is better
RMSE = 0 means perfect prediction
Common baseline: Compare to a model that always predicts average
rating
Would you like:
A code example using RMSE with scikit-learn or surprise library?
Practice MCQs based on RMSE?
A graphical explanation?