NAME : Kanade Shubhada Sanjay
ROLL NO. : 65
DIV : A
MINI-PROJECT
Zomato-rating-prediction
1. Importing the libraries
In [1]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
1.1 Loading the dataset
In [2]: data = pd.read_csv('../input/zomato-bangalore-restaurants/zomato.csv')
In [3]: data
Out[3]: [DataFrame preview truncated in the export; visible columns include url,
address, name, online_order and book_table.]
51717 rows × 17 columns
1.2 Checking the shape of the dataset
In [4]: data.shape
Out[4]: (51717, 17)
There are 51,717 samples in total, with 17 features.
In [5]: data.columns
Out[5]: Index(['url', 'address', 'name', 'online_order', 'book_table', 'rate', 'votes',
'phone', 'location', 'rest_type', 'dish_liked', 'cuisines',
'approx_cost(for two people)', 'reviews_list', 'menu_item',
'listed_in(type)', 'listed_in(city)'],
dtype='object')
1.3 Checking the data types
In [6]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
# Column Non-Null Count Dtype
0 url 51717 non-null object
1 address 51717 non-null object
2 name 51717 non-null object
3 online_order 51717 non-null object
4 book_table 51717 non-null object
5 rate 43942 non-null object
6 votes 51717 non-null int64
7 phone 50509 non-null object
8 location 51696 non-null object
9 rest_type 51490 non-null object
10 dish_liked 23639 non-null object
11 cuisines 51672 non-null object
12 approx_cost(for two people) 51371 non-null object
13 reviews_list 51717 non-null object
14 menu_item 51717 non-null object
15 listed_in(type) 51717 non-null object
16 listed_in(city) 51717 non-null object
dtypes: int64(1), object(16)
memory usage: 6.7+ MB
There are many object-type columns; later we will convert the relevant ones to
numeric types.
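As a quick way to see which of those object columns actually hold numeric-looking values, a small sketch (not part of the original notebook) could be:

# Sketch: count distinct values in each object-typed column; 'rate' and
# 'approx_cost(for two people)' store numbers as strings and are the ones
# converted to numeric types in the cleaning steps below.
object_cols = data.select_dtypes(include='object').columns
print(data[object_cols].nunique().sort_values())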
2. Data Cleaning
2.1 Checking the missing values
In [7]: data.isnull().sum()
Out[7]: url 0
address 0
name 0
online_order 0
book_table 0
rate 7775
votes 0
phone 1208
location 21
rest_type 227
dish_liked 28078
cuisines 45
approx_cost(for two people) 346
reviews_list 0
menu_item 0
listed_in(type) 0
listed_in(city) 0
dtype: int64
There are many null values. We can clearly see that the 'rate', 'phone', 'location',
'rest_type', 'dish_liked', 'cuisines' and 'approx_cost(for two people)' columns have
missing values, so first we have to handle them.
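To see how severe each gap is before deciding between dropping and imputing, a minimal sketch of the per-column missing percentage (computed on the same `data` frame) could be:

# Sketch: share of missing values per column, to judge how much data dropna() would cost.
missing_pct = data.isnull().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])

Around 54% of 'dish_liked' is missing (28078 of 51717 rows), while the other affected columns have comparatively small gaps.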
2.2 Removing unnecessary columns from the data
In [8]: df = data.drop(['url', 'phone'], axis = 1) # dropped 'url' and 'phone' columns
In [9]: df.head()
Out[9]: [Preview of the first five rows of df after dropping 'url' and 'phone';
visible columns include address, name, online_order, book_table, rate, votes,
location, rest_type and dish_liked.]
2.3 Handling the null or missing values
In [10]: df.dropna(inplace = True)
In [11]: df.isnull().sum()
Out[11]: address 0
name 0
online_order 0
book_table 0
rate 0
votes 0
location 0
rest_type 0
dish_liked 0
cuisines 0
approx_cost(for two people) 0
reviews_list 0
menu_item 0
listed_in(type) 0
listed_in(city) 0
dtype: int64
Now there are no null values.
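Dropping every row with any missing value is the simplest option, but it discards a large share of the data, mostly because of 'dish_liked'. A hedged alternative sketch (not the approach used in this notebook; shown for comparison only) would be to drop that sparse column instead of its rows:

# Sketch of an alternative to df.dropna(): 'dish_liked' accounts for most of the
# missing values, so dropping the column (instead of its rows) keeps far more data.
df_alt = data.drop(['url', 'phone', 'dish_liked'], axis=1).dropna()
print(df_alt.shape)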
2.4 Checking and handling duplicate values
In [12]: df.duplicated().sum()
Out[12]: 11
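There are 11 duplicated rows. If you want to inspect them before dropping, a minimal sketch (not in the original notebook) would be:

# Sketch: view the duplicated rows (both occurrences) before dropping them.
dupes = df[df.duplicated(keep=False)].sort_values('name')
print(dupes[['name', 'address', 'rate', 'votes']])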
In [13]: df.drop_duplicates(inplace = True)
df.duplicated().sum()
Out[13]: 0
Now there are no duplicate values.
2.5 Renaming the columns appropriately
In [14]: df = df.rename(columns = {'approx_cost(for two people)':'cost',
'listed_in(type)':'type', 'listed_in(city)': 'city'})
In [15]: df.head()
Out[15]: [Preview of the first five rows of df; 'approx_cost(for two people)',
'listed_in(type)' and 'listed_in(city)' now appear as 'cost', 'type' and 'city'.]
Successfully renamed the columns.
2.6 Cleaning the "cost" column
In [16]: df['cost'].unique()
Out[16]: array(['800', '300', '600', '700', '550', '500', '450', '650', '400',
'750', '200', '850', '1,200', '150', '350', '250', '1,500',
'1,300', '1,000', '100', '900', '1,100', '1,600', '950', '230',
'1,700', '1,400', '1,350', '2,200', '2,000', '1,800', '1,900',
'180', '330', '2,500', '2,100', '3,000', '2,800', '3,400', '40',
'1,250', '3,500', '4,000', '2,400', '1,450', '3,200', '6,000',
'1,050', '4,100', '2,300', '120', '2,600', '5,000', '3,700',
'1,650', '2,700', '4,500'], dtype=object)
Here we can see that the data points are strings, and some values such as 5,000 and
6,000 contain a comma (','). We have to remove the ',' from the values and convert
them to a numeric type.
In [17]: df['cost'] = df['cost'].apply(lambda x: x.replace(',', ''))  # strip the ',' thousands separator
df['cost'] = df['cost'].astype(float)
df['cost'].unique()
Out[17]: array([ 800., 300., 600., 700., 550., 500., 450., 650., 400.,
750., 200., 850., 1200., 150., 350., 250., 1500., 1300.,
1000., 100., 900., 1100., 1600., 950., 230., 1700., 1400.,
1350., 2200., 2000., 1800., 1900., 180., 330., 2500., 2100.,
3000., 2800., 3400., 40., 1250., 3500., 4000., 2400., 1450.,
3200., 6000., 1050., 4100., 2300., 120., 2600., 5000., 3700.,
1650., 2700., 4500.])
Now we have successfully converted the values to a numeric type.
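The lambda above works here because rows with a missing 'cost' were already dropped. A slightly more defensive sketch (my own variation, not the notebook's approach) that tolerates NaNs and is safe to re-run would be:

# Sketch: defensive version of the cost conversion; NaN-tolerant and idempotent.
df['cost'] = pd.to_numeric(df['cost'].astype(str).str.replace(',', '', regex=False),
                           errors='coerce')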
2.7 Handling the "rate" column
In [18]: df['rate'].unique()
Out[18]: array(['4.1/5', '3.8/5', '3.7/5', '4.6/5', '4.0/5', '4.2/5', '3.9/5',
'3.0/5', '3.6/5', '2.8/5', '4.4/5', '3.1/5', '4.3/5', '2.6/5',
'3.3/5', '3.5/5', '3.8 /5', '3.2/5', '4.5/5', '2.5/5', '2.9/5',
'3.4/5', '2.7/5', '4.7/5', 'NEW', '2.4/5', '2.2/5', '2.3/5',
'4.8/5', '3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5', '2.9 /5',
'2.7 /5', '2.5 /5', '2.6 /5', '4.5 /5', '4.3 /5', '3.7 /5',
'4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '3.4 /5', '3.6 /5',
'3.3 /5', '4.6 /5', '4.9 /5', '3.2 /5', '3.0 /5', '2.8 /5',
'3.5 /5', '3.1 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
'2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)
The rating column is also string-typed; we have to convert it to a numeric type by
removing the '/5' suffix from the values.
There is also a 'NEW' value, which carries no actual rating, so we have to remove those rows.
In [19]: df = df.loc[df.rate != 'NEW'] # getting rid of 'NEW' entries
In [20]: df['rate'].unique()
Out[20]: array(['4.1/5', '3.8/5', '3.7/5', '4.6/5', '4.0/5', '4.2/5', '3.9/5',
'3.0/5', '3.6/5', '2.8/5', '4.4/5', '3.1/5', '4.3/5', '2.6/5',
'3.3/5', '3.5/5', '3.8 /5', '3.2/5', '4.5/5', '2.5/5', '2.9/5',
'3.4/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5', '4.8/5',
'3.9 /5', '4.2 /5', '4.0 /5', '4.1 /5', '2.9 /5', '2.7 /5',
'2.5 /5', '2.6 /5', '4.5 /5', '4.3 /5', '3.7 /5', '4.4 /5',
'4.9/5', '2.1/5', '2.0/5', '1.8/5', '3.4 /5', '3.6 /5', '3.3 /5',
'4.6 /5', '4.9 /5', '3.2 /5', '3.0 /5', '2.8 /5', '3.5 /5',
'3.1 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5', '2.1 /5',
'2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)
In [21]: df['rate'] = df['rate'].apply(lambda x: x.replace('/5', ''))  # strip the '/5' suffix
df['rate'].unique()
Out[21]: array(['4.1', '3.8', '3.7', '4.6', '4.0', '4.2', '3.9', '3.0', '3.6',
'2.8', '4.4', '3.1', '4.3', '2.6', '3.3', '3.5', '3.8 ', '3.2',
'4.5', '2.5', '2.9', '3.4', '2.7', '4.7', '2.4', '2.2', '2.3',
'4.8', '3.9 ', '4.2 ', '4.0 ', '4.1 ', '2.9 ', '2.7 ', '2.5 ',
'2.6 ', '4.5 ', '4.3 ', '3.7 ', '4.4 ', '4.9', '2.1', '2.0', '1.8',
'3.4 ', '3.6 ', '3.3 ', '4.6 ', '4.9 ', '3.2 ', '3.0 ', '2.8 ',
'3.5 ', '3.1 ', '4.8 ', '2.3 ', '4.7 ', '2.4 ', '2.1 ', '2.2 ',
'2.0 ', '1.8 '], dtype=object)
In [22]: df['rate'] = df['rate'].apply(lambda x: float(x))  # float() also trims the stray spaces
df['rate']
Out[22]: 0 4.1
1 4.1
2 3.8
3 3.7
4 3.8
...
51705 3.8
51707 3.9
51708 2.8
51711 2.5
51715 4.3
Name: rate, Length: 23248, dtype: float64
Now our data is cleaned and we can perform visualization.
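If the same cleaning ever needs to be re-applied to a fresh copy of the data, the rate steps above can be folded into one helper. This is only a sketch, assuming the raw column contains 'x.y/5' strings (possibly with stray spaces), 'NEW', or NaN, as the outputs above suggest:

# Sketch: consolidated 'rate' cleaning, mirroring the steps applied cell by cell above.
def clean_rate(frame):
    out = frame[frame['rate'].notna() & (frame['rate'] != 'NEW')].copy()
    out['rate'] = (out['rate'].str.replace('/5', '', regex=False)
                              .str.strip()
                              .astype(float))
    return out

# Hypothetical usage: cleaned = clean_rate(data.dropna())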
3. Data Visualization
3.1 Most famous restaurant chains in Bangalore
In [23]: plt.figure(figsize = (17,10))
chains = df['name'].value_counts()[:20]
sns.barplot(x = chains, y= chains.index, palette= 'deep')
plt.title('Most famous restaurant chains in Bangalore')
plt.xlabel('Number of outlets')
plt.show()
Insights:
'Onesta', 'Empire Restaurant' & 'KFC' are the most famous restaurant chains in Bangalore.
3.2 Checking online order availability
In [24]: v = df['online_order'].value_counts()
fig = plt.gcf()
fig.set_size_inches((10,6))
cmap = plt.get_cmap('Set3')
color = cmap(np.arange(len(v)))
plt.pie(v, labels = v.index, wedgeprops = dict(width = 0.6), autopct = '%0.02f', colors = color, shadow = True)
plt.title('Online orders', fontsize = 20)
plt.show()
Insight:
Most restaurants offer the option of online ordering and delivery.
3.3 Checking table booking availability
In [25]: v = df['book_table'].value_counts()
fig = plt.gcf()
fig.set_size_inches((8,6))
cmap = plt.get_cmap('Set1')
color = cmap(np.arange(len(v)))
plt.pie(v, labels = v.index, wedgeprops = dict(width = 0.6), autopct = '%0.02f', colors = color, shadow = True)
plt.title('Book Table', fontsize = 20)
plt.show()
Insight:
Most restaurants don't offer table booking.
3.4 Rating Distribution
In [26]: plt.figure(figsize = (9,7))
sns.distplot(df['rate'])
plt.title('Rating Distribution')
Out[26]: Text(0.5, 1.0, 'Rating Distribution')
Insight:
We can infer from the plot above that most of the ratings lie between 3.5 and 4.5.
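sns.distplot still works in the seaborn version used here but is deprecated in recent releases; a hedged equivalent with the current API (assuming seaborn >= 0.11) would be:

# Sketch: same rating distribution using the non-deprecated seaborn API.
plt.figure(figsize=(9, 7))
sns.histplot(df['rate'], kde=True, stat='density')
plt.title('Rating Distribution')
plt.show()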