TASK: Run the following code to read in the "hotel_booking_data.csv" file.
Feel free
to explore the file a bit before continuing with the rest of the exercise.
In [1]: import pandas as pd
In [2]: hotels = pd.read_csv("C:\\Users\\HP\\Desktop\\Python\\Code\\UNZIP_FOR_NOTEBOOKS_FINAL\\03-Pandas
In [3]: hotels.head()
Out[3]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_m
Resort
0 0 342 2015 July 27
Hotel
Resort
1 0 737 2015 July 27
Hotel
Resort
2 0 7 2015 July 27
Hotel
Resort
3 0 13 2015 July 27
Hotel
Resort
4 0 14 2015 July 27
Hotel
5 rows × 36 columns
TASK: How many rows are there?
In [4]: # CODE HERE
len(hotels)
Out[4]: 119390
TASK: Is there any missing data? If so, which column has the most missing data?
In [5]: # CODE HERE
hotels.isnull().sum()
Out[5]: hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 4
babies 0
meal 0
country 488
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 16340
company 112593
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
name 0
email 0
phone-number 0
credit_card 0
dtype: int64
In [6]: print(f"Yes, missing data, company column missing: {hotels['company'].isna().sum()} rows.")
Yes, missing data, company column missing: 112593 rows.
TASK: Drop the "company" column from the dataset.
In [7]: hotels.drop(columns=['company'],inplace=True)
TASK: What are the top 5 most common country codes in the dataset?
In [9]: hotels['country'].value_counts()[:5]
Out[9]: PRT 48590
GBR 12129
FRA 10415
ESP 8568
DEU 7287
Name: country, dtype: int64
TASK: What is the name of the person who paid the highest ADR (average daily rate)? How much was their
ADR?
In [10]: # CODE HERE
hotels.sort_values('adr',ascending=False)[['name','adr']].iloc[0]
Out[10]: name Daniel Walter
adr 5400.0
Name: 48515, dtype: object
TASK: The adr is the average daily rate for a person's stay at the hotel. What is the mean adr across all the
hotel stays in the dataset?
In [43]: # CODE HERE
round(hotels['adr'].mean(),2)
Out[43]: 101.83
TASK: What is the average (mean) number of nights for a stay across the entire data set? Feel free to round
this to 2 decimal points.
In [46]: # CODE HERE
total_night_stay=hotels['stays_in_week_nights']+hotels['stays_in_weekend_nights']
In [47]: round(total_night_stay.mean(),2)
Out[47]: 3.43
TASK: What is the average total cost for a stay in the dataset? Not average daily cost, but total stay cost. (You
will need to calculate total cost your self by using ADR and week day and weeknight stays). Feel free to round
this to 2 decimal points.
In [49]: # CODE HERE
total_cost=hotels['adr']*total_night_stay
In [52]: round(total_cost.mean(),2)
Out[52]: 357.85
TASK: What are the names and emails of people who made exactly 5 "Special Requests"?
In [58]: # CODE HERE
hotels[hotels['total_of_special_requests']==5][['name','email']]
Out[58]: name email
TASK: What percentage of hotel stays were classified as "repeat guests"? (Do not base this off the name of
the person, but instead of the is_repeated_guest column)
In [77]: round((hotels['is_repeated_guest']==1).sum()/len(hotels['is_repeated_guest'])*100,2)
Out[77]: 3.19
In [ ]:
TASK: What are the top 5 most common last name in the dataset? Bonus: Can you figure this out in one line
of pandas code? (For simplicity treat the a title such as MD as a last name, for example Caroline Conley MD
can be said to have the last name MD)
In [80]: #CODE HERE
first_last_name=hotels['name'].str.split()
In [82]: last_name=first_last_name.str[-1]
In [86]: hotels['name'].apply(lambda name: name.split()[1]).value_counts()[:5]
Out[86]: Smith 2510
Johnson 1998
Williams 1628
Jones 1441
Brown 1433
Name: name, dtype: int64
TASK: What are the names of the people who had booked the most number children and babies for their stay?
(Don't worry if they canceled, only consider number of people reported at the time of their reservation)
In [11]: hotels['total_kids']=hotels['babies']+hotels['children']
In [17]: hotels.sort_values('total_kids',ascending=False)[['name','adults','total_kids','babies','childre
Out[17]:
name adults total_kids babies children
328 Jamie Ramirez 2 10.0 0 10.0
46619 Nicholas Parker 2 10.0 10 0.0
78656 Marc Robinson 1 9.0 9 0.0
19718 Mr. Jeffrey Cross 2 3.0 0 3.0
107837 Albert French 2 3.0 2 1.0
... ... ... ... ... ...
119389 Ariana Michael 2 0.0 0 0.0
40600 Craig Campos 2 NaN 0 NaN
40667 David Murphy 2 NaN 0 NaN
40679 Frank Burton 3 NaN 0 NaN
41160 Jerry Roberts 2 NaN 0 NaN
119390 rows × 5 columns
TASK: What are the top 3 most common area code in the phone numbers? (Area code is first 3 digits)
In [18]: #CODE HERE
area_codes=hotels['phone-number'].str[:3]
In [20]: area_codes.value_counts()[:3]
Out[20]: 799 168
185 167
541 166
Name: phone-number, dtype: int64
TASK: How many arrivals took place between the 1st and the 15th of the month (inclusive of 1 and 15) ?
Bonus: Can you do this in one line of pandas code?
In [21]: #CODE HERE
hotels['arrival_date_day_of_month'].apply(lambda day:day in range(1,16)).sum()
Out[21]: 58152
HARD BONUS TASK: Create a table for counts for each day of the week that people arrived. (E.g. 5000 arrivals
were on a Monday, 3000 were on a Tuesday, etc..)
In [47]: def convert_to_proper(day,month,year):
return f'{day}-{month}-{year}'
In [50]: import numpy as np
hotels['date']=np.vectorize(convert_to_proper)(hotels['arrival_date_day_of_month'],
hotels['arrival_date_month'],
hotels['arrival_date_year'])
In [52]: date_to_day=hotels['date']
In [53]: date_to_day=pd.to_datetime(date_to_day)
In [55]: date_to_day.dt.day_name().value_counts()
Out[55]: Friday 19631
Thursday 19254
Monday 18171
Saturday 18055
Wednesday 16139
Sunday 14141
Tuesday 13999
Name: date, dtype: int64