Uber Drive:
This project is based on trips made by Uber drivers. Here, we analyze different aspects of the trips through Exploratory Data Analysis.
Load the necessary libraries. Import and load the dataset with the name uber_drives.
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [4]:
ud = pd.read_csv("uberdrives.csv")
In [5]:
ud.shape
Out[5]:
(1155, 7)
Q1. Show the last 10 records of the dataset. (2 point)
In [6]:
ud.tail(10)
Out[6]:
START_DATE* END_DATE* CATEGORY* START* STOP* MILES* PURPOSE*
1145 12/30/2016 10:15 12/30/2016 10:33 Business Karachi Karachi 2.8 Errand/Supplies
1146 12/30/2016 11:31 12/30/2016 11:56 Business Karachi Karachi 2.9 Errand/Supplies
1147 12/30/2016 15:41 12/30/2016 16:03 Business Karachi Karachi 4.6 Errand/Supplies
1148 12/30/2016 16:45 12/30/2016 17:08 Business Karachi Karachi 4.6 Meeting
1149 12/30/2016 23:06 12/30/2016 23:10 Business Karachi Karachi 0.8 Customer Visit
1150 12/31/2016 1:07 12/31/2016 1:14 Business Karachi Karachi 0.7 Meeting
1151 12/31/2016 13:24 12/31/2016 13:42 Business Karachi Unknown Location 3.9 Temporary Site
1152 12/31/2016 15:03 12/31/2016 15:38 Business Unknown Location Unknown Location 16.2 Meeting
1153 12/31/2016 21:32 12/31/2016 21:50 Business Katunayake Gampaha 6.4 Temporary Site
1154 12/31/2016 22:08 12/31/2016 23:51 Business Gampaha Ilukwatta 48.2 Temporary Site
Q2. Show the first 10 records of the dataset. (2 points)
In [7]:
ud.head(10)
Out[7]:
START_DATE* END_DATE* CATEGORY* START* STOP* MILES* PURPOSE*
0 01-01-2016 21:11 01-01-2016 21:17 Business Fort Pierce Fort Pierce 5.1 Meal/Entertain
1 01-02-2016 01:25 01-02-2016 01:37 Business Fort Pierce Fort Pierce 5.0 NaN
2 01-02-2016 20:25 01-02-2016 20:38 Business Fort Pierce Fort Pierce 4.8 Errand/Supplies
3 01-05-2016 17:31 01-05-2016 17:45 Business Fort Pierce Fort Pierce 4.7 Meeting
4 01-06-2016 14:42 01-06-2016 15:49 Business Fort Pierce West Palm Beach 63.7 Customer Visit
5 01-06-2016 17:15 01-06-2016 17:19 Business West Palm Beach West Palm Beach 4.3 Meal/Entertain
6 01-06-2016 17:30 01-06-2016 17:35 Business West Palm Beach Palm Beach 7.1 Meeting
7 01-07-2016 13:27 01-07-2016 13:33 Business Cary Cary 0.8 Meeting
8 01-10-2016 08:05 01-10-2016 08:25 Business Cary Morrisville 8.3 Meeting
9 01-10-2016 12:17 01-10-2016 12:44 Business Jamaica New York 16.5 Customer Visit
Q3. Show the dimension (number of rows and columns) of the dataset. (2 points)
In [8]:
ud.shape
Out[8]:
(1155, 7)
In [9]:
print("The number of rows are ",ud.shape[0],"\nThe number of columns are",ud.shape[1])
The number of rows are 1155
The number of columns are 7
Q4. Show the size (Total number of elements) of the dataset. (2 points)
In [10]:
ud.size
Out[10]:
8085
Q5. Display the information about all the variables of the dataset. What can you infer from the output? (1 + 2 points)
Hint: Information includes - total number of columns, variable data types, number of non-null values in a variable, and memory usage
In [11]:
ud.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1155 entries, 0 to 1154
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 START_DATE* 1155 non-null object
1 END_DATE* 1155 non-null object
2 CATEGORY* 1155 non-null object
3 START* 1155 non-null object
4 STOP* 1155 non-null object
5 MILES* 1155 non-null float64
6 PURPOSE* 653 non-null object
dtypes: float64(1), object(6)
memory usage: 63.3+ KB
The PURPOSE column has 653 non-null values, which means it contains (1155 - 653) = 502 missing values. The MILES column holds continuous numeric data, while all other columns hold categorical data.
Q6. Check for missing values. (2 points)
Note: Output should contain only one boolean value
In [12]:
ud.isnull()
Out[12]:
START_DATE* END_DATE* CATEGORY* START* STOP* MILES* PURPOSE*
0 False False False False False False False
1 False False False False False False True
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
... ... ... ... ... ... ... ...
1150 False False False False False False False
1151 False False False False False False False
1152 False False False False False False False
1153 False False False False False False False
1154 False False False False False False False
1155 rows × 7 columns
In [13]:
ud.isnull().sum()
Out[13]:
START_DATE* 0
END_DATE* 0
CATEGORY* 0
START* 0
STOP* 0
MILES* 0
PURPOSE* 502
dtype: int64
Missing values are present only in the PURPOSE column of the dataset.
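Since the note asks for a single boolean, the element-wise mask from `ud.isnull()` can also be collapsed to one value; a minimal sketch, assuming the same `ud` dataframe loaded above:
In [ ]:
# Collapse the null mask to one boolean: True if any value anywhere is missing.
ud.isnull().values.any()   # True here, because PURPOSE* has 502 missing values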
Q7. How many missing values are present in the entire dataset? (2 points)
In [174]:
ud.isnull().sum()
Out[174]:
START_DATE 0
END_DATE 0
CATEGORY 0
START 0
STOP 0
MILES 0
PURPOSE 502
dtype: int64
The total number of missing values in the entire dataset is 502.
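To get that single number directly, rather than reading it off the per-column counts, one more `sum()` collapses the Series; a sketch assuming the same `ud`:
In [ ]:
# Sum the per-column null counts into one grand total of missing cells.
ud.isnull().sum().sum()   # 502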
Q8. Get the summary of the original data. (2 points).
Hint: Summary includes - count, mean, std, min, 25%, 50%, 75% and max
In [99]:
ud.describe(include="all").T
Out[99]:
count unique top freq mean std min 25% 50% 75% max
START_DATE 1155 1154 6/28/2016 23:34 2 NaN NaN NaN NaN NaN NaN NaN
END_DATE 1155 1154 6/28/2016 23:59 2 NaN NaN NaN NaN NaN NaN NaN
CATEGORY 1155 2 Business 1078 NaN NaN NaN NaN NaN NaN NaN
START 1155 176 Cary 201 NaN NaN NaN NaN NaN NaN NaN
STOP 1155 187 Cary 203 NaN NaN NaN NaN NaN NaN NaN
MILES 1155 NaN NaN NaN 10.5668 21.5791 0.5 2.9 6 10.4 310.3
PURPOSE 653 10 Meeting 187 NaN NaN NaN NaN NaN NaN NaN
Q9. Drop the missing values and store the data in a new dataframe (name it "df"). (2 points)
Note: Dataframe "df" will not contain any missing value
In [178]:
df = ud.dropna()
df
Out[178]:
START_DATE END_DATE CATEGORY START STOP MILES PURPOSE
0 01-01-2016 21:11 01-01-2016 21:17 Business Fort Pierce Fort Pierce 5.1 Meal/Entertain
2 01-02-2016 20:25 01-02-2016 20:38 Business Fort Pierce Fort Pierce 4.8 Errand/Supplies
3 01-05-2016 17:31 01-05-2016 17:45 Business Fort Pierce Fort Pierce 4.7 Meeting
4 01-06-2016 14:42 01-06-2016 15:49 Business Fort Pierce West Palm Beach 63.7 Customer Visit
5 01-06-2016 17:15 01-06-2016 17:19 Business West Palm Beach West Palm Beach 4.3 Meal/Entertain
... ... ... ... ... ... ... ...
1150 12/31/2016 1:07 12/31/2016 1:14 Business Karachi Karachi 0.7 Meeting
1151 12/31/2016 13:24 12/31/2016 13:42 Business Karachi Unknown Location 3.9 Temporary Site
1152 12/31/2016 15:03 12/31/2016 15:38 Business Unknown Location Unknown Location 16.2 Meeting
1153 12/31/2016 21:32 12/31/2016 21:50 Business Katunayake Gampaha 6.4 Temporary Site
1154 12/31/2016 22:08 12/31/2016 23:51 Business Gampaha Ilukwatta 48.2 Temporary Site
653 rows × 7 columns
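As a quick sanity check (a sketch assuming the `df` created above), the cleaned dataframe should contain no missing values at all:
In [ ]:
# Every column should now report zero nulls.
df.isnull().sum().sum()   # expected: 0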
Q10. Check the information of the dataframe (df). (1 point)
Hint: Information includes - total number of columns, variable data types, number of non-null values in a variable, and memory usage
In [73]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 653 entries, 0 to 1154
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 START_DATE* 653 non-null object
1 END_DATE* 653 non-null object
2 CATEGORY* 653 non-null object
3 START* 653 non-null object
4 STOP* 653 non-null object
5 MILES* 653 non-null float64
6 PURPOSE* 653 non-null object
dtypes: float64(1), object(6)
memory usage: 40.8+ KB
Q11. Get the unique start locations. (2 points)
Note: This question is based on the dataframe with no 'NA' values
In [106]:
df.columns = df.columns.str.replace('*', '', regex=False)   # strip the literal '*' from the column names
In [107]:
df.head()
Out[107]:
START_DATE END_DATE CATEGORY START STOP MILES PURPOSE
0 01-01-2016 21:11 01-01-2016 21:17 Business Fort Pierce Fort Pierce 5.1 Meal/Entertain
2 01-02-2016 20:25 01-02-2016 20:38 Business Fort Pierce Fort Pierce 4.8 Errand/Supplies
3 01-05-2016 17:31 01-05-2016 17:45 Business Fort Pierce Fort Pierce 4.7 Meeting
4 01-06-2016 14:42 01-06-2016 15:49 Business Fort Pierce West Palm Beach 63.7 Customer Visit
5 01-06-2016 17:15 01-06-2016 17:19 Business West Palm Beach West Palm Beach 4.3 Meal/Entertain
In [108]:
df.START.unique()
Out[108]:
array(['Fort Pierce', 'West Palm Beach', 'Cary', 'Jamaica', 'New York',
'Elmhurst', 'Midtown', 'East Harlem', 'Flatiron District',
'Midtown East', 'Hudson Square', 'Lower Manhattan',
"Hell's Kitchen", 'Downtown', 'Gulfton', 'Houston', 'Eagan Park',
'Morrisville', 'Durham', 'Farmington Woods', 'Lake Wellingborough',
'Fayetteville Street', 'Raleigh', 'Whitebridge', 'Hazelwood',
'Fairmont', 'Meredith Townes', 'Apex', 'Chapel Hill', 'Northwoods',
'Edgehill Farms', 'Eastgate', 'East Elmhurst', 'Long Island City',
'Katunayaka', 'Colombo', 'Nugegoda', 'Unknown Location',
'Islamabad', 'R?walpindi', 'Noorpur Shahan', 'Preston',
'Heritage Pines', 'Tanglewood', 'Waverly Place', 'Wayne Ridge',
'Westpark Place', 'East Austin', 'The Drag', 'South Congress',
'Georgian Acres', 'North Austin', 'West University', 'Austin',
'Katy', 'Sharpstown', 'Sugar Land', 'Galveston', 'Port Bolivar',
'Washington Avenue', 'Briar Meadow', 'Latta', 'Jacksonville',
'Lake Reams', 'Orlando', 'Kissimmee', 'Daytona Beach', 'Ridgeland',
'Florence', 'Meredith', 'Holly Springs', 'Chessington', 'Burtrose',
'Parkway', 'Mcvan', 'Capitol One', 'University District',
'Seattle', 'Redmond', 'Bellevue', 'San Francisco', 'Palo Alto',
'Sunnyvale', 'Newark', 'Menlo Park', 'Old City', 'Savon Height',
'Kilarney Woods', 'Townes at Everett Crossing', 'Huntington Woods',
'Weston', 'Seaport', 'Medical Centre', 'Rose Hill', 'Soho',
'Tribeca', 'Financial District', 'Oakland', 'Emeryville',
'Berkeley', 'Kenner', 'CBD', 'Lower Garden District', 'Storyville',
'New Orleans', 'Chalmette', 'Arabi', 'Pontchartrain Shores',
'Metairie', 'Summerwinds', 'Parkwood', 'Banner Elk', 'Boone',
'Stonewater', 'Lexington Park at Amberly', 'Winston Salem',
'Asheville', 'Topton', 'Renaissance', 'Santa Clara', 'Ingleside',
'West Berkeley', 'Mountain View', 'El Cerrito', 'Krendle Woods',
'Fuquay-Varina', 'Rawalpindi', 'Lahore', 'Karachi', 'Katunayake',
'Gampaha'], dtype=object)
In [109]:
df.START.nunique()
Out[109]:
131
Q12. What is the total number of unique start locations? (2 points)
Note: Use the original dataframe without dropping 'NA' values
In [110]:
ud.columns = ud.columns.str.replace('*', '', regex=False)   # strip the literal '*' from the column names
In [111]:
ud.head()
Out[111]:
START_DATE END_DATE CATEGORY START STOP MILES PURPOSE
0 01-01-2016 21:11 01-01-2016 21:17 Business Fort Pierce Fort Pierce 5.1 Meal/Entertain
1 01-02-2016 01:25 01-02-2016 01:37 Business Fort Pierce Fort Pierce 5.0 NaN
2 01-02-2016 20:25 01-02-2016 20:38 Business Fort Pierce Fort Pierce 4.8 Errand/Supplies
3 01-05-2016 17:31 01-05-2016 17:45 Business Fort Pierce Fort Pierce 4.7 Meeting
4 01-06-2016 14:42 01-06-2016 15:49 Business Fort Pierce West Palm Beach 63.7 Customer Visit
In [112]:
ud.START.nunique()
Out[112]:
176
Q13. What is the total number of unique stop locations? (2 points)
Note: Use the original dataframe without dropping 'NA' values.
In [113]:
ud.STOP.nunique()
Out[113]:
187
Q14. Display all Uber trips that have San Francisco as the starting point. (2 points)
Note: Use the original dataframe without dropping the 'NA' values.
In [114]:
ud[ud['START'] == 'San Francisco']
Out[114]:
START_DATE END_DATE CATEGORY START STOP MILES PURPOSE
362 05-09-2016 14:39 05-09-2016 15:06 Business San Francisco Palo Alto 20.5 Between Offices
440 6/14/2016 16:09 6/14/2016 16:39 Business San Francisco Emeryville 11.6 Meeting
836 10/19/2016 14:02 10/19/2016 14:31 Business San Francisco Berkeley 10.8 NaN
917 11-07-2016 19:17 11-07-2016 19:57 Business San Francisco Berkeley 13.2 Between Offices
919 11-08-2016 12:16 11-08-2016 12:49 Business San Francisco Berkeley 11.3 Meeting
927 11-09-2016 18:40 11-09-2016 19:17 Business San Francisco Oakland 12.7 Customer Visit
933 11-10-2016 15:17 11-10-2016 15:22 Business San Francisco Oakland 9.9 Temporary Site
966 11/15/2016 20:44 11/15/2016 21:00 Business San Francisco Berkeley 11.8 Temporary Site
Q15. What is the most popular starting point for the Uber drivers? (2
points)
Note: Use the original dataframe without dropping the 'NA' values.
Hint: Popular means the place that is visited the most.
In [185]:
ud.START.value_counts().head(1)
Out[185]:
Cary 201
Name: START, dtype: int64
'Cary' is the most popular starting point for the Uber drivers.
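An equivalent way to pull out just the label, without the count, is `idxmax()`; a sketch assuming the renamed `ud` columns:
In [ ]:
# idxmax() on value_counts() returns the most frequent value itself.
ud['START'].value_counts().idxmax()   # 'Cary'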
Q16. What is the most popular dropping point for the Uber drivers? (2
points)
Note: Use the original dataframe without dropping the 'NA' values.
Hint: Popular means the place that is visited the most
In [186]:
ud.STOP.value_counts().head(1)
Out[186]:
Cary 203
Name: STOP, dtype: int64
'Cary' is the most popular dropping point for the Uber drivers.
Q17. What is the most frequent route taken by Uber drivers? (3 points)
Note: This question is based on the new dataframe with no 'NA' values.
Hint: Print the most frequent route taken by Uber drivers (Route = combination of START & STOP points present in the dataset).
In [255]:
sta = df["START"].sort_values()
stp = df["STOP"].sort_values()
tot = sta + stp
tot.value_counts().head(5)
Out[255]:
CaryMorrisville 52
MorrisvilleCary 51
CaryCary 44
Unknown LocationUnknown Location 30
CaryDurham 30
dtype: int64
In [279]:
# Cross-check: the largest START/STOP pair count matches the top route above.
df.groupby('START')['STOP'].value_counts(ascending=False).max()
Out[279]:
52
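A cleaner way to build the route strings is to add an explicit separator, which keeps the start and stop readable; a sketch assuming the same `df`:
In [ ]:
# Build one 'START -> STOP' string per trip and count how often each route occurs.
routes = df['START'] + ' -> ' + df['STOP']
routes.value_counts().head(1)   # Cary -> Morrisville, 52 trips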
Q18. Display all types of purposes for the trip in an array. (2 points)
Note: This question is based on the new dataframe with no 'NA' values.
In [139]:
df.PURPOSE.unique()
Out[139]:
array(['Meal/Entertain', 'Errand/Supplies', 'Meeting', 'Customer Visit',
'Temporary Site', 'Between Offices', 'Charity ($)', 'Commute',
'Moving', 'Airport/Travel'], dtype=object)
Q19. Plot a bar graph of Purpose vs Miles (Distance). What can you infer from the plot? (2 + 2 points)
Note: Use the original dataframe without dropping the 'NA' values.
Hint: You have to plot the total (sum of) miles per purpose.
In [147]:
plt.figure(figsize=(12, 5))
sns.barplot(x=ud['PURPOSE'], y=ud['MILES'])   # default estimator plots the mean miles per purpose
plt.xticks(rotation=45)
plt.show()
With seaborn's default (mean) estimator, the Commute purpose shows the highest average miles per trip.
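The hint asks for the total miles per purpose rather than the mean that `sns.barplot` shows by default; a minimal sketch of that aggregation and plot, assuming the renamed `ud` columns:
In [ ]:
# Total (not average) miles clocked for each purpose, plotted as a bar chart.
miles_by_purpose = ud.groupby('PURPOSE')['MILES'].sum().sort_values(ascending=False)
plt.figure(figsize=(12, 5))
miles_by_purpose.plot(kind='bar')
plt.ylabel('Total miles')
plt.xticks(rotation=45)
plt.show()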
Q20. Display a dataframe of Purpose and the total distance travelled for that
particular Purpose. (3 points)
Note: Use the original dataframe without dropping "NA" values
In [193]:
ud[['PURPOSE', 'MILES']]
Out[193]:
PURPOSE MILES
0 Meal/Entertain 5.1
1 None 5.0
2 Errand/Supplies 4.8
3 Meeting 4.7
4 Customer Visit 63.7
... ... ...
1150 Meeting 0.7
1151 Temporary Site 3.9
1152 Meeting 16.2
1153 Temporary Site 6.4
1154 Temporary Site 48.2
1155 rows × 2 columns
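The cell above lists the raw columns; to get one row per Purpose with its total distance, a groupby aggregation can be used. A sketch, assuming the renamed `ud` columns (trips with a missing PURPOSE are left out of the grouping by default):
In [ ]:
# One row per purpose with the summed miles for that purpose.
ud.groupby('PURPOSE')['MILES'].sum().reset_index()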
Q21. Generate a plot showing count of trips vs category of trips. What can you infer from the plot? (2 + 1 points)
Note: Use the original dataframe without dropping the 'NA' values.
In [259]:
sns.countplot(x=ud['CATEGORY'])
plt.xticks(rotation = 45)
plt.show()
The maximum count of trips falls under the Business category, i.e. Uber is used far more for business purposes than for personal ones.
Q22. What percentage of Miles was clocked under the Business category and what percentage under the Personal category? (3 points)
Note: Use the original dataframe without dropping the 'NA' values.
In [284]:
ud.groupby('CATEGORY')['MILES'].sum()
Out[284]:
CATEGORY
Business 11487.0
Personal 717.7
Name: MILES, dtype: float64
In [218]:
[11487.0/(11487.0 + 717.7)] # For Business Purpose
Out[218]:
[0.9411947856153776]
In [219]:
[717.7/(11487.0 + 717.7)] # For Personal Purpose
Out[219]:
[0.058805214384622315]
94.12% of miles were clocked under the Business category and 5.88% under the Personal category.
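The same split can be computed in one step from the grouped totals; a sketch assuming the renamed `ud` columns:
In [ ]:
# Share of total miles per category, in percent.
miles_by_category = ud.groupby('CATEGORY')['MILES'].sum()
(miles_by_category / miles_by_category.sum() * 100).round(2)   # Business ~94.12, Personal ~5.88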
THE END