Uber Data Analysis Using Machine Learning
ABSTRACT
This paper demonstrates how an Uber dataset covering Uber pickups in New
York City can be analysed. Uber is classified as a peer-to-peer (P2P) platform: the service
connects riders with drivers who drive them to their desired location. The dataset
contains primary data on Uber pickups, including the date,
time, and longitude-latitude coordinates of each trip. The paper uses this information to describe
how to utilise the k-means clustering algorithm to categorise the various sections of
New York City. Because the sector is expanding and is projected to continue growing
in the near future, effective taxi dispatching will allow each driver and passenger to
spend less time looking for each other. The model is used to forecast demand at
various locations throughout the city.
CHAPTER 1: INTRODUCTION
Introduction:-
Uber connects you with drivers who can drive you to your desired location or
destination. This dataset contains primary data on Uber pickups in New York City,
including the date, time of trip, and longitude and latitude information. Uber operates
in over 900 metropolitan regions globally. Applying the k-means clustering algorithm
to these records lets us predict the frequency of trips in different parts of the city.
Title:- “How cities use regulation for innovation: the case of Uber, Lyft and Sidecar in San
Francisco”
Year:- 2017
Authors:- Onesimo Flores, Lisa Rayle
Abstract:-
How do government actors facilitate or hinder private innovation in urban mobility,
and how does local context mediate this relationship? In this paper we examine the
regulatory response to on-demand ride services—or “ridesourcing”—through a case
study of San Francisco, CA. The entry of Lyft, Sidecar, and UberX in San Francisco
in 2012 raised serious questions about the legality of ridesourcing, and sparked
significant conflict within regulatory agencies. After sustained debate, regulators
decided to welcome the services provided by new companies and crafted a new
regulatory framework that legalized the provision of for-profit, on-demand ride
services using personal vehicles. We ask, given strong arguments on each side, what
motivated public officials in each city to facilitate, rather than hinder, the new
services? How did they achieve regulatory reform?
3.2 Drawbacks
Existing approaches suffer from forecasting errors and a risk of overfitting on large
datasets, so the analysis delivered to the company ends up being inefficient and
ineffective. To overcome this problem, we predict cab pickups from clusters of
coordinate points obtained by applying the k-means clustering algorithm.
3.4 Advantages
Hardware:
OS – Windows 7, 8 or 10 (32- and 64-bit)
RAM – 4 GB
Software:
Anaconda Navigator
Jupyter Notebook (Python language)
A feasibility study here means a practical assessment of implementing the proposed system. For a
machine learning project, we generally collect the input data from online sources, filter it,
visualise it in graphical form, and then divide it into training and testing sets.
The training and testing data are given to the algorithms to make predictions.
1. First, we take the dataset.
2. Filter the dataset according to requirements and create a new dataset whose
attributes match the analysis to be done.
3. Perform pre-processing on the dataset.
4. Split the data into training and testing sets.
5. Train the model with the training data, then analyse the testing dataset with the
classification algorithm.
6. Finally, you will get the results as accuracy metrics (a minimal sketch of steps 3-6 follows this list).
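The following sketch illustrates steps 3 to 6 with scikit-learn; the file name, feature columns and the 'label' column are placeholders rather than the actual project data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv('dataset.csv')                      # hypothetical filtered dataset (steps 1-2)
X = data.drop(columns=['label'])                       # attributes chosen for the analysis
y = data['label']                                      # target attribute

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # step 4

model = RandomForestClassifier(random_state=42)        # step 5: train on the training data
model.fit(X_train, y_train)

print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))  # step 6: accuracy metric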
DATA COLLECTION
Data collection is the process of gathering information from many sources,
which is later used to develop the machine learning models. The data should be
stored in a way that makes sense for the problem. In this step the dataset is converted
into an understandable format that can be fed into machine learning models.
The data used in this paper is the Uber pickup dataset for New York City described above. This step
is concerned with selecting the subset of all available data that you will be
working with. ML problems start with data, preferably lots of data (examples or
observations), for which you already know the target answer. Data for which you
already know the target answer is called labelled data.
DATA PRE-PROCESSING
Organize your selected data by formatting, cleaning and sampling from it.
Three common data pre-processing steps are:
Formatting: The data you have selected may not be in a format that is suitable for you
to work with. The data may be in a relational database and you would like it in a flat
file, or the data may be in a proprietary file format and you would like it in a
relational database or a text file.
Cleaning: Cleaning data is the removal or fixing of missing data. There may be data
instances that are incomplete and do not carry the data you believe you need to
address the problem. These instances may need to be removed. Additionally, there
may be sensitive information in some of the attributes and these attributes may need
to be anonymized or removed from the data entirely.
Sampling: There may be far more selected data available than you need to work with.
More data can result in much longer running times for algorithms and larger
computational and memory requirements. You can take a smaller representative
sample of the selected data that may be much faster for exploring and prototyping
solutions before considering the whole dataset.
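A short, hedged pandas illustration of these three steps (the file and column names are examples, not the project's actual data):
import pandas as pd

raw = pd.read_csv('raw_data.csv')                              # formatting: load the source into a flat DataFrame
clean = raw.dropna(subset=['Date/Time', 'Lat', 'Lon'])         # cleaning: drop incomplete instances
clean = clean.drop(columns=['DriverName'], errors='ignore')    # cleaning: remove a (hypothetical) sensitive attribute
sample = clean.sample(frac=0.1, random_state=1)                # sampling: smaller representative subset for prototyping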
FEATURE EXTRACTION
The next step is feature extraction, which is an attribute-reduction process. Unlike feature
selection, which ranks the existing attributes according to their predictive
significance, feature extraction actually transforms the attributes. The transformed
attributes, or features, are linear combinations of the original attributes. Finally, our
models are trained using a classifier algorithm. We use the classify module of the Natural
Language Toolkit (NLTK) library in Python on the labelled dataset we gathered. The rest of
our labelled data is used to evaluate the models. Machine learning
algorithms are then used to classify the pre-processed data; the chosen classifier is
Random Forest. These algorithms are very popular in classification tasks.
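As a hedged illustration of this idea, the sketch below uses PCA to build linear-combination features from toy data and then trains the Random Forest classifier on them; it is not the project's exact pipeline.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # toy attribute matrix (200 observations, 10 attributes)
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy labels

pca = PCA(n_components=3)                      # extracted features = linear combinations of the originals
X_features = pca.fit_transform(X)

clf = RandomForestClassifier(random_state=0).fit(X_features, y)   # train the chosen classifier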
EVALUATION MODEL
Model evaluation is an integral part of the model development process. It helps to
find the model that best represents our data and indicates how well the chosen model will work
in the future. Evaluating model performance with the data used for training is not
acceptable in data science because it can easily generate overoptimistic and overfitted
models. There are two common methods of evaluating models in data science, hold-out and
cross-validation. To avoid overfitting, both methods use a test set (not seen by the
model) to evaluate model performance.
The performance of each classification model is estimated based on its average score. The result
is presented in visual form, with the classified data represented as graphs.
Accuracy is defined as the percentage of correct predictions for the test data. It can be
calculated easily by dividing the number of correct predictions by the total number of
predictions.
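For example, with five test predictions of which four are correct:
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]                  # labels of the test data
y_pred = [1, 0, 0, 1, 0]                  # model predictions
print(accuracy_score(y_true, y_pred))     # 4 correct / 5 total = 0.8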
4.3 UML Diagrams
The Unified Modeling Language (UML) is used to specify, visualize, modify,
construct and document the artifacts of an object-oriented software intensive system
under development. UML offers a standard way to visualize a system's architectural
blueprints, including elements such as:
actors
business processes
(logical) components
activities
programming language statements
database schemas, and
Reusable software components.
UML combines best techniques from data modeling (entity relationship diagrams),
business modeling (work flows), object modeling, and component modeling. It can be
used with all processes, throughout the software development life cycle, and across
different implementation technologies. UML has synthesized the notations of the
Booch method, the Object-modeling technique (OMT) and Object-oriented software
engineering (OOSE) by fusing them into a single, common and widely usable
modeling language. UML aims to be a standard modeling language which can model
concurrent and distributed systems.
Sequence Diagram:
Sequence diagrams represent the objects participating in an interaction horizontally
and time vertically. A use case is a kind of behavioral classifier that represents a
declaration of an offered behavior. Each use case specifies some behavior, possibly
including variants that the subject can perform in collaboration with one or more
actors. Use cases define the offered behavior of the subject without reference to its
internal structure. These behaviors, involving interactions between the actor and the
subject, may result in changes to the state of the subject and communications with its
environment. A use case can include possible variations of its basic behavior,
including exceptional behavior and error handling.
Activity Diagrams-:
Activity diagrams are graphical representations of Workflows of stepwise activities
and actions with support for choice, iteration and concurrency. In the Unified
Modeling Language, activity diagrams can be used to describe the business and
operational step-by-step workflows of components in a system. An activity diagram
shows the overall flow of control.
Usecase diagram:
UML is a standard language for specifying, visualizing, constructing, and
documenting the artifacts of software systems.
UML was created by Object Management Group (OMG) and UML 1.0 specification
draft was proposed to the OMG in January 1997.
OMG is continuously putting in effort to make UML a true industry standard.
UML stands for Unified Modeling Language.
UML is a pictorial language used to make software blueprints.
Class diagram
The class diagram is the main building block of object-oriented modeling. It is used
for general conceptual modeling of the structure of the application, and for detailed
modeling that translates the models into programming code. Class diagrams can also be
used for data modeling.[1] The classes in a class diagram represent both the main
elements and interactions in the application, and the classes to be programmed.
In the diagram, classes are represented with boxes that contain three compartments:
The top compartment contains the name of the class. It is printed in bold and centered,
and the first letter is capitalized.
The middle compartment contains the attributes of the class. They are left-aligned and
the first letter is lowercase.
The bottom compartment contains the operations the class can execute. They are also
left-aligned and the first letter is lowercase.
!pip3 -q install numpy pandas matplotlib seaborn geopy folium scipy scikit-learn tensorflow  #datetime is in the standard library; scikit-learn is the installable package name
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import geopy.distance
from math import radians,cos,sin,asin,sqrt
import folium
import datetime
from folium.plugins import HeatMap
from scipy.stats import ttest_ind
matplotlib.rcParams.update({'font.size': 12})
!ls ../input/uber-pickups-in-new-york-city/uber-raw-data-jul14.csv
uber_data = pd.read_csv('../input/uber-pickups-in-new-york-city/uber-raw-data-jul14.csv')
**The type is str! Let's convert it to datetime format for easy indexing**
uber_data['Date/Time'] = pd.to_datetime(uber_data['Date/Time'])
**Let us divide each hour in existing Date/Time column into four smaller bins of 15 mins each:**
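**(The binning step itself is not shown in this extract; a minimal assumed reconstruction:)**
#Assumed reconstruction: floor each timestamp to its 15-minute bin
uber_data['BinnedHour'] = uber_data['Date/Time'].dt.floor('15min')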
**Let us visualize the total uber rides per day in the month of July 2014**
plt.figure(figsize=(15,8))
uber_data['BinnedHour'].dt.day.value_counts().sort_index().plot(kind='bar',color='green')
for item in plt.gca().get_xticklabels():
item.set_rotation(45)
plt.title('Uber Rides per day in July 2014 at NYC')
plt.xlabel('Days')
_=plt.ylabel('Rides')
**Observe the nearly recurring pattern in the data! It is very noticeable after day 11.**
**Let us have a closer look at it, say every 15 minutes from July 1 to July 31.**
plt.figure(figsize=(15,8))
uber_data['BinnedHour'].value_counts().sort_index().plot(c='darkblue',alpha=0.8)
plt.title('Uber Rides every 15 mins in the month of July at NYC')
plt.xlabel('Days')
_=plt.ylabel('No. of Rides')
**The underlying trend is clearly visible now. It conveys that in a day there are times when the pickups
are very low and very high, and they seem to follow a pattern.**
**Q) Which times correspond to the highest and lowest peaks in the plot above?**
uber_data['BinnedHour'].value_counts()
**The highest peak corresponds to the time 19:15(7:15 PM), 15th July 2014 and has a ride count of
915 and the lowest peak corresponds to the time 02:30, 7th July 2014 and has a ride count of 10**
**Now, Lets visualize the week wise trends in the data. For it, we have to map each date into its day
name using a dictionary**
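**(The date/day columns are not shown in this extract; an assumed reconstruction using pandas' built-in day names instead of an explicit dictionary:)**
#Assumed reconstruction of the Date and Day columns
uber_data['Date'] = uber_data['BinnedHour'].dt.date
uber_data['Day'] = uber_data['BinnedHour'].dt.day_name()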
#Defining ordered category of week days for easy sorting and visualization
uber_data['Day']=pd.Categorical(uber_data['Day'],categories=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],ordered=True)
#Separating time from the "BinnedHour" Column
uber_data['Time']=uber_data['BinnedHour'].dt.time
weekly_data = uber_data.groupby(['Date','Day','Time']).count().dropna().rename(columns={'BinnedHour':'Rides'})['Rides'].reset_index()
weekly_data.head(10)
**Grouping weekly_data by days to plot total rides per week in july 2014.**
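#Assumed reconstruction (not shown in this extract): total rides per weekday
daywise = weekly_data.groupby('Day')['Rides'].sum()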
#Creating a customized color palette for custom hue according to height of bars
vals = daywise.to_numpy().ravel()
normalized = (vals - np.min(vals)) / (np.max(vals) - np.min(vals))
indices = np.round(normalized * (len(vals) - 1)).astype(np.int32)
palette = sns.color_palette('Reds', len(vals))
colorPal = np.array(palette).take(indices, axis=0)
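#Assumed plotting step (not shown in this extract): bar chart of total rides per weekday, coloured with colorPal
plt.figure(figsize=(12,6))
daywise.plot(kind='bar',color=list(colorPal))
_=plt.title('Total Uber rides per weekday in July 2014 at NYC')
_=plt.ylabel('Rides')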
**According to the bar plot above, rides are maximum on Thursdays and minimum on Sundays.
Sundays having the lowest number of rides makes sense logically, as it's a holiday and people often
take rest on that day.**
weekly_data = weekly_data.groupby(['Day','Time'])['Rides'].mean().unstack(level=0) #pivot to a Time x Day grid for the heatmap and later day-wise selection
weekly_data.head(10)
plt.figure(figsize=(15,15))
sns.heatmap(weekly_data,cmap='Greens')
_=plt.title('Heatmap of average rides in time vs day grid')
**The heatmap indicates that the maximum average uber rides occur around 5:30PM to 6:15PM on
Wednesdays and Thursdays and their values fall between 550 to 620.**
plt.figure(figsize=(15,12))
weekly_data.plot(ax=plt.gca())
_=plt.title('Average rides per day vs time')
_=plt.ylabel('Average rides')
plt.locator_params(axis='x', nbins=10)
plt.figure(figsize=(15,10))
weekly_data.T.mean().plot(c = 'black')
_=plt.title('Average uber rides on any day in July 2014 at NYC')
plt.locator_params(axis='x', nbins=10)
**This plot further confirms that the average rides on any given day is lowest around 2 AM and highest
in the around 5:30 PM.**
**Now, let's try visualizing the relationship between Base and total number of rides in July 2014:**
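**(The plotting code for this step is not in this extract; an assumed reconstruction using the raw base codes — the names "Weiter" and "Danach-NY" quoted below presumably come from a code-to-name mapping of those bases:)**
plt.figure(figsize=(10,6))
uber_data['Base'].value_counts().plot(kind='bar',color='purple')
_=plt.title('Total rides per Base in July 2014')
_=plt.xlabel('Base')
_=plt.ylabel('Rides')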
**The above plot tells us that most uber rides originated from Weiter Base and least from Danach-
NY**
**To know more about the distribution of latitudes and longitudes, let's plot their histograms along
with KDEs**
plt.figure(figsize=(10,10))
sns.histplot(uber_data['Lat'], bins='auto',kde=True,color='r',alpha=0.4,label = 'latitude')
plt.legend(loc='upper right')
plt.xlabel('Latitude')
plt.twiny()
sns.histplot(uber_data['Lon'], bins='auto',kde=True,color='g',alpha=0.4,label = 'longitude')
_=plt.legend(loc='upper left')
_=plt.xlabel('Longitude')
_=plt.title('Distribution of Latitude and Longitude')
**Most latitudes are around 40.75, and longitudes around -73.98. This is expected as the dataset comprises
information only around New York City. It also indicates that most rides happen around (lat,lon) =
(40.75,-73.98)**
plt.figure(figsize=(12,12))
sns.scatterplot(x='Lat',y='Lon',data=uber_data,edgecolor='None',alpha=0.5,color='darkblue')
plt.xlabel('Latitude')
plt.ylabel('Longitude')
_=plt.title('Latitude - Longitude Scatter Plot')
**The dark blue area in the center shows the regions in New York City that had most number of uber
rides in July 2014. The plot is better understood when a geographical map is placed underneath**
**Let's use geopy to calculate the distance between the Metropolitan Museum and the Empire State
Building**
#This is an example of using geopy
metro_art_coordinates = (40.7794,-73.9632)
empire_state_building_coordinates = (40.7484,-73.9857)
distance = geopy.distance.distance(metro_art_coordinates,empire_state_building_coordinates)
print("Distance = ",distance)
**Using geopy on a larger dataset may be time consuming on slower PC's. Hence let's use the
haversine method**
def haversine(coordinates1,coordinates2):
    #convert the coordinate pairs from degrees to radians
    lat1,lon1 = radians(coordinates1[0]),radians(coordinates1[1])
    lat2,lon2 = radians(coordinates2[0]),radians(coordinates2[1])
    #differences in latitude and longitude
    dlat = lat2-lat1
    dlon = lon2-lon1
    a = sin(dlat/2)**2 + cos(lat1)*cos(lat2)*sin(dlon/2)**2
    c = 2*asin(sqrt(a))
    r = 3956 #Earth radius in miles
    return c*r
print("Distance (mi) = ",haversine(metro_art_coordinates,empire_state_building_coordinates))
**Now, let's try to predict which place each ride is closer to, MM or ESB. This can be done by
individually calculating the distance between each uber ride's coordinates and the MM or ESB coordinates.
If a ride lies within a particular threshold radius of MM, then we can predict that the ride is
going to MM. Similarly for ESB.**
#Now, let's keep a threshold of 0.25 miles and calculate the number of points that are closer to MM and ESB
#according to this threshold
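#Assumed reconstruction (not in this extract): distance of every ride from MM and ESB, then counts within 0.25 mi
#(row-wise apply is slow on ~800k rows but keeps the sketch simple)
uber_data['Distance MM'] = uber_data.apply(lambda row: haversine((row['Lat'],row['Lon']),metro_art_coordinates),axis=1)
uber_data['Distance ESB'] = uber_data.apply(lambda row: haversine((row['Lat'],row['Lon']),empire_state_building_coordinates),axis=1)
print('Rides within 0.25 mi of MM :',(uber_data['Distance MM']<0.25).sum())
print('Rides within 0.25 mi of ESB:',(uber_data['Distance ESB']<0.25).sum())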
**The result above shows the number of rides predicted to MM and ESB**
distance_range = np.arange(0.1,5.1,0.1)
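#Assumed reconstruction: count rides within each threshold radius of MM and ESB
distance_data = pd.DataFrame([{'Distance MM':(uber_data['Distance MM']<r).sum(),
                               'Distance ESB':(uber_data['Distance ESB']<r).sum()} for r in distance_range])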
distance_data
#Shifting index
distance_data.index = distance_range
distance_data=distance_data.rename(columns={'Distance MM':'CloserToMM','Distance ESB':'CloserToESB'})
plt.figure(figsize=(12,12))
distance_data.plot(ax=plt.gca())
plt.title('Number of Rides Closer to ESB and MM')
plt.xlabel('Threshold Radius(mi)')
plt.ylabel('Rides')
**The number of rides closer to MM and ESB initially diverges, but the counts come closer together as the
threshold increases. Hence as the radius increases, the rate of people going towards MM grows faster than
that towards ESB. In other words, as we expand the radius, most of the newly discovered rides are going to
MM.**
**Now let us observe the heatmap plotted on geographical map (using folium)**
#initialize the map around NYC and set the zoom level to 10
uber_map = folium.Map(location=metro_art_coordinates,zoom_start=10)
**Let's reduce the "influence" of each point on the heatmap by using a weight of 0.5 (by default it is
1)**
uber_data['Weight']=0.5
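#Assumed helper (not in this extract): list of [lat, lon, weight] triples for the heatmap
Lat_Lon = uber_data[['Lat','Lon','Weight']].values.tolist()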
#Plotting
uber_map = folium.Map(metro_art_coordinates,zoom_start=10)
folium.plugins.HeatMap(Lat_Lon,radius=15).add_to(uber_map)
uber_map
**The plot looks easy to visualize now. Boundaries and intensity distribution is clear**
**Let's now create a HeatMap that changes with time. This will help us to visualize the number of uber
rides geographically at a given time.**
**We are plotting only the points that are in a radius of 0.25 miles from MM or ESB**
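#Assumed reconstruction: boolean DataFrame flagging rides within 0.25 mi of MM or ESB,
#using the distance columns computed earlier
i = uber_data[['Distance MM','Distance ESB']] < 0.25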
i.head(10)
#Create a boolean mask to choose the rides that satisfy the 0.25 radius threshold
i=i.any(axis=1)
i[i==True]
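#Assumed step (not in this extract): keep only the flagged rides and the columns needed for the animated heatmap
map_data = uber_data[i][['BinnedHour','Lat','Lon','Weight']]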
#Generate samples for each timestamp in "BinnedHour" (these are the points that are plotted for each
timestamp)
map_data = map_data.groupby("BinnedHour").apply(lambda x: x[['Lat','Lon','Weight']].sample(int(len(x)/3)).to_numpy().tolist())
map_data
#The index to be passed on to heatmapwithtime needs to be a time series of the following format
data_hour_index = [x.strftime("%m%d%Y, %H:%M:%S") for x in map_data.index]
#initialize map
uber_map = folium.Map(location=metro_art_coordinates,zoom_start=10)
#plotting
hm = folium.plugins.HeatMapWithTime(map_data.tolist(),index=data_hour_index)
hm.add_to(uber_map)
uber_map
uber_data
weekends = weekly_data[['Saturday','Sunday']]
weekdays = weekly_data.drop(['Saturday','Sunday'],axis=1)
weekends = weekends.mean(axis=1)
weekdays = weekdays.mean(axis=1)
weekdays_weekends = pd.concat([weekdays,weekends],axis=1)
weekdays_weekends.columns = ['Weekdays','Weekends']
weekdays_weekends
plt.figure(figsize=(15,10))
weekdays_weekends.plot(ax=plt.gca())
weekly_data.T.mean().plot(ax=plt.gca(),c = 'black',label='Net Average')
_=plt.title('Time Averaged Rides: Weekend, Weekdays, Net Average (Whole July)')
_=plt.legend()
**The Net average plot is more similar to the weekdays average because there are more weekdays than
weekends.**
**In early morning, weekends have more rides. This makes sense as people often go out at night during
the weekends.**
**The number of rides around 8 AM is lower on weekends but higher on weekdays, as that is usually the
time when people go to work. Also, on weekdays there is a surge in the number of evening rides as
people return from work.**
**Let us normalize the weekday and weekends data with their own respective sums. This will give us
an insight into the proportional data and help us answer questions like - "What percentage of rides
happened around 12AM on weekends or weekdays"?**
plt.figure(figsize=(15,10))
(weekdays_weekends/weekdays_weekends.sum()).plot(ax=plt.gca())
_=plt.title('Time Averaged Rides (Normalized) - Weekend, Weekdays')
**Nearly 1.5% of the total weekend rides happen at midnight, but only about 0.5% of the total weekday
rides do!**
**Also, nearly 2% of the total weekday rides happen around 5:30PM!**
**So far, we have made our observations by eye. Let us do a statistical T test to compare the time-
averaged rides on weekdays and weekends**
#Grouping by date and time and creating a dataset that gives the total rides every 15 mins
for_ttest = uber_data.groupby(['Date','Time']).count()['Day'].reset_index(level=1)
#Normalizing the dataset by dividing rides in each time slot on a day by the total number of rides on that day
for_ttest = pd.concat([for_ttest['Day']/uber_data.groupby(['Date']).count()['Day'],for_ttest['Time']],axis=1)
#renaming
for_ttest=for_ttest.rename(columns={'Day':'NormalizedRides'})
for_ttest
for_ttest = pd.concat([for_ttest,pd.to_datetime(for_ttest.reset_index()['Date']).dt.day_name().to_frame().set_index(for_ttest.index).rename(columns={'Date':'Day'})],axis=1)
#uber_data.groupby(['Date','Time','Day']).count().dropna().reset_index()[['Date','Day']].set_index('Date')
for_ttest
**The rides are first normalized by dividing the number of rides in each time slot by the total number
of rides on that day**
**Then they are grouped by time and split to weekend and weekdays data and a T test is applied on
them.**
**A Null hypothesis is assumed: The average ride counts are similar for each time slot on weekends
and weekdays**
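#Assumed reconstruction of the test (not in this extract): weekday vs weekend normalized rides for every time slot
for_ttest['Weekend'] = for_ttest['Day'].isin(['Saturday','Sunday'])
ttestvals = for_ttest.groupby('Time').apply(lambda x: ttest_ind(x.loc[~x['Weekend'],'NormalizedRides'],
                                                                x.loc[x['Weekend'],'NormalizedRides']))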
ttestvals=pd.DataFrame(ttestvals.to_list(),index = ttestvals.index)
ttestvals
**The t-statistic value is -11.5 around midnight! This means that the assumption (hypothesis) does not
hold at that time. The p-value is very low, hence the null hypothesis is rejected around midnight**
**If we hold a p-value threshold of 5% (confidence level = 95%), the corresponding critical t-statistic
value is 1.96**
**The time-averaged ride counts are assumed similar on weekdays and weekends if the length of a slot's bar
(its absolute t-statistic) is less than 1.96. Such values are colored in green.**
#KDE plot
plt.figure(figsize=(8,8))
ttestvals['pvalue'].plot(kind='kde',color='darkblue',ax=plt.gca())
plt.title('KDE plot - P_value')
_=plt.xlabel('p_value')
**The density peaks around p_value=0. This confirms that the time-averaged rides differ significantly
between weekdays and weekends at most time slots**
**P-value distribution:**
plt.figure(figsize=(12,10))
ax=ttestvals['pvalue'].plot(kind='line',color='black',ax=plt.gca())
plt.axhline(y=0.05,alpha=0.5,color='black',linestyle='--')
plt.locator_params(axis='x',nbins=20)
for item in plt.gca().get_xticklabels():
item.set_rotation(45)
_=plt.title('Time vs P_value')
_=plt.ylabel('P_value')
**The threshold is p = 0.05. The null hypothesis is rejected at p_values below 0.05 and retained above it**
uber_data
#create a copy
df = uber_data.copy()
#Convert datetime to float. egs: 1:15AM will be 1.25, 12:45 will be 12.75 etc
def func(x):
hr = float(x.hour)
minute = int(x.minute/15)
return hr + minute/4
df['Time']=df['Date/Time'].apply(func)
df
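#Assumed helper columns (not in this extract): day of month, numeric weekday (Monday=0)
#and a dummy column to count rides on
df['Day'] = df['Date/Time'].dt.day
df['WeekDay'] = df['Date/Time'].dt.weekday
df['DropMe'] = 1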
#count the number of rides for a given day, weekday number, time and base
df = df.groupby(['Day','WeekDay','Time','Base']).count()['DropMe'].reset_index().rename(columns={'DropMe':'Rides'})
df
#Weekends are given special emphasis, as their trends were very different from those on weekdays,
#so we add a special column indicating whether the day is a weekend or not
df['Weekend']=df.apply(lambda x: 1 if(x['WeekDay']>4) else 0,axis=1)
sns.pairplot(df,hue='Base')
plt.figure()
_=sns.jointplot(x='Rides',y='Time',data = df,hue='Base')
CHAPTER 6: RESULTS
6.1 Screenshots
i. Importing packages
ii. Data Collection
Functional tests provide systematic demonstrations that functions tested are available
as specified by the business and technical requirements, system documentation, and
user manuals.
Functional testing is centered on the following items:
Functions: Identified functions must be exercised.
Output: Identified classes of software outputs must be exercised.
Systems/Procedures: system should work properly
Integration Testing
Here in machine learning we are dealing with a dataset in spreadsheet (CSV) format, so any test case
needs to be checked against that file. Later, classification works on the respective columns of the
dataset.
Test Case 1 :
RESULTS
The program predicts the pickup location of the cab based on the centroids obtained by applying
k-means clustering, so that an appropriate cab can be scheduled for the pickup.
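The clustering code itself is not included in this extract; the sketch below shows the idea with scikit-learn's KMeans, where the number of clusters and the sample user location are illustrative assumptions.
from sklearn.cluster import KMeans
import numpy as np

coords = uber_data[['Lat','Lon']].to_numpy()                           # pickup coordinates from the July 2014 dataset
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(coords)  # k chosen for illustration only

user_location = np.array([[40.7580, -73.9855]])                        # hypothetical pickup request (around Times Square)
zone = kmeans.predict(user_location)[0]                                # nearest cluster = dispatch zone for the cab
print('Assign a cab from cluster', zone, 'centred at', kmeans.cluster_centers_[zone])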
CHAPTER 8: CONCLUSION
8.1 Conclusion
The project presents a basic outline of the trips travelled with respect to the latitude and longitude of
locations, and pinpoints the locations travelled according to the frequency of trips made by an Uber cab
during the day. The dataset is cross-analysed on the latitude and longitude of the points travelled by the
cab and then processed with k-means clustering, which classifies the locations on the basis of centroids
and orders the frequency of trips by label (cluster). Given the location supplied by the user, the
algorithm predicts the cluster nearest to that location so that a cab can be assigned to the user for
pickup.
CHAPTER 9: FUTURE ENHANCEMENTS
Future work suggests that the user will provide a location to the system. The algorithm then records
the time, latitude and longitude of the trip and assigns it to the cluster nearest to the passenger location,
where a cab is scheduled for pickup. We could also predict the passenger count in each district, in order
to deploy more cabs to the clustered coordinates, using convolutional neural networks (CNNs).
REFERENCES
[1] Poulsen, L.K., Dekkers, D., Wagenaar, N., Snijders, W., Lewinsky, B., Mukkamala, R.R. and Vatrapu, R., 2016, June. Green Cabs vs. Uber in New York City. In 2016 IEEE International Congress on Big Data (BigData Congress) (pp. 222-229). IEEE.
[2] Faghih, S.S., Safikhani, A., Moghimi, B. and Kamga, C., 2017. Predicting Short-Term Uber Demand Using Spatio-Temporal Modeling: A New York City Case Study. arXiv preprint arXiv:1712.02001.
[3] Guha, S. and Mishra, N., 2016. Clustering data streams. In Data stream management (pp. 169-187). Springer, Berlin, Heidelberg.
[4] Ahmed, M., Johnson, E.B. and Kim, B.C., 2018. The Impact of Uber and Lyft on Taxi Service Quality: Evidence from New York City. Available at SSRN 3267082.
[5] Wallsten, S., 2015. The competitive effects of the sharing economy: how is Uber changing taxis? Technology Policy Institute, 22, pp. 1-21.
[6] Sotiropoulos, D.N., Pournarakis, D.E. and Giaglis, G.M., 2016, July. A genetic algorithm approach for topic clustering: A centroid-based encoding scheme. In 2016 7th International Conference on Information, Intelligence, Systems & Applications (IISA) (pp. 1-8). IEEE.
[7] Faghih, S.S., Safikhani, A., Moghimi, B. and Kamga, C., 2019. Predicting Short-Term Uber Demand in New York City Using Spatiotemporal Modeling. Journal of Computing in Civil Engineering, 33(3), p. 05019002.
[8] Shah, D., Kumaran, A., Sen, R. and Kumaraguru, P., 2019, May. Travel Time Estimation Accuracy in Developing Regions: An Empirical Case Study with Uber Data in Delhi-NCR. In Companion Proceedings of The 2019 World Wide Web Conference (pp. 130-136). ACM.
[9] Kumar, A., Surana, J., Kapoor, M. and Nahar, P.A. CSE 255 Assignment II: Perfecting Passenger Pickups: An Uber Case Study.
[10] L. Liu, C. Andris, and C. Ratti, "Uncovering cab drivers' behaviour patterns from their digital traces," Comput. Environ. Urban Syst., vol. 34, no. 6, pp. 541-548, 2010.
[11] R.H. Hwang, Y.L. Hsueh, and Y.T. Chen, "An effective taxi recommender system based on a spatio-temporal factor analysis model," Inf. Sci., vol. 314, pp. 28-40, 2015.
[12] Vigneshwari, S., and M. Aramudhan. "Web information extraction on multiple ontologies based on concept relationships upon training the user profiles." In Artificial Intelligence and Evolutionary Algorithms in Engineering Systems, pp. 1-8. Springer, New Delhi, 2015.
[13] L. Rayle, D. Dai, N. Chan, R. Cervero, and S. Shaheen, "Just a better taxi? A survey-based comparison of taxis, transit, and ridesourcing services in San Francisco," Transport Policy, vol. 45, 01 2016.
[14] O. Flores and L. Rayle, "How cities use regulation for innovation: the case of Uber, Lyft and Sidecar in San Francisco," Transportation Research Procedia, vol. 25, pp. 3756-3768, 2017.
[15] H. A. Chaudhari, J. W. Byers, and E. Terzi, "Putting data in the driver's seat: Optimizing earnings for on-demand ride-hailing," in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 2018, pp. 90-98.
LIST OF FIGURES
3. Class Diagram
4. Sequence Diagram
5. Activity Diagram
Domain Specification
MACHINE LEARNING
Machine learning is a system that can learn from examples through self-
improvement, without being explicitly coded by a programmer. The
breakthrough comes with the idea that a machine can learn on its own
from the data (i.e., examples) to produce accurate results.
Machine learning combines data with statistical tools to predict an output.
This output is then used by businesses to derive actionable insights.
Machine learning is closely related to data mining and Bayesian
predictive modeling. The machine receives data as input and uses an
algorithm to formulate answers.
A typical machine learning task is to provide a recommendation. For
those who have a Netflix account, all recommendations of movies or
series are based on the user's historical data. Tech companies use
unsupervised learning to improve the user experience with personalized
recommendations.
Machine learning is also used for a variety of tasks such as fraud detection,
predictive maintenance, portfolio optimization and task automation.
Machine Learning vs. Traditional Programming
Traditional programming differs significantly from machine learning. In
traditional programming, a programmer codes all the rules in consultation
with an expert in the industry for which the software is being developed.
Each rule is based on a logical foundation; the machine executes an
output following the logical statements. When the system grows complex,
more rules need to be written, and the program can quickly become
unsustainable to maintain.
Machine Learning
How does Machine learning work?
Machine learning is the brain where all the learning takes place. The way
the machine learns is similar to the way a human being learns. Humans learn from
experience: the more we know, the more easily we can predict. By
analogy, when we face an unknown situation, the likelihood of success is
lower than in a known situation. Machines are trained the same way. To make
an accurate prediction, the machine sees examples. When we give the
machine a similar example, it can figure out the outcome. However, like a
human, if it is fed a previously unseen example, the machine has
difficulty predicting.
The core objectives of machine learning are learning and inference.
First of all, the machine learns through the discovery of patterns. This
discovery is made thanks to the data. One crucial part of the data scientist's
job is to choose carefully which data to provide to the machine. The list of
attributes used to solve a problem is called a feature vector. You can
think of a feature vector as a subset of data that is used to tackle a
problem.
The machine uses some fancy algorithms to simplify the reality and
transform this discovery into a model. Therefore, the learning stage is
used to describe the data and summarize it into a model.
Define a question
Collect data
Visualize data
Train algorithm
Test the Algorithm
Collect feedback
Refine the algorithm
Loop 4-7 until the results are satisfying
Use the model to make a prediction
Once the algorithm gets good at drawing the right conclusions, it applies
that knowledge to new sets of data.
Machine learning Algorithms and where they are used?
Classification task
Regression task
Classification
Imagine you want to predict the gender of a customer for a commercial.
You will start gathering data on the height, weight, job, salary,
purchasing basket, etc. from your customer database. You know the
gender of each of your customers; it can only be male or female. The
objective of the classifier is to assign a probability of being a male
or a female (i.e., the label) based on the information (i.e., the features you
have collected). When the model has learned how to recognize male or
female, you can use new data to make a prediction. For instance, you just
got new information from an unknown customer, and you want to know
if it is a male or female. If the classifier predicts male = 70%, it means the
algorithm is 70% sure that this customer is a male, and 30% sure it is a
female.
The label can have two or more classes. The above example has only two
classes, but if a classifier needs to predict objects, it may have dozens of classes
(e.g., glass, table, shoes, etc., where each object represents a class).
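A minimal sketch of such a probabilistic classifier with scikit-learn; the feature values and names below are invented for illustration.
from sklearn.linear_model import LogisticRegression

# hypothetical customer features: [height_cm, weight_kg, salary_k]
X = [[180, 85, 60], [165, 60, 45], [175, 80, 70], [158, 52, 40]]
y = ['male', 'female', 'male', 'female']            # known labels

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba([[170, 72, 55]])[0]        # a new, unknown customer
print(dict(zip(clf.classes_, probs)))                # e.g. roughly {'female': 0.3, 'male': 0.7}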
Regression
When the output is a continuous value, the task is a regression. For
instance, a financial analyst may need to forecast the value of a stock
based on a range of features such as equity, previous stock performance and
macroeconomic indices. The system is trained to estimate the price
of the stocks with the lowest possible error.
Unsupervised learning
In unsupervised learning, an algorithm explores input data without being
given an explicit output variable (e.g., explores customer demographic
data to identify patterns)
You can use it when you do not know how to classify the data, and you
want the algorithm to find patterns and classify the data for you
K-means clustering (a clustering algorithm): puts data into some number of groups (k)
that each contain data with similar characteristics, as determined by the
model rather than in advance by humans.
Automation:
Finance Industry
Government organization
Healthcare industry
Marketing
Deep Learning
Deep learning is computer software that mimics the network of neurons
in a brain. It is a subset of machine learning and is called deep learning
because it makes use of deep neural networks. The machine uses different
layers to learn from the data. The depth of the model is represented by the
number of layers in the model. Deep learning is the new state of the art in
terms of AI. In deep learning, the learning phase is done through a neural
network.
Reinforcement Learning
Reinforcement learning is a subfield of machine learning in which
systems are trained by receiving virtual "rewards" or "punishments,"
essentially learning by trial and error. Google's DeepMind has used
reinforcement learning to beat a human champion at the game of Go.
Reinforcement learning is also used in video games to improve the
gaming experience by providing smarter bots.
Some of the most famous algorithms are:
Q-learning
Deep Q network
State-Action-Reward-State-Action (SARSA)
Deep Deterministic Policy Gradient (DDPG)
With machine learning, you need less data to train the algorithm than
with deep learning. Deep learning requires an extensive and diverse set of data
to identify the underlying structure. Besides, machine learning provides a
faster-trained model; the most advanced deep learning architectures can take
days to a week to train. The advantage of deep learning over machine
learning is that it is highly accurate. You do not need to understand which
features are the best representation of the data; the neural network learns
how to select the critical features. In machine learning, you need to choose
for yourself which features to include in the model.
TensorFlow
The most famous deep learning library in the world is Google's
TensorFlow. Google uses machine learning in all of its products
to improve search, translation, image captioning and
recommendations.
To give a concrete example, Google users can experience a faster and
more refined search with AI. If the user types a keyword in the search
bar, Google provides a recommendation about what the next word could
be.
Google wants to use machine learning to take advantage of their massive
datasets to give users the best experience. Three different groups use
machine learning:
Researchers
Data scientists
Programmers.
They can all use the same toolset to collaborate with each other and
improve their efficiency.
Google does not just have any data; it has some of the world's most massive
computing infrastructure, so TensorFlow was built to scale. TensorFlow is a library
developed by the Google Brain Team to accelerate machine learning and
deep neural network research.
It was built to run on multiple CPUs or GPUs and even mobile operating
systems, and it has wrappers in several languages such as Python, C++
and Java.
TensorFlow Architecture
TensorFlow's architecture works in three parts:
You can train it on multiple machines then you can run it on a different
machine, once you have the trained model.
The model can be trained and used on GPUs as well as CPUs. GPUs were
initially designed for video games. In late 2010, Stanford researchers
found that GPUs were also very good at matrix operations and algebra,
which makes them very fast for these kinds of calculations. Deep
learning relies on a lot of matrix multiplication. TensorFlow is very fast
at computing matrix multiplications because it is written in C++.
Although it is implemented in C++, TensorFlow can be accessed and
controlled by other languages, mainly Python.
Finally, a significant feature of TensorFlow is TensorBoard, which
enables you to monitor graphically and visually what TensorFlow is doing.
List of Prominent Algorithms supported by TensorFlow
PYTHON OVERVIEW
Python is a high-level, interpreted, interactive and object-oriented
scripting language. Python is designed to be highly readable. It uses
English keywords frequently whereas other languages use punctuation,
and it has fewer syntactical constructions than other languages.
Python is Interpreted: Python is processed at runtime by the
interpreter. You do not need to compile your program before
executing it. This is similar to PERL and PHP.
Python is Interactive: You can actually sit at a Python prompt and
interact with the interpreter directly to write your programs.
Python is Object-Oriented: Python supports Object-Oriented style
or technique of programming that encapsulates code within objects.
Python is a Beginner's Language: Python is a great language for
beginner-level programmers and supports the development of a wide
range of applications, from simple text processing to WWW browsers
to games.
History of Python
Python was developed by Guido van Rossum in the late eighties and
early nineties at the National Research Institute for Mathematics and
Computer Science in the Netherlands.
Python Features
Python's features include:
Apart from the above-mentioned features, Python has a big list of good
features, a few of which are listed below:
It supports functional and structured programming methods as well
as OOP.
It can be used as a scripting language or can be compiled to byte-code
for building large applications.
It provides very high-level dynamic data types and supports dynamic
type checking.
It supports automatic garbage collection.
It can be easily integrated with C, C++, COM, ActiveX, CORBA, and
Java.
ANACONDA NAVIGATOR
Anaconda Navigator is a desktop graphical user interface (GUI) included
in Anaconda distribution that allows you to launch applications and easily
manage conda packages, environments and channels without using
command-line commands. Navigator can search for packages on
Anaconda Cloud or in a local Anaconda Repository. It is available for
Windows, macOS and Linux.
Why use Navigator?
In order to run, many scientific packages depend on specific versions
of other packages. Data scientists often use multiple versions of
many packages, and use multiple environments to separate these different
versions.
The command line program conda is both a package manager and an
environment manager, to help data scientists ensure that each version of
each package has all the dependencies it requires and works correctly.
Navigator is an easy, point-and-click way to work with packages and
environments without needing to type conda commands in a terminal
window. You can use it to find the packages you want, install them in an
environment, run the packages and update them, all inside Navigator.
WHAT APPLICATIONS CAN I ACCESS USING NAVIGATOR?
The following applications are available by default in Navigator:
Jupyter Lab
Jupyter Notebook
QT Console
Spyder
VS Code
Glue viz
Orange 3 App
Rodeo
RStudio
Advanced conda users can also build their own Navigator applications.
How can I run code with Navigator?
The simplest way is with Spyder. From the Navigator Home tab, click
Spyder, and write and execute your code.
You can also use Jupyter Notebooks the same way. Jupyter Notebooks
are an increasingly popular system that combine your code, descriptive
text, output, images and interactive interfaces into a single notebook file
that is edited, viewed and used in a web browser.
What’s new in 1.9?
Add support for Offline Mode for all environment related actions.
Add support for custom configuration of main windows links.
Numerous bug fixes and performance enhancements.