Uber Data Analysis Using Machine Learning
ABSTRACT
This paper demonstrates how an Uber dataset covering Uber pickups in New
York City can be analysed. Uber is classified as a peer-to-peer (P2P) platform: the service
connects riders with drivers who drive them to their desired location. The dataset
contains primary data on Uber pickups, including the date,
time, and longitude-latitude coordinates of each trip. The paper uses this information to describe
how to utilise the k-means clustering algorithm to categorise the various sections of
New York City. Because the sector is expanding and is projected to continue growing
in the near future, effective taxi dispatching will allow each driver and passenger to
spend less time looking for each other. The model is used to forecast demand at
various locations throughout the city.
CHAPTER 1: INTRODUCTION
Introduction:-
Uber connects you with drivers who can drive you to your desired location or
destination. This dataset contains primary data on Uber pickups in New York City,
including the date, time of trip, and longitude and latitude information. Uber operates
in over 900 metropolitan regions globally. Applying the k-means clustering algorithm
to these records lets us predict the frequency of trips in different parts of the city.
Title:- “How cities use regulation for innovation: the case of Uber, Lyft and Sidecar in San
Francisco”
Year:- 2017
Authors:- Onesimo Flores, Lisa Rayle
Abstract:-
How do government actors facilitate or hinder private innovation in urban mobility,
and how does local context mediate this relationship? In this paper we examine the
regulatory response to on-demand ride services—or “ridesourcing”—through a case
study of San Francisco, CA. The entry of Lyft, Sidecar, and UberX in San Francisco
in 2012 raised serious questions about the legality of ridesourcing, and sparked
significant conflict within regulatory agencies. After sustained debate, regulators
decided to welcome the services provided by new companies and crafted a new
regulatory framework that legalized the provision of for-profit, on-demand ride
services using personal vehicles. We ask, given strong arguments on each side, what
motivated public officials in each city to facilitate, rather than hinder, the new
services? How did they achieve regulatory reform?
3.2 Drawbacks
Existing approaches suffer from forecasting errors and a risk of overfitting on large
datasets, so the analysis delivered to the company ends up being inefficient and
ineffective. To overcome this problem, we predict cab pickups from clusters of
coordinate points obtained by applying the k-means clustering algorithm.
3.4 Advantages
Hardware:
OS – Windows 7, 8 or 10 (32- and 64-bit)
RAM – 4 GB
Software:
Anaconda Navigator
Jupyter Notebook (Python language)
A feasibility study here means a practical assessment of implementing the proposed system. For a
machine learning project, we generally collect the input data from online sources, filter it,
visualise it in graphical form, and then divide it into training and testing sets.
The training and testing data are given to the algorithms to make predictions.
1. First, we take the dataset.
2. Filter the dataset according to requirements and create a new dataset whose
attributes match the analysis to be done.
3. Perform pre-processing on the dataset.
4. Split the data into training and testing sets.
5. Train the model with the training data, then analyse the testing dataset with the
classification algorithm.
6. Finally, you will get the results as accuracy metrics (a minimal sketch of steps 3-6 follows this list).
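The following sketch illustrates steps 3 to 6 with scikit-learn; the file name, feature columns and the 'label' column are placeholders rather than the actual project data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv('dataset.csv')                      # hypothetical filtered dataset (steps 1-2)
X = data.drop(columns=['label'])                       # attributes chosen for the analysis
y = data['label']                                      # target attribute

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # step 4

model = RandomForestClassifier(random_state=42)        # step 5: train on the training data
model.fit(X_train, y_train)

print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))  # step 6: accuracy metric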
DATA COLLECTION
Data collection is the process of gathering information from many sources,
which is later used to develop the machine learning models. The data should be
stored in a way that makes sense for the problem. In this step the dataset is converted
into an understandable format that can be fed into machine learning models.
The data used in this paper is the Uber pickup dataset for New York City described above. This step
is concerned with selecting the subset of all available data that you will be
working with. ML problems start with data, preferably lots of data (examples or
observations), for which you already know the target answer. Data for which you
already know the target answer is called labelled data.
DATA PRE-PROCESSING
Organize your selected data by formatting, cleaning and sampling from it.
Three common data pre-processing steps are:
Formatting: The data you have selected may not be in a format that is suitable for you
to work with. The data may be in a relational database and you would like it in a flat
file, or the data may be in a proprietary file format and you would like it in a
relational database or a text file.
Cleaning: Cleaning data is the removal or fixing of missing data. There may be data
instances that are incomplete and do not carry the data you believe you need to
address the problem. These instances may need to be removed. Additionally, there
may be sensitive information in some of the attributes and these attributes may need
to be anonymized or removed from the data entirely.
Sampling: There may be far more selected data available than you need to work with.
More data can result in much longer running times for algorithms and larger
computational and memory requirements. You can take a smaller representative
sample of the selected data that may be much faster for exploring and prototyping
solutions before considering the whole dataset.
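A short, hedged pandas illustration of these three steps (the file and column names are examples, not the project's actual data):
import pandas as pd

raw = pd.read_csv('raw_data.csv')                              # formatting: load the source into a flat DataFrame
clean = raw.dropna(subset=['Date/Time', 'Lat', 'Lon'])         # cleaning: drop incomplete instances
clean = clean.drop(columns=['DriverName'], errors='ignore')    # cleaning: remove a (hypothetical) sensitive attribute
sample = clean.sample(frac=0.1, random_state=1)                # sampling: smaller representative subset for prototyping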
FEATURE EXTRACTION
The next step is feature extraction, which is an attribute-reduction process. Unlike feature
selection, which ranks the existing attributes according to their predictive
significance, feature extraction actually transforms the attributes. The transformed
attributes, or features, are linear combinations of the original attributes. Finally, our
models are trained using a classifier algorithm. We use the classify module of the Natural
Language Toolkit (NLTK) library in Python on the labelled dataset we gathered. The rest of
our labelled data is used to evaluate the models. Machine learning
algorithms are then used to classify the pre-processed data; the chosen classifier is
Random Forest. These algorithms are very popular in classification tasks.
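As a hedged illustration of this idea, the sketch below uses PCA to build linear-combination features from toy data and then trains the Random Forest classifier on them; it is not the project's exact pipeline.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # toy attribute matrix (200 observations, 10 attributes)
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy labels

pca = PCA(n_components=3)                      # extracted features = linear combinations of the originals
X_features = pca.fit_transform(X)

clf = RandomForestClassifier(random_state=0).fit(X_features, y)   # train the chosen classifier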
EVALUATION MODEL
Model evaluation is an integral part of the model development process. It helps to
find the model that best represents our data and indicates how well the chosen model will work
in the future. Evaluating model performance with the data used for training is not
acceptable in data science because it can easily generate overoptimistic and overfitted
models. There are two common methods of evaluating models in data science, hold-out and
cross-validation. To avoid overfitting, both methods use a test set (not seen by the
model) to evaluate model performance.
The performance of each classification model is estimated based on its average score. The result
is presented in visual form, with the classified data represented as graphs.
Accuracy is defined as the percentage of correct predictions for the test data. It can be
calculated easily by dividing the number of correct predictions by the total number of
predictions.
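For example, with five test predictions of which four are correct:
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]                  # labels of the test data
y_pred = [1, 0, 0, 1, 0]                  # model predictions
print(accuracy_score(y_true, y_pred))     # 4 correct / 5 total = 0.8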
4.3 UML Diagrams
The Unified Modeling Language (UML) is used to specify, visualize, modify,
construct and document the artifacts of an object-oriented software intensive system
under development. UML offers a standard way to visualize a system's architectural
blueprints, including elements such as:
actors
business processes
(logical) components
activities
programming language statements
database schemas, and
Reusable software components.
UML combines best techniques from data modeling (entity relationship diagrams),
business modeling (work flows), object modeling, and component modeling. It can be
used with all processes, throughout the software development life cycle, and across
different implementation technologies. UML has synthesized the notations of the
Booch method, the Object-modeling technique (OMT) and Object-oriented software
engineering (OOSE) by fusing them into a single, common and widely usable
modeling language. UML aims to be a standard modeling language which can model
concurrent and distributed systems.
Sequence Diagram:
Sequence diagrams represent the objects participating in an interaction horizontally
and time vertically. A use case is a kind of behavioral classifier that represents a
declaration of an offered behavior. Each use case specifies some behavior, possibly
including variants that the subject can perform in collaboration with one or more
actors. Use cases define the offered behavior of the subject without reference to its
internal structure. These behaviors, involving interactions between the actor and the
subject, may result in changes to the state of the subject and communications with its
environment. A use case can include possible variations of its basic behavior,
including exceptional behavior and error handling.
Activity Diagrams-:
Activity diagrams are graphical representations of Workflows of stepwise activities
and actions with support for choice, iteration and concurrency. In the Unified
Modeling Language, activity diagrams can be used to describe the business and
operational step-by-step workflows of components in a system. An activity diagram
shows the overall flow of control.
Usecase diagram:
UML is a standard language for specifying, visualizing, constructing, and
documenting the artifacts of software systems.
UML was created by Object Management Group (OMG) and UML 1.0 specification
draft was proposed to the OMG in January 1997.
OMG is continuously putting in effort to make UML a true industry standard.
UML stands for Unified Modeling Language.
UML is a pictorial language used to make software blueprints.
Class diagram
The class diagram is the main building block of object-oriented modeling. It is used
for general conceptual modeling of the structure of the application, and for detailed
modeling that translates the models into programming code. Class diagrams can also be
used for data modeling.[1] The classes in a class diagram represent both the main
elements and interactions in the application, and the classes to be programmed.
In the diagram, classes are represented with boxes that contain three compartments:
The top compartment contains the name of the class. It is printed in bold and centered,
and the first letter is capitalized.
The middle compartment contains the attributes of the class. They are left-aligned and
the first letter is lowercase.
The bottom compartment contains the operations the class can execute. They are also
left-aligned and the first letter is lowercase.
!pip3 -q install numpy pandas matplotlib seaborn geopy folium scipy scikit-learn tensorflow  #datetime is in the standard library; scikit-learn is the installable package name
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import geopy.distance
from math import radians,cos,sin,asin,sqrt
import folium
import datetime
from folium.plugins import HeatMap
from scipy.stats import ttest_ind
matplotlib.rcParams.update({'font.size': 12})
!ls ../input/uber-pickups-in-new-york-city/uber-raw-data-jul14.csv
uber_data = pd.read_csv('../input/uber-pickups-in-new-york-city/uber-raw-data-jul14.csv')
**The type is str! Let's convert it to datetime format for easy indexing**
uber_data['Date/Time'] = pd.to_datetime(uber_data['Date/Time'])
**Let us divide each hour in existing Date/Time column into four smaller bins of 15 mins each:**
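**(The binning step itself is not shown in this extract; a minimal assumed reconstruction:)**
#Assumed reconstruction: floor each timestamp to its 15-minute bin
uber_data['BinnedHour'] = uber_data['Date/Time'].dt.floor('15min')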
**Let us visualize the total uber rides per day in the month of July 2014**
plt.figure(figsize=(15,8))
uber_data['BinnedHour'].dt.day.value_counts().sort_index().plot(kind='bar',color='green')
for item in plt.gca().get_xticklabels():
item.set_rotation(45)
plt.title('Uber Rides per day in July 2014 at NYC')
plt.xlabel('Days')
_=plt.ylabel('Rides')
**Observe the nearly recurring pattern in the data! It is very noticeable after day 11.**
**Let us have a closer look at it, say every 15 minutes from July 1 to July 31.**
plt.figure(figsize=(15,8))
uber_data['BinnedHour'].value_counts().sort_index().plot(c='darkblue',alpha=0.8)
plt.title('Uber Rides every 15 mins in the month of July at NYC')
plt.xlabel('Days')
_=plt.ylabel('No. of Rides')
**The underlying trend is clearly visible now. It conveys that in a day there are times when the pickups
are very low and very high, and they seem to follow a pattern.**
**Q) Which times correspond to the highest and lowest peaks in the plot above?**
uber_data['BinnedHour'].value_counts()
**The highest peak corresponds to the time 19:15(7:15 PM), 15th July 2014 and has a ride count of
915 and the lowest peak corresponds to the time 02:30, 7th July 2014 and has a ride count of 10**
**Now, Lets visualize the week wise trends in the data. For it, we have to map each date into its day
name using a dictionary**
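**(The date/day columns are not shown in this extract; an assumed reconstruction using pandas' built-in day names instead of an explicit dictionary:)**
#Assumed reconstruction of the Date and Day columns
uber_data['Date'] = uber_data['BinnedHour'].dt.date
uber_data['Day'] = uber_data['BinnedHour'].dt.day_name()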
#Defining ordered category of week days for easy sorting and visualization
uber_data['Day']=pd.Categorical(uber_data['Day'],categories=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],ordered=True)
#Separating time from the "BinnedHour" Column
uber_data['Time']=uber_data['BinnedHour'].dt.time
weekly_data = uber_data.groupby(['Date','Day','Time']).count().dropna().rename(columns={'BinnedHour':'Rides'})['Rides'].reset_index()
weekly_data.head(10)
**Grouping weekly_data by days to plot total rides per week in july 2014.**
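#Assumed reconstruction (not shown in this extract): total rides per weekday
daywise = weekly_data.groupby('Day')['Rides'].sum()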
#Creating a customized color palette for custom hue according to height of bars
vals = daywise.to_numpy().ravel()
normalized = (vals - np.min(vals)) / (np.max(vals) - np.min(vals))
indices = np.round(normalized * (len(vals) - 1)).astype(np.int32)
palette = sns.color_palette('Reds', len(vals))
colorPal = np.array(palette).take(indices, axis=0)
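#Assumed plotting step (not shown in this extract): bar chart of total rides per weekday, coloured with colorPal
plt.figure(figsize=(12,6))
daywise.plot(kind='bar',color=list(colorPal))
_=plt.title('Total Uber rides per weekday in July 2014 at NYC')
_=plt.ylabel('Rides')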
**According to the bar plot above, rides are maximum on Thursdays and minimum on Sundays.
Sundays having the lowest number of rides makes sense logically, as it's a holiday and people often
take rest on that day.**
weekly_data = weekly_data.groupby(['Day','Time'])['Rides'].mean().unstack(level=0) #pivot to a Time x Day grid for the heatmap and later day-wise selection
weekly_data.head(10)
plt.figure(figsize=(15,15))
sns.heatmap(weekly_data,cmap='Greens')
_=plt.title('Heatmap of average rides in time vs day grid')
**The heatmap indicates that the maximum average uber rides occur around 5:30PM to 6:15PM on
Wednesdays and Thursdays and their values fall between 550 to 620.**
plt.figure(figsize=(15,12))
weekly_data.plot(ax=plt.gca())
_=plt.title('Average rides per day vs time')
_=plt.ylabel('Average rides')
plt.locator_params(axis='x', nbins=10)
plt.figure(figsize=(15,10))
weekly_data.T.mean().plot(c = 'black')
_=plt.title('Average uber rides on any day in July 2014 at NYC')
plt.locator_params(axis='x', nbins=10)
**This plot further confirms that the average rides on any given day is lowest around 2 AM and highest
in the around 5:30 PM.**
**Now, let's try visualizing the relationship between Base and total number of rides in July 2014:**
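**(The plotting code for this step is not in this extract; an assumed reconstruction using the raw base codes — the names "Weiter" and "Danach-NY" quoted below presumably come from a code-to-name mapping of those bases:)**
plt.figure(figsize=(10,6))
uber_data['Base'].value_counts().plot(kind='bar',color='purple')
_=plt.title('Total rides per Base in July 2014')
_=plt.xlabel('Base')
_=plt.ylabel('Rides')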
**The above plot tells us that most uber rides originated from Weiter Base and least from Danach-
NY**
**To know more about the distribution of latitudes and longitudes, let's plot their histograms along
with KDEs**
plt.figure(figsize=(10,10))
sns.histplot(uber_data['Lat'], bins='auto',kde=True,color='r',alpha=0.4,label = 'latitude')
plt.legend(loc='upper right')
plt.xlabel('Latitude')
plt.twiny()
sns.histplot(uber_data['Lon'], bins='auto',kde=True,color='g',alpha=0.4,label = 'longitude')
_=plt.legend(loc='upper left')
_=plt.xlabel('Longitude')
_=plt.title('Distribution of Latitude and Longitude')
**Most latitudes are around 40.75, and longitudes around -73.98. This is expected as the dataset comprises
information only around New York City. It also indicates that most rides happen around (lat,lon) =
(40.75,-73.98)**
plt.figure(figsize=(12,12))
sns.scatterplot(x='Lat',y='Lon',data=uber_data,edgecolor='None',alpha=0.5,color='darkblue')
plt.xlabel('Latitude')
plt.ylabel('Longitude')
_=plt.title('Latitude - Longitude Scatter Plot')
**The dark blue area in the center shows the regions in New York City that had most number of uber
rides in July 2014. The plot is better understood when a geographical map is placed underneath**
**Let's use geopy to calculate the distance between the Metropolitan Museum and the Empire State
Building**
#This is an example of using geopy
metro_art_coordinates = (40.7794,-73.9632)
empire_state_building_coordinates = (40.7484,-73.9857)
distance = geopy.distance.distance(metro_art_coordinates,empire_state_building_coordinates)
print("Distance = ",distance)
**Using geopy on a larger dataset may be time consuming on slower PC's. Hence let's use the
haversine method**
def haversine(coordinates1,coordinates2):
    #convert the coordinate pairs from degrees to radians
    lat1,lon1 = radians(coordinates1[0]),radians(coordinates1[1])
    lat2,lon2 = radians(coordinates2[0]),radians(coordinates2[1])
    #differences in latitude and longitude
    dlat = lat2-lat1
    dlon = lon2-lon1
    a = sin(dlat/2)**2 + cos(lat1)*cos(lat2)*sin(dlon/2)**2
    c = 2*asin(sqrt(a))
    r = 3956 #Earth radius in miles
    return c*r
print("Distance (mi) = ",haversine(metro_art_coordinates,empire_state_building_coordinates))
**Now, let's try to predict which place each ride is closer to, MM or ESB. This can be done by
individually calculating the distance between each uber ride's coordinates and the MM or ESB coordinates.
If a ride lies within a particular threshold radius of MM, then we can predict that the ride is
going to MM. Similarly for ESB.**
#Now, let's keep a threshold of 0.25 miles and calculate the number of points that are closer to MM and ESB
#according to this threshold
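#Assumed reconstruction (not in this extract): distance of every ride from MM and ESB, then counts within 0.25 mi
#(row-wise apply is slow on ~800k rows but keeps the sketch simple)
uber_data['Distance MM'] = uber_data.apply(lambda row: haversine((row['Lat'],row['Lon']),metro_art_coordinates),axis=1)
uber_data['Distance ESB'] = uber_data.apply(lambda row: haversine((row['Lat'],row['Lon']),empire_state_building_coordinates),axis=1)
print('Rides within 0.25 mi of MM :',(uber_data['Distance MM']<0.25).sum())
print('Rides within 0.25 mi of ESB:',(uber_data['Distance ESB']<0.25).sum())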
**The result above shows the number of rides predicted to MM and ESB**
distance_range = np.arange(0.1,5.1,0.1)
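#Assumed reconstruction: count rides within each threshold radius of MM and ESB
distance_data = pd.DataFrame([{'Distance MM':(uber_data['Distance MM']<r).sum(),
                               'Distance ESB':(uber_data['Distance ESB']<r).sum()} for r in distance_range])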
distance_data
#Shifting index
distance_data.index = distance_range
distance_data=distance_data.rename(columns={'Distance MM':'CloserToMM','Distance ESB':'CloserToESB'})
plt.figure(figsize=(12,12))
distance_data.plot(ax=plt.gca())
plt.title('Number of Rides Closer to ESB and MM')
plt.xlabel('Threshold Radius(mi)')
plt.ylabel('Rides')
**The number of rides closer to MM and ESB initially diverges, but the counts come closer together as the
threshold increases. Hence as the radius increases, the rate of people going towards MM grows faster than
that towards ESB. In other words, as we expand the radius, most of the newly discovered rides are going to
MM.**
**Now let us observe the heatmap plotted on geographical map (using folium)**
#initialize the map around NYC and set the zoom level to 10
uber_map = folium.Map(location=metro_art_coordinates,zoom_start=10)
**Let's reduce the "influence" of each point on the heatmap by using a weight of 0.5 (by default it is
1)**
uber_data['Weight']=0.5
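#Assumed helper (not in this extract): list of [lat, lon, weight] triples for the heatmap
Lat_Lon = uber_data[['Lat','Lon','Weight']].values.tolist()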
#Plotting
uber_map = folium.Map(metro_art_coordinates,zoom_start=10)
folium.plugins.HeatMap(Lat_Lon,radius=15).add_to(uber_map)
uber_map
**The plot looks easy to visualize now. Boundaries and intensity distribution is clear**
**Let's now create a HeatMap that changes with time. This will help us to visualize the number of uber
rides geographically at a given time.**
**We are plotting only the points that are in a radius of 0.25 miles from MM or ESB**
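#Assumed reconstruction: boolean DataFrame flagging rides within 0.25 mi of MM or ESB,
#using the distance columns computed earlier
i = uber_data[['Distance MM','Distance ESB']] < 0.25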
i.head(10)
#Create a boolean mask to choose the rides that satisfy the 0.25 radius threshold
i=i.any(axis=1)
i[i==True]
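#Assumed step (not in this extract): keep only the flagged rides and the columns needed for the animated heatmap
map_data = uber_data[i][['BinnedHour','Lat','Lon','Weight']]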
#Generate samples for each timestamp in "BinnedHour" (these are the points that are plotted for each
timestamp)
map_data = map_data.groupby("BinnedHour").apply(lambda x: x[['Lat','Lon','Weight']].sample(int(len(x)/3)).to_numpy().tolist())
map_data
#The index to be passed on to heatmapwithtime needs to be a time series of the following format
data_hour_index = [x.strftime("%m%d%Y, %H:%M:%S") for x in map_data.index]
#initialize map
uber_map = folium.Map(location=metro_art_coordinates,zoom_start=10)
#plotting
hm = folium.plugins.HeatMapWithTime(map_data.tolist(),index=data_hour_index)
hm.add_to(uber_map)
uber_map
uber_data
weekends = weekly_data[['Saturday','Sunday']]
weekdays = weekly_data.drop(['Saturday','Sunday'],axis=1)
weekends = weekends.mean(axis=1)
weekdays = weekdays.mean(axis=1)
weekdays_weekends = pd.concat([weekdays,weekends],axis=1)
weekdays_weekends.columns = ['Weekdays','Weekends']
weekdays_weekends
plt.figure(figsize=(15,10))
weekdays_weekends.plot(ax=plt.gca())
weekly_data.T.mean().plot(ax=plt.gca(),c = 'black',label='Net Average')
_=plt.title('Time Averaged Rides: Weekend, Weekdays, Net Average (Whole July)')
_=plt.legend()
**The Net average plot is more similar to the weekdays average because there are more weekdays than
weekends.**
**In early morning, weekends have more rides. This makes sense as people often go out at night during
the weekends.**
**The number of rides around 8 AM is lower on weekends but higher on weekdays, as that is usually the
time when people go to work. Also, on weekdays there is a surge in the number of evening rides as
people return from work.**
**Let us normalize the weekday and weekends data with their own respective sums. This will give us
an insight into the proportional data and help us answer questions like - "What percentage of rides
happened around 12AM on weekends or weekdays"?**
plt.figure(figsize=(15,10))
(weekdays_weekends/weekdays_weekends.sum()).plot(ax=plt.gca())
_=plt.title('Time Averaged Rides (Normalized) - Weekend, Weekdays')
**Nearly 1.5% of the total weekend rides happen at midnight, but only about 0.5% of the total weekday
rides do!**
**Also, nearly 2% of the total weekday rides happen around 5:30PM!**
**So far, we have made our observations by eye. Let us do a statistical T test to compare the time-
averaged rides on weekdays and weekends**
#Grouping by date and time and creating a dataset that gives the total rides every 15 mins
for_ttest = uber_data.groupby(['Date','Time']).count()['Day'].reset_index(level=1)
#Normalizing the dataset by dividing rides in each time slot on a day by the total number of rides on that day
for_ttest = pd.concat([for_ttest['Day']/uber_data.groupby(['Date']).count()['Day'],for_ttest['Time']],axis=1)
#renaming
for_ttest=for_ttest.rename(columns={'Day':'NormalizedRides'})
for_ttest
for_ttest = pd.concat([for_ttest,pd.to_datetime(for_ttest.reset_index()['Date']).dt.day_name().to_frame().set_index(for_ttest.index).rename(columns={'Date':'Day'})],axis=1)
#uber_data.groupby(['Date','Time','Day']).count().dropna().reset_index()[['Date','Day']].set_index('Date')
for_ttest
**The rides are first normalized by dividing the number of rides in each time slot by the total number
of rides on that day**
**Then they are grouped by time and split to weekend and weekdays data and a T test is applied on
them.**
**A Null hypothesis is assumed: The average ride counts are similar for each time slot on weekends
and weekdays**
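#Assumed reconstruction of the test (not in this extract): weekday vs weekend normalized rides for every time slot
for_ttest['Weekend'] = for_ttest['Day'].isin(['Saturday','Sunday'])
ttestvals = for_ttest.groupby('Time').apply(lambda x: ttest_ind(x.loc[~x['Weekend'],'NormalizedRides'],
                                                                x.loc[x['Weekend'],'NormalizedRides']))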
ttestvals=pd.DataFrame(ttestvals.to_list(),index = ttestvals.index)
ttestvals
**The t-statistic value is -11.5 around midnight! This means that the assumption (hypothesis) does not
hold at that time. The p-value is very low, hence the null hypothesis is rejected around midnight**
**If we hold a p-value threshold of 5% (confidence level = 95%), the corresponding critical t-statistic
value is 1.96**
**The time-averaged ride counts are assumed similar on weekdays and weekends if the length of a slot's bar
(its absolute t-statistic) is less than 1.96. Such values are colored in green.**
#KDE plot
plt.figure(figsize=(8,8))
ttestvals['pvalue'].plot(kind='kde',color='darkblue',ax=plt.gca())
plt.title('KDE plot - P_value')
_=plt.xlabel('p_value')
**The density peaks around p_value=0. This confirms that the time-averaged rides differ significantly
between weekdays and weekends at most time slots**
**P-value distribution:**
plt.figure(figsize=(12,10))
ax=ttestvals['pvalue'].plot(kind='line',color='black',ax=plt.gca())
plt.axhline(y=0.05,alpha=0.5,color='black',linestyle='--')
plt.locator_params(axis='x',nbins=20)
for item in plt.gca().get_xticklabels():
item.set_rotation(45)
_=plt.title('Time vs P_value')
_=plt.ylabel('P_value')
**The threshold is p = 0.05. The null hypothesis is rejected at p_values below 0.05 and retained above it**
uber_data
#create a copy
df = uber_data.copy()
#Convert datetime to float. egs: 1:15AM will be 1.25, 12:45 will be 12.75 etc
def func(x):
hr = float(x.hour)
minute = int(x.minute/15)
return hr + minute/4
df['Time']=df['Date/Time'].apply(func)
df
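#Assumed helper columns (not in this extract): day of month, numeric weekday (Monday=0)
#and a dummy column to count rides on
df['Day'] = df['Date/Time'].dt.day
df['WeekDay'] = df['Date/Time'].dt.weekday
df['DropMe'] = 1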
#count the number of rides for a given day, weekday number, time and base
df = df.groupby(['Day','WeekDay','Time','Base']).count()['DropMe'].reset_index().rename(columns={'DropMe':'Rides'})
df
#Weekends are given special emphasis, as their trends were very different from those on weekdays,
#so we add a special column indicating whether the day is a weekend or not
df['Weekend']=df.apply(lambda x: 1 if(x['WeekDay']>4) else 0,axis=1)
sns.pairplot(df,hue='Base')
plt.figure()
_=sns.jointplot(x='Rides',y='Time',data = df,hue='Base')
CHAPTER 6: RESULTS
6.1 Screenshots
i. Importing packages
ii. Data Collection
Functional tests provide systematic demonstrations that functions tested are available
as specified by the business and technical requirements, system documentation, and
user manuals.
Functional testing is centered on the following items:
Functions: Identified functions must be exercised.
Output: Identified classes of software outputs must be exercised.
Systems/Procedures: system should work properly
Integration Testing
Here in machine learning we are dealing with a dataset in spreadsheet (CSV) format, so any test case
needs to be checked against that file. Later, classification works on the respective columns of the
dataset.
Test Case 1 :
RESULTS
The program predicts the pickup location of the cab based on the centroids obtained by applying
k-means clustering, so that an appropriate cab can be scheduled for the pickup.
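The clustering code itself is not included in this extract; the sketch below shows the idea with scikit-learn's KMeans, where the number of clusters and the sample user location are illustrative assumptions.
from sklearn.cluster import KMeans
import numpy as np

coords = uber_data[['Lat','Lon']].to_numpy()                           # pickup coordinates from the July 2014 dataset
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(coords)  # k chosen for illustration only

user_location = np.array([[40.7580, -73.9855]])                        # hypothetical pickup request (around Times Square)
zone = kmeans.predict(user_location)[0]                                # nearest cluster = dispatch zone for the cab
print('Assign a cab from cluster', zone, 'centred at', kmeans.cluster_centers_[zone])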
CHAPTER 8: CONCLUSION
8.1 Conclusion
The project presents a basic outline of the trips travelled with respect to the latitude and longitude of
locations, and pinpoints the locations travelled according to the frequency of trips made by an Uber cab
during the day. The dataset is cross-analysed on the latitude and longitude of the points travelled by the
cab and then processed with k-means clustering, which classifies the locations on the basis of centroids
and orders the frequency of trips by label (cluster). Given the location supplied by the user, the
algorithm predicts the cluster nearest to that location so that a cab can be assigned to the user for
pickup.
CHAPTER 9: FUTURE ENHANCEMENTS
Future work suggests that the user will provide a location to the system. The algorithm then records
the time, latitude and longitude of the trip and assigns it to the cluster nearest to the passenger location,
where a cab is scheduled for pickup. We could also predict the passenger count in each district, in order
to deploy more cabs to the clustered coordinates, using convolutional neural networks (CNNs).
REFERENCES
[1] Poulsen, L.K., Dekkers, D., Wagenaar, N., Snijders, W., Lewinsky, B., Mukkamala, R.R. and Vatrapu, R., 2016, June. Green Cabs vs. Uber in New York City. In 2016 IEEE International Congress on Big Data (BigData Congress) (pp. 222-229). IEEE.
[2] Faghih, S.S., Safikhani, A., Moghimi, B. and Kamga, C., 2017. Predicting Short-Term Uber Demand Using Spatio-Temporal Modeling: A New York City Case Study. arXiv preprint arXiv:1712.02001.
[3] Guha, S. and Mishra, N., 2016. Clustering data streams. In Data stream management (pp. 169-187). Springer, Berlin, Heidelberg.
[4] Ahmed, M., Johnson, E.B. and Kim, B.C., 2018. The Impact of Uber and Lyft on Taxi Service Quality: Evidence from New York City. Available at SSRN 3267082.
[5] Wallsten, S., 2015. The competitive effects of the sharing economy: how is Uber changing taxis? Technology Policy Institute, 22, pp. 1-21.
[6] Sotiropoulos, D.N., Pournarakis, D.E. and Giaglis, G.M., 2016, July. A genetic algorithm approach for topic clustering: A centroid-based encoding scheme. In 2016 7th International Conference on Information, Intelligence, Systems & Applications (IISA) (pp. 1-8). IEEE.
[7] Faghih, S.S., Safikhani, A., Moghimi, B. and Kamga, C., 2019. Predicting Short-Term Uber Demand in New York City Using Spatiotemporal Modeling. Journal of Computing in Civil Engineering, 33(3), p. 05019002.
[8] Shah, D., Kumaran, A., Sen, R. and Kumaraguru, P., 2019, May. Travel Time Estimation Accuracy in Developing Regions: An Empirical Case Study with Uber Data in Delhi-NCR. In Companion Proceedings of The 2019 World Wide Web Conference (pp. 130-136). ACM.
[9] Kumar, A., Surana, J., Kapoor, M. and Nahar, P.A. CSE 255 Assignment II: Perfecting Passenger Pickups: An Uber Case Study.
[10] L. Liu, C. Andris, and C. Ratti, "Uncovering cab drivers' behaviour patterns from their digital traces," Comput. Environ. Urban Syst., vol. 34, no. 6, pp. 541-548, 2010.
[11] R.H. Hwang, Y.L. Hsueh, and Y.T. Chen, "An effective taxi recommender system based on a spatio-temporal factor analysis model," Inf. Sci., vol. 314, pp. 28-40, 2015.
[12] Vigneshwari, S., and M. Aramudhan. "Web information extraction on multiple ontologies based on concept relationships upon training the user profiles." In Artificial Intelligence and Evolutionary Algorithms in Engineering Systems, pp. 1-8. Springer, New Delhi, 2015.
[13] L. Rayle, D. Dai, N. Chan, R. Cervero, and S. Shaheen, "Just a better taxi? A survey-based comparison of taxis, transit, and ridesourcing services in San Francisco," Transport Policy, vol. 45, 01 2016.
[14] O. Flores and L. Rayle, "How cities use regulation for innovation: the case of Uber, Lyft and Sidecar in San Francisco," Transportation Research Procedia, vol. 25, pp. 3756-3768, 2017.
[15] H. A. Chaudhari, J. W. Byers, and E. Terzi, "Putting data in the driver's seat: Optimizing earnings for on-demand ride-hailing," in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 2018, pp. 90-98.
LIST OF FIGURES
3. Class Diagram
4. Sequence Diagram
5. Activity Diagram
Domain Specification
MACHINE LEARNING
Machine learning is a system that can learn from examples through self-
improvement, without being explicitly coded by a programmer. The
breakthrough comes with the idea that a machine can learn on its own
from the data (i.e., examples) to produce accurate results.
Machine learning combines data with statistical tools to predict an output.
This output is then used by businesses to derive actionable insights.
Machine learning is closely related to data mining and Bayesian
predictive modeling. The machine receives data as input and uses an
algorithm to formulate answers.
A typical machine learning task is to provide a recommendation. For
those who have a Netflix account, all recommendations of movies or
series are based on the user's historical data. Tech companies use
unsupervised learning to improve the user experience with personalized
recommendations.
Machine learning is also used for a variety of tasks such as fraud detection,
predictive maintenance, portfolio optimization and task automation.
Machine Learning vs. Traditional Programming
Traditional programming differs significantly from machine learning. In
traditional programming, a programmer codes all the rules in consultation
with an expert in the industry for which the software is being developed.
Each rule is based on a logical foundation; the machine executes an
output following the logical statements. When the system grows complex,
more rules need to be written, and the program can quickly become
unsustainable to maintain.
Machine Learning
How does Machine learning work?
Machine learning is the brain where all the learning takes place. The way
the machine learns is similar to the way a human being learns. Humans learn from
experience: the more we know, the more easily we can predict. By
analogy, when we face an unknown situation, the likelihood of success is
lower than in a known situation. Machines are trained the same way. To make
an accurate prediction, the machine sees examples. When we give the
machine a similar example, it can figure out the outcome. However, like a
human, if it is fed a previously unseen example, the machine has
difficulty predicting.
The core objectives of machine learning are learning and inference.
First of all, the machine learns through the discovery of patterns. This
discovery is made thanks to the data. One crucial part of the data scientist's
job is to choose carefully which data to provide to the machine. The list of
attributes used to solve a problem is called a feature vector. You can
think of a feature vector as a subset of data that is used to tackle a
problem.
The machine uses some fancy algorithms to simplify the reality and
transform this discovery into a model. Therefore, the learning stage is
used to describe the data and summarize it into a model.
Define a question
Collect data
Visualize data
Train algorithm
Test the Algorithm
Collect feedback
Refine the algorithm
Loop 4-7 until the results are satisfying
Use the model to make a prediction
Once the algorithm gets good at drawing the right conclusions, it applies
that knowledge to new sets of data.
Machine learning Algorithms and where they are used?
Classification task
Regression task
Classification
Imagine you want to predict the gender of a customer for a commercial.
You will start gathering data on the height, weight, job, salary,
purchasing basket, etc. from your customer database. You know the
gender of each of your customers; it can only be male or female. The
objective of the classifier is to assign a probability of being a male
or a female (i.e., the label) based on the information (i.e., the features you
have collected). When the model has learned how to recognize male or
female, you can use new data to make a prediction. For instance, you just
got new information from an unknown customer, and you want to know
if it is a male or female. If the classifier predicts male = 70%, it means the
algorithm is 70% sure that this customer is a male, and 30% sure it is a
female.
The label can have two or more classes. The above example has only two
classes, but if a classifier needs to predict objects, it may have dozens of classes
(e.g., glass, table, shoes, etc., where each object represents a class).
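A minimal sketch of such a probabilistic classifier with scikit-learn; the feature values and names below are invented for illustration.
from sklearn.linear_model import LogisticRegression

# hypothetical customer features: [height_cm, weight_kg, salary_k]
X = [[180, 85, 60], [165, 60, 45], [175, 80, 70], [158, 52, 40]]
y = ['male', 'female', 'male', 'female']            # known labels

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba([[170, 72, 55]])[0]        # a new, unknown customer
print(dict(zip(clf.classes_, probs)))                # e.g. roughly {'female': 0.3, 'male': 0.7}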
Regression
When the output is a continuous value, the task is a regression. For
instance, a financial analyst may need to forecast the value of a stock
based on a range of features such as equity, previous stock performance and
macroeconomic indices. The system is trained to estimate the price
of the stocks with the lowest possible error.
Unsupervised learning
In unsupervised learning, an algorithm explores input data without being
given an explicit output variable (e.g., explores customer demographic
data to identify patterns)
You can use it when you do not know how to classify the data, and you
want the algorithm to find patterns and classify the data for you
K-means clustering (a clustering algorithm): puts data into some number of groups (k)
that each contain data with similar characteristics, as determined by the
model rather than in advance by humans.
Automation:
Finance Industry
Government organization
Healthcare industry
Marketing
Deep Learning
Deep learning is computer software that mimics the network of neurons
in a brain. It is a subset of machine learning and is called deep learning
because it makes use of deep neural networks. The machine uses different
layers to learn from the data. The depth of the model is represented by the
number of layers in the model. Deep learning is the new state of the art in
terms of AI. In deep learning, the learning phase is done through a neural
network.
Reinforcement Learning
Reinforcement learning is a subfield of machine learning in which
systems are trained by receiving virtual "rewards" or "punishments,"
essentially learning by trial and error. Google's DeepMind has used
reinforcement learning to beat a human champion at the game of Go.
Reinforcement learning is also used in video games to improve the
gaming experience by providing smarter bots.
Some of the most famous algorithms are:
Q-learning
Deep Q network
State-Action-Reward-State-Action (SARSA)
Deep Deterministic Policy Gradient (DDPG)
With machine learning, you need less data to train the algorithm than
with deep learning. Deep learning requires an extensive and diverse set of data
to identify the underlying structure. Besides, machine learning provides a
faster-trained model; the most advanced deep learning architectures can take
days to a week to train. The advantage of deep learning over machine
learning is that it is highly accurate. You do not need to understand which
features are the best representation of the data; the neural network learns
how to select the critical features. In machine learning, you need to choose
for yourself which features to include in the model.
TensorFlow
The most famous deep learning library in the world is Google's
TensorFlow. Google uses machine learning in all of its products
to improve search, translation, image captioning and
recommendations.
To give a concrete example, Google users can experience a faster and
more refined search with AI. If the user types a keyword in the search
bar, Google provides a recommendation about what the next word could
be.
Google wants to use machine learning to take advantage of their massive
datasets to give users the best experience. Three different groups use
machine learning:
Researchers
Data scientists
Programmers.
They can all use the same toolset to collaborate with each other and
improve their efficiency.
Google does not just have any data; it has some of the world's most massive
computing infrastructure, so TensorFlow was built to scale. TensorFlow is a library
developed by the Google Brain Team to accelerate machine learning and
deep neural network research.
It was built to run on multiple CPUs or GPUs and even mobile operating
systems, and it has wrappers in several languages such as Python, C++
and Java.
TensorFlow Architecture
TensorFlow's architecture works in three parts:
You can train it on multiple machines then you can run it on a different
machine, once you have the trained model.
The model can be trained and used on GPUs as well as CPUs. GPUs were
initially designed for video games. In late 2010, Stanford researchers
found that GPUs were also very good at matrix operations and algebra,
which makes them very fast for these kinds of calculations. Deep
learning relies on a lot of matrix multiplication. TensorFlow is very fast
at computing matrix multiplications because it is written in C++.
Although it is implemented in C++, TensorFlow can be accessed and
controlled by other languages, mainly Python.
Finally, a significant feature of TensorFlow is TensorBoard, which
enables you to monitor graphically and visually what TensorFlow is doing.
List of Prominent Algorithms supported by TensorFlow
PYTHON OVERVIEW
Python is a high-level, interpreted, interactive and object-oriented
scripting language. Python is designed to be highly readable. It uses
English keywords frequently whereas other languages use punctuation,
and it has fewer syntactical constructions than other languages.
Python is Interpreted: Python is processed at runtime by the
interpreter. You do not need to compile your program before
executing it. This is similar to PERL and PHP.
Python is Interactive: You can actually sit at a Python prompt and
interact with the interpreter directly to write your programs.
Python is Object-Oriented: Python supports Object-Oriented style
or technique of programming that encapsulates code within objects.
Python is a Beginner's Language: Python is a great language for
beginner-level programmers and supports the development of a wide
range of applications, from simple text processing to WWW browsers
to games.
History of Python
Python was developed by Guido van Rossum in the late eighties and
early nineties at the National Research Institute for Mathematics and
Computer Science in the Netherlands.
Python Features
Python's features include:
Apart from the above-mentioned features, Python has a big list of good
features, a few of which are listed below:
It supports functional and structured programming methods as well
as OOP.
It can be used as a scripting language or can be compiled to byte-code
for building large applications.
It provides very high-level dynamic data types and supports dynamic
type checking.
It supports automatic garbage collection.
It can be easily integrated with C, C++, COM, ActiveX, CORBA, and
Java.
ANACONDA NAVIGATOR
Anaconda Navigator is a desktop graphical user interface (GUI) included
in Anaconda distribution that allows you to launch applications and easily
manage conda packages, environments and channels without using
command-line commands. Navigator can search for packages on
Anaconda Cloud or in a local Anaconda Repository. It is available for
Windows, macOS and Linux.
Why use Navigator?
In order to run, many scientific packages depend on specific versions
of other packages. Data scientists often use multiple versions of
many packages, and use multiple environments to separate these different
versions.
The command line program conda is both a package manager and an
environment manager, to help data scientists ensure that each version of
each package has all the dependencies it requires and works correctly.
Navigator is an easy, point-and-click way to work with packages and
environments without needing to type conda commands in a terminal
window. You can use it to find the packages you want, install them in an
environment, run the packages and update them, all inside Navigator.
WHAT APPLICATIONS CAN I ACCESS USING NAVIGATOR?
The following applications are available by default in Navigator:
Jupyter Lab
Jupyter Notebook
QT Console
Spyder
VS Code
Glue viz
Orange 3 App
Rodeo
RStudio
Advanced conda users can also build their own Navigator applications.
How can I run code with Navigator?
The simplest way is with Spyder. From the Navigator Home tab, click
Spyder, and write and execute your code.
You can also use Jupyter Notebooks the same way. Jupyter Notebooks
are an increasingly popular system that combine your code, descriptive
text, output, images and interactive interfaces into a single notebook file
that is edited, viewed and used in a web browser.
What’s new in 1.9?
Add support for Offline Mode for all environment related actions.
Add support for custom configuration of main windows links.
Numerous bug fixes and performance enhancements.