iNeuron
Machine Learning
Practical Assignments
- Kumar Deepak
Decision Tree Assignment
Predicting Survival in the Titanic Data Set
We will be using a decision tree to make predictions about the Titanic data
set from Kaggle. This data set provides information on the Titanic
passengers and can be used to predict whether a passenger survived or
not.
Loading Data and modules
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn
from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import metrics
from sklearn.metrics import classification_report

url = 'https://raw.githubusercontent.com/BigDataGal/Python-for-Data-Science/master/titanic-train.csv'
titanic = pd.read_csv(url)
titanic.columns = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
                   'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
You will use only Pclass, Sex, Age, SibSp (siblings aboard), Parch (parents/children aboard), and Fare to predict whether a passenger survived.
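A minimal sketch of that workflow, assuming the imports and titanic DataFrame from the loading code above; the 80/20 split, the tree depth, and the simple handling of missing values are illustrative choices, not requirements:

# Keep only the predictor columns and the target; drop rows with missing values
cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
data = titanic[cols + ['Survived']].dropna()

# Encode Sex as 0/1 so the tree can consume it
data['Sex'] = (data['Sex'] == 'male').astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    data[cols], data['Survived'], test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))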
Task: Deploy this assignment on any cloud platform (try to find a free cloud platform).
Assignment: Submit the assignment's deployable link only.
Linear Regression Assignment
Build a linear regression model using scikit-learn on the Boston housing data to predict 'Price' based on the other independent variables.
Here is the code to load the data:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
from sklearn.datasets import load_boston  # load_boston was removed in scikit-learn 1.2; use an older version to run this

boston = load_boston()
bos = pd.DataFrame(boston.data, columns=boston.feature_names)
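A minimal sketch of fitting and scoring the model, assuming the bos DataFrame above; the 'Price' column name and the 70/30 split are illustrative choices:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

bos['Price'] = boston.target  # median home value is the target

X = bos.drop('Price', axis=1)
y = bos['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lm = LinearRegression()
lm.fit(X_train, y_train)
print('Test MSE:', mean_squared_error(y_test, lm.predict(X_test)))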
Task: Deploy this assignment on any cloud platform (try to find a free cloud platform).
Assignment: Submit the assignment's deployable link only.
Logistic Regression Assignment
I decided to treat this as a classification problem by creating a new binary
variable affair (did the woman have at least one affair?) and trying to
predict the classification for each woman.
Dataset
The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a survey of
women in 1974 by Redbook magazine, in which married women were asked about their participation in
extramarital affairs. More information about the study is available in a 1978 paper from the Journal of
Political Economy.
Description of Variables
The dataset contains 6366 observations of 9 variables:
rate_marriage: woman's rating of her marriage (1 = very poor, 5 = very good)
age: woman's age
yrs_married: number of years married
children: number of children
religious: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
educ: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
occupation: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
occupation_husb: husband's occupation (same coding as above)
affairs: time spent in extra-marital affairs
Code to load the data and modules:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import metrics
from sklearn.model_selection import cross_val_score

dta = sm.datasets.fair.load_pandas().data

# add "affair" column: 1 represents having affairs, 0 represents not
dta['affair'] = (dta.affairs > 0).astype(int)

y, X = dmatrices('affair ~ rate_marriage + age + yrs_married + children + '
                 'religious + educ + C(occupation) + C(occupation_husb)',
                 dta, return_type="dataframe")

X = X.rename(columns={'C(occupation)[T.2.0]': 'occ_2',
                      'C(occupation)[T.3.0]': 'occ_3',
                      'C(occupation)[T.4.0]': 'occ_4',
                      'C(occupation)[T.5.0]': 'occ_5',
                      'C(occupation)[T.6.0]': 'occ_6',
                      'C(occupation_husb)[T.2.0]': 'occ_husb_2',
                      'C(occupation_husb)[T.3.0]': 'occ_husb_3',
                      'C(occupation_husb)[T.4.0]': 'occ_husb_4',
                      'C(occupation_husb)[T.5.0]': 'occ_husb_5',
                      'C(occupation_husb)[T.6.0]': 'occ_husb_6'})
y = np.ravel(y)
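A minimal sketch of fitting and checking the classifier, assuming the X and y built above; the 75/25 split and 10-fold cross-validation are illustrative choices:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)
print('Test accuracy:', metrics.accuracy_score(y_test, model.predict(X_test)))

# 10-fold cross-validated accuracy on the full data
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print('CV accuracy:', scores.mean())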
Task: Deploy this assignment on any cloud platform (try to find a free cloud platform).
Assignment: Submit the assignment's deployable link only.
ML Project
Predicting players rating
In this project you are going to predict the overall rating of soccer players based on their attributes such as 'crossing', 'finishing', etc.
The dataset you are going to use is the European Soccer Database
(https://www.kaggle.com/hugomathien/soccer), which has more than 25,000 matches and more than 10,000 players for European professional soccer seasons from 2008 to 2016.
Download the data into the same folder and run the following command to get it into the environment.
About the Dataset
The ultimate Soccer database for data analysis and machine learning
The dataset comes in the form of an SQL database and contains statistics of about 25,000 football matches from the top football leagues of 11 European countries. It covers seasons from 2008 to 2016 and contains match statistics (e.g. scores, corners, fouls) as well as the team formations, with player names and a pair of coordinates to indicate their position on the pitch.
+25,000 matches
+10,000 players
11 European Countries with their lead championship
Seasons 2008 to 2016
Players and Teams' attributes* sourced from EA Sports' FIFA video game series, including
the weekly updates
Team line up with squad formation (X, Y coordinates)
Betting odds from up to 10 providers
Detailed match events (goal types, possession, corner, cross, fouls, cards etc...) for
+10,000 matches
The dataset also has a set of about 35 statistics for each player, derived from EA Sports' FIFA video
games. It is not just the stats that come with a new version of the game but also the weekly updates. So, for instance, if a player has performed poorly over a period of time and his stats get impacted in FIFA, you would normally see the same in the dataset.
Python skills required to complete this project:
SQL:
The data is in an SQL database, so students need to retrieve it using a query language. They also need to know how to connect to an SQL database with Python. The library we are using for this is 'sqlite3'.
SQLite can be integrated with Python using the sqlite3 module, which was written by Gerhard Haring. It provides an SQL interface compliant with the DB-API 2.0 specification described by PEP 249. You do not need to install this module separately because it is shipped by default with Python version 2.5.x onwards. To use the sqlite3 module, you must first create a connection object that represents the database; then, optionally, you can create a cursor object, which will help you execute all the SQL statements.
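A minimal sketch of that connection/cursor pattern, assuming a local file named database.sqlite (the same file used in the project code below):

import sqlite3

conn = sqlite3.connect('database.sqlite')   # connection object representing the database
cur = conn.cursor()                         # cursor for executing SQL statements

cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cur.fetchall())                       # list the tables in the database

conn.close()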
Pandas:
Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, analytics, etc. In this tutorial, we will learn the various features of Python Pandas and how to use them in practice.
Scikit Learn
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent
interface in Python.
The library is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn. This stack includes:
NumPy: Base n-dimensional array package
SciPy: Fundamental library for scientific computing
Matplotlib: Comprehensive 2D/3D plotting
IPython: Enhanced interactive console
Sympy: Symbolic mathematics
Pandas: Data structures and analysis
Extensions or modules for SciPy are conventionally named SciKits. As such, the module that provides learning algorithms is named scikit-learn.
The vision for the library is a level of robustness and support required for use in production systems. This means a deep focus on concerns such as ease of use, code quality, collaboration, documentation and performance.
Machine Learning skills required to complete the project
Supervised learning
Supervised learning deals with learning a function from available training data. A supervised learning
algorithm analyzes the training data and produces an inferred function, which can be used for mapping
new examples.
Regression
Regression is a parametric technique used to predict a continuous (dependent) variable given a set of independent variables. It is parametric in nature because it makes certain assumptions (e.g. linearity, independent errors) about the data set. If the data set satisfies those assumptions, regression gives reliable results.
Model evaluation
Students must know how to judge a model on unseen data and which metric to select to judge its performance.
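For a continuous target such as a player's overall rating, root-mean-squared error (RMSE) is a common metric; a minimal sketch with toy stand-in values (the numbers are illustrative, not from the dataset):

from math import sqrt
from sklearn.metrics import mean_squared_error

# toy stand-ins for actual and predicted ratings (illustrative values)
y_test = [82, 75, 68]
y_pred = [80, 77, 70]
rmse = sqrt(mean_squared_error(y_test, y_pred))  # error in the target's own units
print(rmse)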
Let's get started.....
Import Libraries
import sqlite3
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
Read Data from the Database into pandas
# Create your connection.
cnx = sqlite3.connect('database.sqlite')
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)
df.head()
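From here, one possible path is to pick a few numeric attribute columns, split the data, and compare the two regressors imported above. This is a sketch, not the required solution; the attribute list, tree depth, and 67/33 split are illustrative choices:

# Illustrative subset of numeric attributes; overall_rating is the target
features = ['potential', 'crossing', 'finishing', 'short_passing', 'ball_control']
data = df[features + ['overall_rating']].dropna()

X_train, X_test, y_train, y_test = train_test_split(
    data[features], data['overall_rating'], test_size=0.33, random_state=42)

for model in (LinearRegression(), DecisionTreeRegressor(max_depth=10)):
    model.fit(X_train, y_train)
    rmse = sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(type(model).__name__, 'RMSE:', rmse)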
Task: Deploy this assignment on any cloud platform (try to find a free cloud platform).
Assignment: Submit the assignment's deployable link only.
Random Forest Assignment
In this assignment students will build a random forest model, after normalizing the variables, to predict house prices from the Boston data set.
Following is the code to get the data into the environment:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

boston = datasets.load_boston()
features = pd.DataFrame(boston.data, columns=boston.feature_names)
targets = boston.target
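A minimal sketch of the normalize-then-fit step, assuming the features and targets above; the 70/30 split and 100 trees are illustrative choices:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train_s, y_train)
print('Test MSE:', mean_squared_error(y_test, rf.predict(X_test_s)))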
Task: Deploy this assignment on any cloud platform (try to find a free cloud platform).
Assignment: Submit the assignment's deployable link only.
Time Series Assignment
In this assignment students have to build an ARIMA model on the shampoo sales data and check the MSE between the predicted and actual values.
Students can download the data in .csv format from the following link:
https://raw.githubusercontent.com/blue-yonder/pydse/master/pydse/data/sales-of-shampoo-over-a-three-ye.csv
Hint:
Following are the commands to import the packages and the data:
from pandas import read_csv
from datetime import datetime  # pandas.datetime was removed in pandas 1.0
from matplotlib import pyplot
from statsmodels.tsa.arima.model import ARIMA  # statsmodels.tsa.arima_model was removed in statsmodels 0.13
from sklearn.metrics import mean_squared_error

def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('https://raw.githubusercontent.com/blue-yonder/pydse/master/pydse/data/sales-of-shampoo-over-a-three-ye.csv',
                  header=0, parse_dates=[0], index_col=0,
                  squeeze=True, date_parser=parser)  # on pandas >= 2.0, drop squeeze=True and call .squeeze('columns') instead
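A minimal sketch of fitting the model and checking the MSE via a walk-forward split, assuming the series loaded above; the (5, 1, 0) order and 66/34 split are assumptions borrowed from common shampoo-sales tutorials, not requirements:

X = series.values
split = int(len(X) * 0.66)
train, test = list(X[:split]), X[split:]

predictions = []
for actual in test:
    model_fit = ARIMA(train, order=(5, 1, 0)).fit()
    yhat = model_fit.forecast()[0]   # one-step-ahead forecast
    predictions.append(yhat)
    train.append(actual)             # roll the training window forward

print('Test MSE:', mean_squared_error(test, predictions))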
Task: Deploy this assignment on any cloud platform (try to find a free cloud platform).
Assignment: Submit the assignment's deployable link only.
Time Series Project Assignment
Hint:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot  # pandas.tools.plotting was removed in pandas 0.20
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.arima.model import ARIMA, ARIMAResults  # replaces ARIMA/ARMAResults from the removed statsmodels.tsa.arima_model
import datetime
import sys
import seaborn as sns
import statsmodels
import statsmodels.stats.diagnostic as diag
from statsmodels.tsa.stattools import adfuller
from scipy.stats.mstats import normaltest
from matplotlib.pyplot import acorr
plt.style.use('fivethirtyeight')
%matplotlib inline
df = pd.read_csv('C:/Users/Downloads/sp500/data_stocks.csv')
df.head()
Problem Statement:
Pick the following stocks and generate forecasts for each of them (a sketch of one possible loop follows the list):
Stocks:
1. NASDAQ.AAPL
2. NASDAQ.ADP
3. NASDAQ.CBOE
4. NASDAQ.CSCO
5. NASDAQ.EBAY
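A minimal sketch of that loop, assuming the df loaded above has one column per ticker named exactly as listed; the (5, 1, 0) order and 50-step horizon are illustrative assumptions:

stocks = ['NASDAQ.AAPL', 'NASDAQ.ADP', 'NASDAQ.CBOE', 'NASDAQ.CSCO', 'NASDAQ.EBAY']

for ticker in stocks:
    series = df[ticker].dropna()
    model_fit = ARIMA(series, order=(5, 1, 0)).fit()
    forecast = model_fit.forecast(steps=50)   # forecast the next 50 observations
    print(ticker, forecast.head())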
Task: Deploy this assignment on any cloud platform (try to find a free cloud platform).
Assignment: Submit the assignment's deployable link only.
XG Boost Assignment
In this assignment students need to predict whether a person makes over 50K per year or not from the classic Adult dataset using XGBoost. The description of the dataset is as follows:
Data Set Information:
Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records
was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&&
(HRSWK>0))
Attribute Information:
Listing of attributes (the target classes are >50K and <=50K):
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-
worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th,
12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-
absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-
cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv,
Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-
USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica,
Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti,
Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago,
Peru, Hong, Holand-Netherlands.
Following is the code to load required libraries and data:
import numpy as np
import pandas as pd

train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)
test_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test', skiprows=1, header=None)

col_labels = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
              'marital_status', 'occupation', 'relationship', 'race', 'sex',
              'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'wage_class']
train_set.columns = col_labels
test_set.columns = col_labels
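A minimal sketch of training the classifier on these data, assuming the xgboost package is installed; the combined label-encoding of the categorical columns is an illustrative choice, not the only valid preprocessing:

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# adult.test labels carry a trailing period ('>50K.'); strip it so the
# train and test label sets match
test_set['wage_class'] = test_set['wage_class'].str.replace('.', '', regex=False)

# Encode categorical columns consistently across both sets by fitting
# the integer codes on the combined data (an illustrative choice)
combined = pd.concat([train_set, test_set], keys=['train', 'test'])
for col in combined.select_dtypes(include='object').columns:
    combined[col] = combined[col].astype('category').cat.codes
train_enc, test_enc = combined.loc['train'], combined.loc['test']

X_train, y_train = train_enc.drop('wage_class', axis=1), train_enc['wage_class']
X_test, y_test = test_enc.drop('wage_class', axis=1), test_enc['wage_class']

clf = XGBClassifier()
clf.fit(X_train, y_train)
print('Test accuracy:', accuracy_score(y_test, clf.predict(X_test)))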
Task: Deploy this assignment on any cloud platform (try to find a free cloud platform).
Assignment: Submit the assignment's deployable link only.