Python for R Users
By
Chandan Routray
As a part of internship at
www.decisionstats.com
Basic Commands

Downloading and installing a package
R: install.packages('name')
Python: pip install name (run from the command line)

Load a package
R: library('name')
Python: import name as other_name

Checking working directory
R: getwd()
Python: os.getcwd() (after import os)

Setting working directory
R: setwd()
Python: os.chdir()

List files in a directory
R: dir()
Python: os.listdir()

List all objects
R: ls()
Python: globals()

Remove an object
R: rm('name')
Python: del name
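The directory and object commands above can be tried together in one short session. This is a minimal sketch using only the standard library; the file name example.txt and the variable scratch are made up for illustration.

```python
import os
import tempfile

# Remember the starting directory so it can be restored afterwards.
start_dir = os.getcwd()                  # like getwd() in R

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)                        # like setwd(tmp)
    # Create an empty file so the listing has something to show.
    open("example.txt", "w").close()
    files = os.listdir(".")              # like dir()
    os.chdir(start_dir)                  # leave before the directory is deleted

scratch = 42
del scratch                              # like rm(scratch); note: no quotes

# After del, the name no longer exists and using it raises NameError.
try:
    scratch
    removed = False
except NameError:
    removed = True

print(files, removed)
```

Unlike rm('name') in R, del takes the name itself rather than a string.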
Dec 2014
Copyright www.decisionstats.com
Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Data Frame Creation
(Python examples use the pandas package*)

Creating a data frame df of dimension 6x4 (6 rows and 4 columns) containing random numbers

R:
A <- matrix(runif(24, 0, 1), nrow = 6, ncol = 4)
df <- data.frame(A)
Here,
runif generates 24 random numbers between 0 and 1
matrix creates a matrix from those random numbers; nrow and ncol set the number of rows and columns of the matrix
data.frame converts the matrix to a data frame

Python:
import numpy as np
import pandas as pd
A = np.random.randn(6, 4)
df = pd.DataFrame(A)
Here,
np.random.randn generates a matrix of 6 rows and 4 columns drawn from the standard normal distribution (np.random.rand(6, 4) gives uniform numbers between 0 and 1, matching runif); this function is part of the numpy** library
pd.DataFrame converts the matrix into a data frame

*To install the pandas library visit: http://pandas.pydata.org/; to import the pandas library type: import pandas as pd
**To import the numpy library type: import numpy as np
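The difference between uniform and normal random draws can be checked directly. The following is a small sketch, with a fixed seed chosen only so the run is repeatable; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)                    # fixed seed so the run is repeatable

uniform = np.random.rand(6, 4)       # uniform on [0, 1), like runif(24, 0, 1)
normal = np.random.randn(6, 4)       # standard normal, like rnorm(24)

df_uniform = pd.DataFrame(uniform)
df_normal = pd.DataFrame(normal)

print(df_uniform.shape)              # (6, 4)
print(df_uniform.min().min() >= 0)   # uniform draws are never negative
print(df_normal.min().min())         # normal draws can be negative
```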
Data Frame: Inspecting and Viewing Data
(Python examples use the pandas package*)

Getting the names of rows and columns of data frame df
R: rownames(df) returns the names of the rows
Python: df.index returns the names of the rows
R: colnames(df) returns the names of the columns
Python: df.columns returns the names of the columns

Seeing the top and bottom x rows of the data frame df
R: head(df, x) returns the top x rows of the data frame
Python: df.head(x) returns the top x rows of the data frame
R: tail(df, x) returns the bottom x rows of the data frame
Python: df.tail(x) returns the bottom x rows of the data frame

Getting dimension of data frame df
R: dim(df) returns the dimensions in this format: rows, columns
Python: df.shape returns the dimensions in this format: (rows, columns)

Length of data frame df
R: length(df) returns the number of columns in the data frame
Python: len(df) returns the number of rows in the data frame; len(df.columns) returns the number of columns
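One point worth checking by hand is that len(df) in pandas counts rows, while length(df) in R counts columns. A minimal sketch on a small 6x4 frame built for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4))

n_rows, n_cols = df.shape            # like dim(df) in R
print(n_rows, n_cols)                # 6 4
print(len(df))                       # 6 (rows, unlike R's length(df))
print(len(df.columns))               # 4 (columns, matching R's length(df))
print(df.head(2).index.tolist())     # [0, 1]
print(df.tail(2).index.tolist())     # [4, 5]
```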
Data Frame: Inspecting and Viewing Data
(Python examples use the pandas package*)

Getting a quick summary (mean, standard deviation, etc.) of data in the data frame df
R: summary(df) returns minimum, first quartile, median, mean, third quartile and maximum
Python: df.describe() returns count, mean, standard deviation, minimum, 25%, 50%, 75% and maximum

Setting row names and column names of the data frame df
R: rownames(df) = c('A','B','C','D','E','F') sets the row names to A, B, C, D, E and F
Python: df.index = ['A','B','C','D','E','F'] sets the row names to A, B, C, D, E and F
R: colnames(df) = c('P','Q','R','S') sets the column names to P, Q, R and S
Python: df.columns = ['P','Q','R','S'] sets the column names to P, Q, R and S
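As a quick check of the naming and summary commands, here is a small sketch on a 6x4 frame filled with the numbers 0 to 23 (chosen only so the results are easy to verify by hand).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4))
df.index = ["A", "B", "C", "D", "E", "F"]    # six row names for six rows
df.columns = ["P", "Q", "R", "S"]            # four column names

summary = df.describe()
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc["mean", "P"])              # 10.0 (mean of 0, 4, 8, 12, 16, 20)
```

The number of names must match the number of rows or columns, otherwise pandas raises an error.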
Data Frame: Sorting Data
(Python examples use the pandas package*)

Sorting the data in the data frame df by column name P
R: df[order(df$P),]
Python: df.sort_values(by='P') (older pandas releases used df.sort(['P']), which has since been removed)
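In current pandas releases, sorting by a column is done with sort_values. The sketch below uses a three-row frame made up for illustration and shows ascending and descending order.

```python
import pandas as pd

df = pd.DataFrame({"P": [3, 1, 2], "Q": ["c", "a", "b"]})

ascending = df.sort_values(by="P")                      # like df[order(df$P), ]
descending = df.sort_values(by="P", ascending=False)    # like df[order(-df$P), ]

print(ascending["Q"].tolist())     # ['a', 'b', 'c']
print(descending["Q"].tolist())    # ['c', 'b', 'a']
print(df["P"].tolist())            # [3, 1, 2]; the original frame is unchanged
```

As with order() in R, the result is a new data frame; assign it back to df to keep the sorted version.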
Data Frame: Data Selection
(Python examples use the pandas package*)

Slicing the rows of a data frame from row no. x to row no. y (including rows x and y)
R: df[x:y,]
Python: df[x-1:y] (Python starts counting from 0 and excludes the end of a slice)

Selecting the columns named X, Y, etc. of a data frame df
R:
myvars <- c('X','Y')
newdata <- df[myvars]
Python: df.loc[:,['X','Y']]

Selecting the data from row no. x to y and column no. a to b
R: df[x:y,a:b]
Python: df.iloc[x-1:y,a-1:b]

Selecting the element at row no. x and column no. y
R: df[x,y]
Python: df.iat[x-1,y-1]
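The off-by-one shift between R's 1-based, inclusive indexing and Python's 0-based, end-exclusive slicing is easy to get wrong, so a small worked sketch may help. The frame and the values of x, y, a and b are chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=["P", "Q", "R", "S"])
x, y = 2, 4    # R-style row numbers (1-based, inclusive)
a, b = 1, 2    # R-style column numbers (1-based, inclusive)

rows = df[x - 1:y]                   # R: df[x:y, ]
block = df.iloc[x - 1:y, a - 1:b]    # R: df[x:y, a:b]
cols = df.loc[:, ["P", "R"]]         # R: df[c("P", "R")]
cell = df.iat[x - 1, a - 1]          # R: df[x, a]

print(rows.index.tolist())           # [1, 2, 3]
print(block.shape)                   # (3, 2)
print(cols.columns.tolist())         # ['P', 'R']
print(cell)                          # 4
```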
Data Frame: Data Selection
(Python examples use the pandas package*)

Using a single column's values to select data, column name A
R: subset(df, A > 0)
Selects all the rows in which the value in column A is greater than 0
Python: df[df.A > 0]
Does the same as the R function
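Selecting rows by a column's values works through a boolean mask. A minimal sketch with a four-row frame made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -3.0], "B": [10, 20, 30, 40]})

mask = df["A"] > 0          # boolean Series, one entry per row
positive = df[mask]         # same result as df[df.A > 0]

print(mask.tolist())             # [False, True, True, False]
print(positive["B"].tolist())    # [20, 30]
```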
Mathematical Functions
(Python examples import the math and numpy libraries)

Sum
R: sum(x)
Python: math.fsum(x)

Square Root
R: sqrt(x)
Python: math.sqrt(x)

Standard Deviation
R: sd(x)
Python: numpy.std(x, ddof=1) (sd divides by n - 1; numpy.std(x) on its own divides by n)

Log
R: log(x)
Python: math.log(x[, base])

Mean
R: mean(x)
Python: numpy.mean(x)

Median
R: median(x)
Python: numpy.median(x)
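The standard deviation is the one function in this list where the defaults differ between R and numpy. The sketch below uses eight illustrative values whose population standard deviation is exactly 2, so the two conventions are easy to tell apart.

```python
import math
import numpy as np

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

total = math.fsum(x)               # 40.0
mean = np.mean(x)                  # 5.0
median = np.median(x)              # 4.5
population_sd = np.std(x)          # 2.0, divides by n
sample_sd = np.std(x, ddof=1)      # divides by n - 1, as R's sd(x) does

print(total, mean, median, population_sd, sample_sd)
```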
Data Manipulation
(Python examples import the math and numpy libraries)

Convert character variable to numeric variable
R: as.numeric(x)
Python: for a single value: int(x), float(x) (long(x) exists only in Python 2)
Python: for lists, vectors, etc.: list(map(int, x)), list(map(float, x))

Convert factor/numeric variable to character variable
R: paste(x)
Python: for a single value: str(x)
Python: for lists, vectors, etc.: list(map(str, x))

Check missing value in an object
R: is.na(x)
Python: math.isnan(x)

Delete missing value from an object
R: na.omit(list)
Python: cleanedList = [x for x in list if str(x) != 'nan']

Calculate the number of characters in a character value
R: nchar(x)
Python: len(x)
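The conversions above can be chained on a small list. This is a sketch using only the standard library; the input strings are made up, and 'nan' stands in for a missing value.

```python
import math

raw = ["1.5", "2", "nan", "4.25"]

numbers = [float(s) for s in raw]                      # like as.numeric(raw)
missing = [math.isnan(v) for v in numbers]             # like is.na(numbers)
cleaned = [v for v in numbers if not math.isnan(v)]    # like na.omit(numbers)
as_text = list(map(str, cleaned))                      # like paste(cleaned)

print(missing)             # [False, False, True, False]
print(cleaned)             # [1.5, 2.0, 4.25]
print(as_text)             # ['1.5', '2.0', '4.25']
print(len(as_text[0]))     # 3, like nchar("1.5")
```

Note that math.isnan works on one number at a time, which is why it sits inside a list comprehension here.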
Date & Time Manipulation
(R examples use the lubridate library; Python examples use the datetime library)

Getting time and date at an instant
R: Sys.time()
Python: datetime.datetime.now()

Parsing date and time in format: YYYY MM DD HH:MM:SS
R:
d <- Sys.time()
d_format <- ymd_hms(d)
Python:
d = datetime.datetime.now()
format = "%Y %b %d %H:%M:%S"
d_format = d.strftime(format)
(%b gives the abbreviated month name; use %m for a numeric month)
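In the datetime library, strftime turns a datetime into text and strptime parses text back into a datetime. A minimal round-trip sketch, assuming the numeric-month format string below and a fixed instant chosen so the run is repeatable:

```python
import datetime

fmt = "%Y %m %d %H:%M:%S"          # YYYY MM DD HH:MM:SS with a numeric month

d = datetime.datetime(2014, 12, 1, 9, 30, 0)       # fixed instant for a repeatable run
text = d.strftime(fmt)                             # datetime -> string
parsed = datetime.datetime.strptime(text, fmt)     # string -> datetime

print(text)            # 2014 12 01 09:30:00
print(parsed == d)     # True
```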
Data Visualization
(Python examples use the matplotlib library**)

Scatter Plot variable1 vs variable2
R: plot(variable1, variable2)
Python:
plt.scatter(variable1, variable2)
plt.show()

Boxplot for Var
R: boxplot(Var)
Python:
plt.boxplot(Var)
plt.show()

Histogram for Var
R: hist(Var)
Python:
plt.hist(Var)
plt.show()

Pie Chart for Var
R: pie(Var)
Python:
from pylab import *
pie(Var)
show()

**To import the matplotlib library type: import matplotlib.pyplot as plt
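The plotting calls above can also be combined in one figure. The sketch below assumes matplotlib is installed, uses made-up data, and writes the figure to memory with the non-interactive Agg backend instead of opening a window with plt.show().

```python
import io

import matplotlib
matplotlib.use("Agg")              # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

variable1 = [1, 2, 3, 4, 5]
variable2 = [2, 4, 5, 4, 6]

# One row of three panels: scatter plot, boxplot and histogram.
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes[0].scatter(variable1, variable2)
axes[0].set_title("Scatter")
axes[1].boxplot(variable2)
axes[1].set_title("Boxplot")
axes[2].hist(variable2, bins=3)
axes[2].set_title("Histogram")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # render the figure as PNG bytes
plt.close(fig)
print(len(buffer.getvalue()) > 0)
```

In an interactive session, replacing the savefig call with plt.show() displays the figure instead.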
Data Visualization: Scatter Plot [figure: Python example]
Data Visualization: Box Plot [figure: Python example]
Data Visualization: Histogram [figure: R and Python examples]
Data Visualization: Line Plot [figure: Python example]
Data Visualization: Bubble [figure: R and Python examples]
Data Visualization: Bar [figure: R and Python examples]
Data Visualization: Pie Chart [figure: R and Python examples]
Thank You
For feedback contact
DecisionStats.com
Coming up
Data Mining in Python and R (see draft slides afterwards)
Machine Learning: SVM on Iris Dataset

R (using the svm* function)
library(e1071)
data(iris)
trainset <- iris[1:149,]
testset <- iris[150,]
svm.model <- svm(Species ~ ., data = trainset, cost = 100, gamma = 1, type = 'C-classification')
svm.pred <- predict(svm.model, testset[-5])
svm.pred
Output: virginica

Python (using the sklearn** library)
# Loading library
from sklearn import svm
# Importing dataset
from sklearn import datasets
# Calling SVM
clf = svm.SVC()
# Loading the data
iris = datasets.load_iris()
# Constructing training data (all rows except the last)
X, y = iris.data[:-1], iris.target[:-1]
# Fitting SVM
clf.fit(X, y)
# Testing the model on test data (the last row, kept two-dimensional)
print(clf.predict(iris.data[-1:]))
Output: 2, corresponds to virginica

*To know more about the svm function in R visit: http://cran.r-project.org/web/packages/e1071/
**To install the sklearn library visit: http://scikit-learn.org/; to know more about sklearn svm visit: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
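The Python classifier returns an integer class code, which can be mapped back to a species name through iris.target_names. A minimal sketch, assuming scikit-learn is installed and using the same split as above (first 149 rows for training, last row for testing):

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
X_train, y_train = iris.data[:-1], iris.target[:-1]    # first 149 rows
X_test = iris.data[-1:]                                # last row, kept 2-D

clf = svm.SVC()
clf.fit(X_train, y_train)

label = clf.predict(X_test)[0]         # integer class code (0, 1 or 2)
name = iris.target_names[label]        # species name for that code
print(label, name)
```

Slicing with [-1:] rather than [-1] keeps the test row two-dimensional, which is the input shape predict expects.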
Linear Regression: Iris Dataset

R (using the lm* function)
data(iris)
total_size <- dim(iris)[1]
num_target <- c(rep(0, total_size))
for (i in 1:length(num_target)) {
  if (iris$Species[i] == 'setosa') { num_target[i] <- 0 }
  else if (iris$Species[i] == 'versicolor') { num_target[i] <- 1 }
  else { num_target[i] <- 2 }
}
iris$Species <- num_target
train_set <- iris[1:149,]
test_set <- iris[150,]
fit <- lm(Species ~ 0 + Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = train_set)
coefficients(fit)
predict.lm(fit, test_set)
Output: 1.64

Python (using the sklearn** library)
from sklearn import linear_model
from sklearn import datasets
iris = datasets.load_iris()
regr = linear_model.LinearRegression()
X, y = iris.data[:-1], iris.target[:-1]
regr.fit(X, y)
print(regr.coef_)
print(regr.predict(iris.data[-1:]))
Output: 1.65

*To know more about the lm function in R visit: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html
**To know more about sklearn linear regression visit: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
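One possible reason the R and Python outputs differ slightly is that the R formula starts with 0 +, which suppresses the intercept, while LinearRegression fits an intercept by default. The sketch below, assuming scikit-learn is installed, fits both variants so they can be compared; fit_intercept=False is the sklearn setting that corresponds to the R formula.

```python
from sklearn import datasets, linear_model

iris = datasets.load_iris()
X, y = iris.data[:-1], iris.target[:-1]

with_intercept = linear_model.LinearRegression()                    # sklearn default
no_intercept = linear_model.LinearRegression(fit_intercept=False)   # like "0 +" in lm

with_intercept.fit(X, y)
no_intercept.fit(X, y)

# Compare the fitted intercepts and the predictions for the held-out last row.
print(with_intercept.intercept_, with_intercept.predict(iris.data[-1:]))
print(no_intercept.intercept_, no_intercept.predict(iris.data[-1:]))
```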
Random Forest: Iris Dataset

R (using the randomForest* package)
library(randomForest)
data(iris)
total_size <- dim(iris)[1]
num_target <- c(rep(0, total_size))
for (i in 1:length(num_target)) {
  if (iris$Species[i] == 'setosa') { num_target[i] <- 0 }
  else if (iris$Species[i] == 'versicolor') { num_target[i] <- 1 }
  else { num_target[i] <- 2 }
}
iris$Species <- num_target
train_set <- iris[1:149,]
test_set <- iris[150,]
iris.rf <- randomForest(Species ~ ., data = train_set, ntree = 100, importance = TRUE, proximity = TRUE)
print(iris.rf)
predict(iris.rf, test_set[-5], predict.all = TRUE)
Output: 1.845

Python (using the sklearn** library)
from sklearn import ensemble
from sklearn import datasets
clf = ensemble.RandomForestClassifier(n_estimators=100, max_depth=10)
iris = datasets.load_iris()
X, y = iris.data[:-1], iris.target[:-1]
clf.fit(X, y)
print(clf.predict(iris.data[-1:]))
Output: 2

*To know more about the randomForest package in R visit: http://cran.r-project.org/web/packages/randomForest/
**To know more about sklearn random forest visit: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Decision Tree: Iris Dataset

R (using the rpart* package)
library(rpart)
data(iris)
sub <- c(1:149)
fit <- rpart(Species ~ ., data = iris, subset = sub)
fit
predict(fit, iris[-sub,], type = "class")
Output: virginica

Python (using the sklearn** library)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
X, y = iris.data[:-1], iris.target[:-1]
clf.fit(X, y)
print(clf.predict(iris.data[-1:]))
Output: 2, corresponds to virginica

*To know more about the rpart package in R visit: http://cran.r-project.org/web/packages/rpart/
**To know more about sklearn decision tree visit: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Gaussian Naive Bayes: Iris Dataset

R (using the e1071* package)
library(e1071)
data(iris)
trainset <- iris[1:149,]
testset <- iris[150,]
classifier <- naiveBayes(trainset[,1:4], trainset[,5])
predict(classifier, testset[,-5])
Output: virginica

Python (using the sklearn** library)
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
iris = load_iris()
X, y = iris.data[:-1], iris.target[:-1]
clf.fit(X, y)
print(clf.predict(iris.data[-1:]))
Output: 2, corresponds to virginica

*To know more about the e1071 package in R visit: http://cran.r-project.org/web/packages/e1071/
**To know more about sklearn Naive Bayes visit: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
K Nearest Neighbours: Iris Dataset

R (using the kknn* package)
library(kknn)
data(iris)
trainset <- iris[1:149,]
testset <- iris[150,]
iris.kknn <- kknn(Species ~ ., trainset, testset, distance = 1, kernel = "triangular")
summary(iris.kknn)
fit <- fitted(iris.kknn)
fit
Output: virginica

Python (using the sklearn** library)
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
iris = load_iris()
X, y = iris.data[:-1], iris.target[:-1]
knn.fit(X, y)
print(knn.predict(iris.data[-1:]))
Output: 2, corresponds to virginica

*To know more about kknn package in R visit:
**To know more about sklearn k nearest neighbours visit: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
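The sklearn classifiers shown for random forest, decision tree, naive Bayes and k nearest neighbours share the same fit and predict interface, so they can be run in one loop. This is a sketch assuming scikit-learn is installed; the dictionary labels and the random_state values are choices made here for repeatability, not part of the original slides.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:-1], iris.target[:-1]    # first 149 rows for training
X_test = iris.data[-1:]                    # last row for testing, kept 2-D

models = {
    "random forest": RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "k nearest neighbours": KNeighborsClassifier(),
}

predictions = {}
for label, model in models.items():
    model.fit(X, y)
    code = model.predict(X_test)[0]              # integer class code
    predictions[label] = iris.target_names[code] # species name for that code
    print(label, "->", predictions[label])
```

A single held-out row only shows that each model runs; it says little about accuracy.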
Thank You
For feedback please let us know at
[email protected]