R is a programming language and software environment for statistical analysis,
graphics representation and reporting. R was created by Ross Ihaka and Robert
Gentleman at the University of Auckland, New Zealand, and is currently developed
by the R Development Core Team.
The core of R is an interpreted computer language which allows branching and
looping as well as modular programming using functions. R allows integration with
the procedures written in the C, C++, .Net, Python or FORTRAN languages for
efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary
versions are provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copy left, and an official part of the
GNU project called GNU S.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of
Statistics of the University of Auckland in Auckland, New Zealand. R made its first
appearance in 1993.
• A large group of individuals has contributed to R by sending code and
bug reports.
• Since mid-1997 there has been a core group (the "R Core Team") who
can modify the R source code archive.
Features of R
As stated earlier, R is a programming language and software environment for
statistical analysis, graphics representation and reporting. The following are the
important features of R −
• R is a well-developed, simple and effective programming language
which includes conditionals, loops, user defined recursive functions and
input and output facilities.
• R has an effective data handling and storage facility,
• R provides a suite of operators for calculations on arrays, lists, vectors
and matrices.
• R provides a large, coherent and integrated collection of tools for data
analysis.
• R provides graphical facilities for data analysis and display either
directly at the computer or printing at the papers.
As a conclusion, R is world’s most widely used statistics programming language. It's
the # 1 choice of data scientists and supported by a vibrant and talented community
of contributors. R is taught in universities and deployed in mission critical business
applications.
R Command Prompt
Once you have R environment setup, then it’s easy to start your R command prompt
by just typing the following command at your command prompt −
$R
This will launch R interpreter and you will get a prompt > where you can start typing
your program as follows −
> myString <- "Hello, World!"
> print ( myString)
[1] "Hello, World!"
Here first statement defines a string variable myString, where we assign a string
"Hello, World!" and then next statement print() is being used to print the value stored
in variable myString.
R Script File
Usually, you will do your programming by writing your programs in script files and
then you execute those scripts at your command prompt with the help of R interpreter
called Rscript. So let's start with writing following code in a text file called test.R as
under −
# My first program in R Programming
myString <- "Hello, World!"
print ( myString)
Save the above code in a file test.R and execute it at Linux command prompt as given
below. Even if you are using Windows or other system, syntax will remain same.
$ Rscript test.R
Comments
Comments are like helping text in your R program and they are ignored by the
interpreter while executing your actual program. Single comment is written using #
in the beginning of the statement as follows −
# My first program in R Programming
R does not support multi-line comments but you can perform a trick which is
something as follows −
if(FALSE) {
"This is a demo for multi-line comments and it should be put inside either a
single OR double quote"
}
myString <- "Hello, World!"
print ( myString)
[1] "Hello, World!"
Though above comments will be executed by R interpreter, they will not interfere
with your actual program. You should put such comments inside, either single or
double quote.
Statistical analysis in R is performed by using many in-built functions. Most of these
functions are part of the R base package. These functions take R vector as an input
along with the arguments and give the result.
The functions we are discussing in this chapter are mean, median and mode.
Mean
It is calculated by taking the sum of the values and dividing with the number of values
in a data series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
• x is the input vector.
• trim is used to drop some observations from both end of the sorted
vector.
• na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
When we execute the above code, it produces the following result −
[1] 8.22
Median
The middle most value in a data series is called the median. The median() function is
used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
Following is the description of the parameters used −
• x is the input vector.
• na.rm is used to remove the missing values from the input vector.
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
median.result <- median(x)
print(median.result)
Mode
The mode is the value that has highest number of occurrences in a set of data. Unike
mean and median, mode can have both numeric and character data.
R does not have a standard in-built function to calculate mode. So we create a user
function to calculate mode of a data set in R. This function takes the vector as input
and gives the mode value as output.
Example
# Create the function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Create the vector with numbers.
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
# Calculate the mode using the user function.
result <- getmode(v)
print(result)
# Create the vector with characters.
charv <- c("o","it","the","it","it")
# Calculate the mode using the user function.
result <- getmode(charv)
print(result)
When we execute the above code, it produces the following result −
[1] 2
[1] "it"
Information About the Data Set
You can use the question mark (?) to get information about the mtcars data
set:
Example
# Use the question mark to get information about the data set
?mtcars
Result:
mtcars {datasets} R Documentation
Motor Trend Car Road Tests
Description
The data was extracted from the 1974 Motor Trend US magazine, and
comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973-74 models).
Usage
mtcars
Format
A data frame with 32 observations on 11 (numeric) variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
Note
Get Information
Use the dim() function to find the dimensions of the data set, and
the names() function to view the names of the variables:
Example
Data_Cars <- mtcars # create a variable of the mtcars data set for
better organization
# Use dim() to find the dimension of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)
Result:
[1] 32 11
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am"
"gear"
[11] "carb"
Use the rownames() function to get the name of each row in the first column,
which is the name of each car:
Example
Data_Cars <- mtcars
rownames(Data_Cars)
Result:
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230"
[10] "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac
Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
[22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
[28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora"
Print Variable Values
If you want to print all values that belong to a variable, access the data
frame by using the $ sign, and the name of the variable (for
example cyl (cylinders)):
Example
Data_Cars <- mtcars
Data_Cars$cyl
Result:
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6
8 4
Sort Variable Values
To sort the values, use the sort() function:
Example
Data_Cars <- mtcars
sort(Data_Cars$cyl)
Result:
[1] 4 4 4 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8
8 8
Analyzing the Data
Now that we have some information about the data set, we can start to
analyze it with some statistical numbers.
For example, we can use the summary() function to get a statistical summary
of the data:
Example
Data_Cars <- mtcars
summary(Data_Cars)
Do not worry if you do not understand the output numbers. You will master
them shortly.
The summary() function returns six statistical numbers for each variable:
• Min
• First quantile (percentile)
• Median
• Mean
• Third quantile (percentile)
• Max
We will cover all of them, along with other statistical numbers in the next
chapters.
Max Min
In the previous chapter, we introduced the mtcars data set. We will continue
to use this data set throughout the next pages.
You learned from the R Math chapter that R has several built-in math
functions. For example, the min() and max() functions can be used to find the
lowest or highest value in a set:
Example
Find the largest and smallest value of the variable hp (horsepower).
Data_Cars <- mtcars
max(Data_Cars$hp)
min(Data_Cars$hp)
Result:
[1] 335
[1] 52
Outliers
Max and min can also be used to detect outliers. An outlier is a data point
that differs from rest of the observations.
Example of data points that could have been outliers in the mtcars data set:
• If maximum of forward gears of a car was 11
• If minimum of horsepower of a car was 0
• If maximum weight of a car was 50 000 lbs
Mean
To calculate the average value (mean) of a variable from the mtcars data set,
find the sum of all values, and divide the sum by the number of values.
Sorted observation of wt (weight)
1.513 1.615 1.835 1.935 2.140 2.200
2.620 2.770 2.780 2.875 3.150 3.170
3.435 3.440 3.440 3.440 3.460 3.520
3.730 3.780 3.840 3.845 4.070 5.250
Luckily for us, the mean() function in R can do it for you:
Median
The median value is the value in the middle, after you have sorted all the
values.
If we take a look at the values of the wt variable (from the mtcars data set),
we will see that there are two numbers in the middle:
Sorted observation of wt (weight)
1.513 1.615 1.835 1.935 2.140 2.200
2.620 2.770 2.780 2.875 3.150 3.170
3.435 3.440 3.440 3.440 3.460 3.520
3.730 3.780 3.840 3.845 4.070 5.250
Note: If there are two numbers in the middle, you must divide the sum of
those numbers by two, to find the median.
Luckily, R has a function that does all of that for you: Just use
the median() function to find the middle value:
Example
Find the mid point value of weight (wt):
Data_Cars <- mtcars
median(Data_Cars$wt)
Result:
[1] 3.325
Mode
The mode value is the value that appears the most number of times.
R does not have a function to calculate the mode. However, we can create
our own function to find it.
If we take a look at the values of the wt variable (from the mtcars data set),
we will see that the numbers 3.440 are often shown:
Sorted observation of wt (weight)
1.513 1.615 1.835 1.935 2.140 2.200
2.620 2.770 2.780 2.875 3.150 3.170
3.435 3.440 3.440 3.440 3.460 3.520
3.730 3.780 3.840 3.845 4.070 5.250
Instead of counting it ourselves, we can use the following code to find the
mode:
Example
Data_Cars <- mtcars
names(sort(-table(Data_Cars$wt)))[1]
Result:
[1] "3.44"
Quartiles
Quartiles are data divided into four parts, when sorted in an ascending order:
1. The value of the first quartile cuts off the first 25% of the data
2. The value of the second quartile cuts off the first 50% of the data
3. The value of the third quartile cuts off the first 75% of the data
4. The value of the fourth quartile cuts off the 100% of the data
Use the quantile() function to get the quartiles.
1.2 Install R packages
Packages are the fundamental units created by the community that
contains reproducible R code. These include reusable R functions,
documentation that describes how to use them and sample data.
The directory where packages are stored is called the library. R
comes with a standard set of packages. Others are available for
download and installation. Once installed, they have to be loaded
into the session to be used.
To install a package in R, we simply use the command
install.packages(“Name of the Desired Package”)
1.3 Loading the Data set
There are some data sets that are already pre-installed in R. Here,
we shall be using The Titanic data set that comes built-in R in the
Titanic Package.
While using any external data source, we can use the read command
to load the files(Excel, CSV, HTML and text files etc.)
This data set is also available at Kaggle. You may download the data
set, both train and test files. In this tutorial, we’d be just using the
train data set.
titanic <- read.csv(“C:/Users/Desktop/titanic.csv”,
header=TRUE, sep=”,”)
The above code reads the file titanic.csv into a dataframe titanic.
With Header=TRUE we are specifying that the data includes a
header(column names) and sep=”,” specifies that the values in data
are comma separated.
2. Understanding the Data set
We have used the Titanic data set that contains historical records of
all the passengers who on-boarded the Titanic. Below is a brief
description of the 12 variables in the data set :
• PassengerId: Serial Number
• Survived: Contains binary Values of 0 & 1. Passenger did
not survive — 0, Passenger Survived — 1.
• Pclass — Ticket Class | 1st Class, 2nd Class or 3rd Class
Ticket
• Name — Name of the passenger
• Sex — Male or Female
• Age — Age in years — Integer
• SibSp — No. of Siblings / Spouses — brothers, sisters
and/or husband/wife
• Parch — No. of parents/children — mother/father and/or
daughter, son
• Ticket — Serial Number
• Fare — Passenger fare
• Cabin — Cabin Number
• Embarked — Port of Embarkment | C- Cherbourg, Q —
Queenstown, S — Southhampton
2.1 Peek at your Data
Before we begin working on the dataset, let’s have a good look at the
raw data.
view(titanic)
This helps us in familiarising with the data set.
head(titanic,n) | tail(titanic,n)
In order to have a quick look at the data, we often use the
head()/tail().
Top 10 rows of the data set.
Bottom 5 rows of the data set.
In case we do not explicitly pass the value for n, it takes the default
value of 5, and displays 5 rows.
names(titanic)
This helps us in checking out all the variables in the data set.
Familiarising with all the Variables/Column Names
str(titanic)
This helps in understanding the structure of the data set, data type
of each attribute and number of rows and columns present in the
data.
summary(titanic)
A cursory look at the data
Summary() is one of the most important functions that help in
summarising each attribute in the dataset. It gives a set of
descriptive statistics, depending on the type of variable:
• In case of a Numerical Variable -> Gives Mean, Median,
Mode, Range and Quartiles.
• In case of a Factor Variable -> Gives a table with the
frequencies.
• In case of Factor + Numerical Variables -> Gives the
number of missing values.
• In case of character variables -> Gives the length and the
class.
In case we just need the summary statistic for a particular variable
in the dataset, we can use
summary(datasetName$VariableName) ->
summary(titanic$Pclass)
as.factor(dataset$ColumnName)
There are times when some of the variables in the data set are
factors but might get interpreted as numeric. For example, the
Pclass(Passenger Class) tales the values 1, 2 and 3, however, we
know that these are not to be considered as numeric, as these are
just levels. In order to such variables treated as factors and not as
numbers we need explicitly convert them to factors using the
function as.factor()
3. Analysis & Visualisations
Data Visualisation is an art of turning data into insights that can be
easily interpreted. In this tutorial, we’ll analyse the survival patterns
and check for factors that affected the same.
Points to think about
Now that we have an understanding of the dataset, and the
variables, we need to identify the variables of interest. Domain
knowledge and the correlation between variables help in choosing
these variables. To keep it simple, we have chosen only 3 such
variables, namely Age, Gender, Pclass.
What was the survival rate?
When talking about the Titanic data set, the first question that
comes up is “How many people did survive?”. Let’s have a simple
Bar Graph to demonstrate the same.
ggplot(titanic, aes(x=Survived)) + geom_bar()
On the X-axis we have the survived variable, 0 representing the
passengers that did not survive, and 1 representing the passengers
who survived. The Y -axis represents the number of passengers.
Here we see that over 550 passenger did not survive and ~ 340
passengers survived.
Let’s make is more clear by using checking out the percentages
prop.table(table(titanic$Survived))
Only 38.38% of the passengers who on-boarded the
titanic did survive.
Survival rate basis Gender
It is believed that in case of rescue operations during disasters,
woman’s safety is prioritised. Did the same happen back then?
We see that the survival rate amongst the women was
significantly higher when compared to men. The
survival ratio amongst women was around 75%,
whereas for men it was less than 20%.
Survival Rate basis Class of tickets (Pclass)
There were 3 segments of passengers, depending upon the class they
were travelling in, namely, 1st class, 2nd class and 3rd class. We see
that over 50% of the passengers were travelling in the 3rd class.
Survival Rate basis Passenger Class
1st and 2nd Class passengers disproportionately
survived, with over 60% survival rate of the 1st class
passengers, around 45–50% of 2nd class, and less
than 25% survival rate of those travelling in 3rd class.
I’ll leave you at the thought… Was it because of a preferential
treatment to the passengers travelling elite class, or the proximity, as
the 3rd class compartments were in the lower deck?
Survival Rate basis Class of tickets and Gender(pclass)
We see that the females in the 1st and 2nd class
had a very high survival rate. The survival rate for
the females travelling in 1st and 2nd class was
96% and 92% respectively, corresponding to 37%
and 16% for men. The survival rate for men
travelling 3rd class was less than 15%.
Till now it is evident that the Gender and Passenger class had
significant impact on the survival rates. Let’s now check the impact
of passenger’s Age on Survival Rate.
Survival rates basis age
Looking at the age<10 years section in the graph, we
see that the survival rate is high. And the survival rate
is low and drops beyond the age of 45.
Here we have used bin width of 5, you may try out different values
and see, how the graph changes.
Survival Rate basis Age, Gender and Class of tickets
This graph helps identify the survival patterns considering all the
three variables.
The top 3 sections depict the female survival patterns across the
three classes, while the bottom 3 represent the male survival
patterns across 3 classes. On the x-axis we have the Age.