R-PROGRAMMING FOR
DATA SCIENCE
[Link]
ASSOCIATE PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
P.S.R ENGINEERING COLLEGE, SIVAKASI
DATA SCIENCE PROJECTS THAT USE R
Several industries, such as banking, telecommunications, and media, use R for data science.
Following are some real-world examples of R in use.
1. T-Mobile employs R to classify customer support texts in order to route clients to an
appropriate agent.
2. Tweets can be mined for text analytics using R. The twitteR package supports text analytics and
scraping of Twitter data.
3. Google Analytics can be combined with R to perform statistical data analysis and build
meaningful data visualizations. This can be achieved by installing the RGoogleAnalytics package.
4. The Financial Times used R and the ggplot2 package to create the data visualizations for
featured articles such as "Is Russia-Saudi Arabia the worst World Cup game ever?"
5. The BBC uses data visualization in R to generate appealing graphics for its publications. The BBC has
developed an R package, bbplot, and an R cookbook to standardize their data
visualization graphic creation process.
R FOR DATA SCIENCE
• R is open-source software.
• R can be used to build machine learning and deep
learning models.
• R has a huge capability as a statistical tool.
• R is one of the strongest tools for depicting insights through
different graphs and charts.
ADVANTAGES OF R
• R is an open-source software platform that helps create interactive graphs
and provides rich visualization options, which makes results easier to explore.
• R has a large development community, various developer forums, and a very
friendly community of R enthusiasts.
• R packages are available from GitHub as well as from CRAN, an enormous catalog for use in data
analysis and data mining.
• There are many powerful R libraries for data science. For example, the R package
shiny allows developers to build interactive web applications directly using R.
• R Markdown allows R to produce various dynamic and static output formats such as
HTML, MS Word, and PDF.
DISADVANTAGES OF R
• R has a steep learning curve: its syntax is quite different from that of other languages and hence
slightly more challenging to learn than Python.
• R does not offer the basic security measures that are essential for
production-grade web applications.
• R can be slower than Python or MATLAB, and its memory management is weak:
R keeps objects in memory, so it can require a lot of memory.
SESSION CONTENT
• ADVANCED DATA HANDLING
• RESHAPING DATA
• APPENDING FRAMES
• MERGING DATA FRAMES
• RESHAPING DATA FRAMES
• TABULAR DATA
• WORKING WITH DATES
ADVANCED DATA HANDLING
• SENTIMENT ANALYSIS.
• UBER DATA ANALYSIS.
• MOVIE RECOMMENDATION SYSTEM.
• CREDIT CARD FRAUD DETECTION.
• WINE QUALITY PREDICTION.
• CUSTOMER SEGMENTATION.
• SPEECH EMOTION RECOGNITION.
• PRODUCT BUNDLE IDENTIFICATION.
RESHAPING DATA
• R - SPLIT, MERGE AND RESHAPE THE DATA FRAME USING
VARIOUS FUNCTIONS.
• TRANSPOSE OF A MATRIX
• JOINING ROWS AND COLUMNS
• MERGING OF DATA FRAMES
• MELTING AND CASTING
TRANSPOSE OF A MATRIX
• t() FUNCTION
• TAKES A MATRIX OR DATA FRAME AS INPUT AND GIVES
THE TRANSPOSE OF THAT MATRIX OR DATA FRAME AS ITS
OUTPUT.
• SYNTAX:
t(MATRIX / DATA FRAME)
R PROGRAM TO FIND THE
TRANSPOSE OF A MATRIX
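# R is case-sensitive: the functions are matrix(), print() and t()
first <- matrix(1:12, nrow = 4, byrow = TRUE)
print("ORIGINAL MATRIX")
first
first <- t(first)   # transpose: the 4 x 3 matrix becomes 3 x 4
print("TRANSPOSE OF THE MATRIX")
first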
OUTPUT
[1] "ORIGINAL MATRIX"
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
[1] "TRANSPOSE OF THE MATRIX"
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
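The t() function also accepts a data frame, in which case the result is a matrix rather than a data frame. A minimal sketch, assuming a small made-up data frame named scores (not part of the original slides):

```r
# Hypothetical data frame used only for illustration
scores <- data.frame(math = c(90, 75),
                     physics = c(80, 85),
                     chemistry = c(70, 95),
                     row.names = c("student1", "student2"))

transposed <- t(scores)   # rows become columns and columns become rows

is.matrix(transposed)     # TRUE: t() returns a matrix, not a data.frame
dim(transposed)           # 3 rows (subjects) and 2 columns (students)
print(transposed)
```

If a data frame is needed afterwards, the result can be wrapped in as.data.frame().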
APPENDING FRAMES
• JOINING ROWS AND COLUMNS IN DATA FRAME
• IN R, WE CAN JOIN TWO VECTORS OR COMBINE TWO DATA
FRAMES USING FUNCTIONS. THERE ARE BASICALLY TWO
FUNCTIONS THAT PERFORM THESE TASKS:
• cbind():
• WE CAN COMBINE VECTORS, MATRICES OR DATA FRAMES BY
COLUMNS USING THE cbind() FUNCTION.
• SYNTAX: cbind(x1, x2, x3)
• WHERE x1, x2 AND x3 CAN BE VECTORS, MATRICES OR
DATA FRAMES.
RBIND
• rbind():
• WE CAN COMBINE VECTORS, MATRICES OR DATA FRAMES BY
ROWS USING THE rbind() FUNCTION.
• SYNTAX: rbind(x1, x2, x3)
• WHERE x1, x2 AND x3 CAN BE VECTORS, MATRICES OR
DATA FRAMES.
# cbind AND rbind FUNCTIONS IN R
NAME <- c("SHAONI", "ESHA", "SOUMITRA", "SOUMI")
AGE <- c(24, 53, 62, 29)
ADDRESS <- c("PUDUCHERRY", "KOLKATA", "DELHI", "BANGALORE")
# cbind function: binding vectors of mixed types gives a character matrix
INFO <- cbind(NAME, AGE, ADDRESS)
print("COMBINING VECTORS INTO DATA FRAME USING CBIND ")
print(INFO)
# creating new data frame
NEWD <- data.frame(NAME = c("SOUNAK", "BHABANI"),
AGE = c("28", "87"),
ADDRESS = c("BANGALORE", "KOLKATA"))
# rbind function: the matrix INFO is converted to a data frame before the rows are bound
NEW.INFO <- rbind(INFO, NEWD)
print("COMBINING DATA FRAMES USING RBIND ")
print(NEW.INFO)
OUTPUT
[1] "COMBINING VECTORS INTO DATA FRAME USING CBIND "
     NAME       AGE  ADDRESS
[1,] "SHAONI"   "24" "PUDUCHERRY"
[2,] "ESHA"     "53" "KOLKATA"
[3,] "SOUMITRA" "62" "DELHI"
[4,] "SOUMI"    "29" "BANGALORE"
OUTPUT RBIND
[1] "COMBINING DATA FRAMES USING RBIND "
      NAME AGE    ADDRESS
1   SHAONI  24 PUDUCHERRY
2     ESHA  53    KOLKATA
3 SOUMITRA  62      DELHI
4    SOUMI  29  BANGALORE
5   SOUNAK  28  BANGALORE
6  BHABANI  87    KOLKATA
MERGING TWO DATA FRAMES
• In R, we can merge two data frames using the merge() function.
By default, rows are matched on the columns that have the same name in both
data frames; the key can also be named explicitly with the by argument.
• Syntax: merge(dfa, dfb, …)
MERGING DATA FRAMES
# merging two data frames in R
d1 <- data.frame(name = c("shaoni", "soumi", "arjun"),
id = c("111", "112", "113"))
d2 <- data.frame(name = c("sounak", "esha"),
id = c("114", "115"))
# all = TRUE keeps the rows of both data frames, matched or not
total <- merge(d1, d2, all = TRUE)
print(total)
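OUTPUT
    name  id
1  arjun 113
2 shaoni 111
3  soumi 112
4   esha 115
5 sounak 114
(This row order is produced when the columns are factors, the default before R 4.0.
With character columns, the rows are sorted alphabetically by name.)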
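Merging on a key value can be sketched as follows. The data frames students and marks and their columns are made up for illustration; only merge() and its by, all and all.x arguments come from base R.

```r
# Hypothetical data frames that share the key column "id"
students <- data.frame(id = c(111, 112, 113),
                       name = c("shaoni", "soumi", "arjun"))
marks <- data.frame(id = c(111, 113, 114),
                    score = c(82, 67, 91))

inner <- merge(students, marks, by = "id")                # ids present in both
left  <- merge(students, marks, by = "id", all.x = TRUE)  # keep every student
full  <- merge(students, marks, by = "id", all = TRUE)    # keep every id

print(inner)
print(left)   # score is NA for the student without marks
print(full)   # name is NA for the id without a student record
```

Rows that have no match on the other side are filled with NA when all, all.x or all.y is set to TRUE.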
RESHAPING DATA FRAMES
• DATA RESHAPING INVOLVES MANY STEPS IN ORDER TO
OBTAIN DESIRED OR REQUIRED FORMAT.
• ONE OF THE POPULAR METHODS IS MELTING THE DATA
WHICH CONVERTS EACH ROW INTO A UNIQUE ID-VARIABLE
COMBINATION AND THEN CASTING IT.
• THE TWO FUNCTIONS USED FOR THIS PROCESS:
MELTING
• melt():
• IT IS USED TO CONVERT A DATA FRAME INTO A MOLTEN DATA FRAME.
• SYNTAX: melt(data, …, na.rm = FALSE, value.name = "value")
• WHERE,
• data: DATA TO BE MELTED
• …: FURTHER ARGUMENTS
• na.rm: CONVERTS EXPLICIT MISSINGS INTO IMPLICIT MISSINGS
• value.name: NAME OF THE VARIABLE USED TO STORE VALUES
CASTING
• dcast():
• IT IS USED TO AGGREGATE THE MOLTEN DATA FRAME INTO A
NEW FORM.
• SYNTAX: dcast(data, formula, fun.aggregate)
• WHERE,
• data: MOLTEN DATA TO BE CAST
• formula: FORMULA THAT DEFINES HOW TO CAST
• fun.aggregate: AGGREGATION FUNCTION, USED IF THE CAST REQUIRES DATA AGGREGATION
# MELT AND CAST
library(reshape2)   # provides melt() and dcast()
# numeric columns so that mean() can be applied when casting
a <- data.frame(id = c(1, 1, 2, 2), points = c(1, 2, 1, 2),
x1 = c(5, 3, 6, 2), x2 = c(6, 5, 1, 4))
print("melting")
m <- melt(a, id.vars = c("id", "points"))
print(m)
print("casting")
# cast the molten data frame m (not a), averaging the values per id
idmn <- dcast(m, id ~ variable, mean)
print(idmn)
MELTING
[1] "melting"
  id points variable value
1  1      1       x1     5
2  1      2       x1     3
3  2      1       x1     6
4  2      2       x1     2
5  1      1       x2     6
6  1      2       x2     5
7  2      1       x2     1
8  2      2       x2     4
CASTING
[1] "casting"
  id x1  x2
1  1  4 5.5
2  2  4 2.5
TABULAR DATA
• LOADING TABULAR DATA
• INSPECTING data.frame OBJECTS
• INDEXING AND SUBSETTING DATA FRAMES
• CATEGORICAL DATA: FACTORS
• CONVERTING FACTORS
• RENAMING FACTORS
R FUNCTION
• download.file() to download the CSV file that contains the traffic stop
data
download.file("[Link]",
"data/ms_trafficstops_bw.csv")
• read.csv() to load the content of the CSV file into memory as an object of
class data.frame.
trafficstops <- read.csv("data/ms_trafficstops_bw.csv")
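The loading step can be tried without downloading anything. The sketch below writes a tiny made-up CSV to a temporary file and reads it back; the file content, the column names and the object name stops are assumptions for illustration, not the real traffic stop data.

```r
# Write a small, made-up CSV to a temporary file so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,county_name,driver_race",
             "1,Webster County,Black",
             "2,Jones County,White"),
           csv_path)

stops <- read.csv(csv_path)   # returns an object of class data.frame

class(stops)
head(stops)
```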
DISPLAY THE FIRST 6 LINES
• check the top (the first 6 lines) of this data frame using the
function head():
head(trafficstops)
INSPECTING data.frame OBJECTS
• INSPECTING THE STRUCTURE OF A DATA FRAME WITH THE
FUNCTION str():
str(trafficstops)
• THE FUNCTIONS head() AND str() CAN BE USEFUL TO CHECK
THE CONTENT AND THE STRUCTURE OF A DATA FRAME
NON-EXHAUSTIVE LIST OF FUNCTIONS TO
GET A SENSE OF THE CONTENT/STRUCTURE
OF THE DATA.
• size:
• dim(trafficstops) - returns a vector with the number of rows in the first element,
and the number of columns as the second element (the dimensions of the
object)
• nrow(trafficstops) - returns the number of rows
• ncol(trafficstops) - returns the number of columns
• length(trafficstops) - returns number of columns
• content:
• head(trafficstops) - shows the first 6 rows
• tail(trafficstops) - shows the last 6 rows
• names:
• names(trafficstops) - returns the column names (synonym
of colnames() for data.frame objects)
• rownames(trafficstops) - returns the row names
• summary:
• str(trafficstops) - structure of the object and information about the class,
length and content of each column
• summary(trafficstops) - summary statistics for each column
• most of these functions are generic: they can be used on other types of objects besides data.frame
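The inspection functions listed above can be tried on any data frame. A minimal sketch, assuming a small made-up data frame named stops in place of trafficstops:

```r
# Small made-up data frame standing in for trafficstops
stops <- data.frame(id = 1:8,
                    county_name = rep(c("Webster County", "Jones County"), 4),
                    driver_age = c(23, 41, 35, 58, 19, 62, 30, 47))

dim(stops)       # number of rows, then number of columns
nrow(stops)
ncol(stops)
length(stops)    # for a data frame this is the number of columns
names(stops)
head(stops, 3)   # n can be changed from the default of 6
tail(stops, 3)
str(stops)
summary(stops)
```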
INDEXING: SPECIFYING ROW AND COLUMN
COORDINATES IN DIFFERENT WAYS LEADS TO
RESULTS WITH DIFFERENT CLASSES
• trafficstops[1, 1] # first element in the first column of the data frame (as
a vector)
• trafficstops[1, 6] # first element in the 6th column (as a vector)
• trafficstops[, 1] # first column in the data frame (as a vector)
• trafficstops[1] # first column in the data frame (as a data.frame)
• trafficstops[1:3, 7] # first three elements in the 7th column (as a vector)
• trafficstops[3, ] # the 3rd row (as a data.frame)
• trafficstops[1:6, ] # the 1st to 6th rows, equivalent to head(trafficstops)
• trafficstops[, -1] # the whole data frame, excluding the first column
• trafficstops[-c(7:211211),] # equivalent to head(trafficstops)
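The class of each result can be checked directly. This is a sketch on a made-up two-column data frame named stops; with only two columns, removing one of them leaves a single column, which R simplifies to a vector unless drop = FALSE is given.

```r
# Made-up data frame used only to show the class of each kind of subset
stops <- data.frame(id = 1:4,
                    county_name = c("Webster County", "Jones County",
                                    "Webster County", "Hinds County"))

class(stops[1, 1])                # a single element: a vector
class(stops[, 2])                 # one column, with a comma: a vector
class(stops[2])                   # one column, without a comma: a data.frame
class(stops[3, ])                 # one row: a data.frame
class(stops[, -1])                # a single column is left: no longer a data.frame
class(stops[, -1, drop = FALSE])  # drop = FALSE keeps the data.frame
```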
IN A data.frame (OR MATRIX),
COLUMNS CAN BE CALLED BY NAME
• trafficstops["violation_raw"] # RESULT IS A data.frame
• trafficstops[, "violation_raw"] # RESULT IS A VECTOR
• trafficstops[["violation_raw"]] # RESULT IS A VECTOR
• trafficstops$violation_raw # RESULT IS A VECTOR
CONDITIONAL SUBSETTING
• To extract a subset of a data frame based on certain conditions:
# the condition:
# returns a logical vector of the length of the column
trafficstops$county_name == "Webster County"
# use this vector to extract rows and all columns
# note the comma: we want *all* columns
trafficstops[trafficstops$county_name == "Webster County", ]
# assign extract to a new data frame
webster_trafficstops <- trafficstops[trafficstops$county_name == "Webster County", ]
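The same pattern can be sketched on a small made-up data frame (stops and its columns are assumptions for illustration). It also shows how conditions can be combined and how the base R function subset() expresses the same filter.

```r
stops <- data.frame(county_name = c("Webster County", "Jones County",
                                    "Webster County", "Hinds County"),
                    driver_age = c(23, 41, 35, 58))

# logical vector with one entry per row
is_webster <- stops$county_name == "Webster County"

# keep the rows where the condition is TRUE, and all columns
webster_stops <- stops[is_webster, ]

# conditions can be combined with & (and) and | (or)
older_webster <- stops[is_webster & stops$driver_age > 30, ]

# subset() evaluates the condition inside the data frame
same_rows <- subset(stops, county_name == "Webster County")

print(webster_stops)
print(older_webster)
```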
CATEGORICAL DATA: FACTORS
• FACTORS ARE USED TO REPRESENT CATEGORICAL DATA.
FACTORS CAN BE ORDERED OR UNORDERED, AND
UNDERSTANDING THEM IS NECESSARY FOR STATISTICAL
ANALYSIS AND FOR PLOTTING.
• FACTORS ARE STORED AS INTEGERS, AND HAVE LABELS
(TEXT) ASSOCIATED WITH THESE UNIQUE INTEGERS.
• WHILE FACTORS LOOK (AND OFTEN BEHAVE) LIKE
CHARACTER VECTORS, THEY ARE STORED AS INTEGERS, SO CARE IS
NEEDED WHEN TREATING THEM AS STRINGS.
• Once created, factors can only contain a pre-defined set of values,
known as levels. By default, R always sorts levels in alphabetical order.
For instance, if you have a factor with 2 levels:
• party <- factor(c("republican", "democrat", "democrat", "republican"))
R will assign 1 to the level "democrat" and
2 to the level "republican"
(because d comes before r, even though the first element in this
vector is "republican").
Check this by using the function levels(), and check the number of
levels using nlevels():
levels(party)
nlevels(party)
CONVERTING FACTORS
• To convert a factor to a character vector, use as.character(x):
• as.character(party)
• [Link](party)
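Because factors are stored as integers, converting them needs some care. The sketch below reuses the party factor from the earlier slide; the years factor is a made-up example of labels that look like numbers.

```r
party <- factor(c("republican", "democrat", "democrat", "republican"))

as.character(party)   # the labels, as a character vector
as.integer(party)     # the underlying integer codes: 2 1 1 2

# Pitfall: a factor whose labels look like numbers
years <- factor(c("1990", "1983", "1977", "1998", "1990"))
as.numeric(years)                  # integer codes, not the years
as.numeric(as.character(years))    # convert via character to recover the years
```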
RENAMING FACTORS
• When your data is stored as a factor, you can use the plot() function to
get a quick glance at the number of observations represented by each
factor level. Let’s look at the number of Black and White drivers in the dataset:
# coerce the column "driver_race" into a factor
trafficstops$driver_race <- factor(trafficstops$driver_race)
# as.factor() performs the same coercion
trafficstops$driver_race <- as.factor(trafficstops$driver_race)
# bar plot of the number of black and white drivers stopped:
plot(trafficstops$driver_race)
WORKING WITH DATES
• THE as.Date() FUNCTION
• THIS FUNCTION ALLOWS US TO CREATE A DATE VALUE
(WITHOUT TIME) IN R PROGRAMMING. IT ALSO ACCEPTS
VARIOUS INPUT FORMATS OF THE DATE VALUE
THROUGH THE format = ARGUMENT.
• THE STANDARD DATE FORMAT IS "YYYY-MM-DD"
as.Date() FUNCTION
• as.Date() takes a date value as an argument.
• The date value is given as the input, as a character string.
EXAMPLE 2 : as.Date() FUNCTION
• INPUT IS NOT IN THE STANDARD FORMAT
1. USE THE format = ARGUMENT TO DESCRIBE THE INPUT, SO THAT THE DATE
VALUES ARE ARRANGED AND PRESENTED IN THE STANDARD FORM.
• %d - THE DAY OF THE MONTH AS A NUMBER
• %m - THE MONTH AS A NUMBER
• %Y - THE YEAR IN THE "YYYY" FORMAT
• FOR A YEAR VALUE IN TWO DIGITS, USE "%y" INSTEAD OF "%Y".
• If the input contains a month name instead of a month number, we can
use the %b specifier in the format = argument of the
as.Date() function.
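The format specifiers above can be sketched as follows. The date strings are made up for illustration, and parsing with %b depends on the locale, so the month-name example assumes English month names.

```r
# The standard format needs no format argument
d1 <- as.Date("2023-01-15")

# Non-standard inputs: describe the layout with format =
d2 <- as.Date("15/01/2023", format = "%d/%m/%Y")    # day/month/4-digit year
d3 <- as.Date("15-01-23", format = "%d-%m-%y")      # 2-digit year uses lowercase %y
d4 <- as.Date("15 Jan 2023", format = "%d %b %Y")   # month name (locale-dependent)

print(c(d1, d2, d3, d4))

# A Date can also be turned back into text with format()
format(d1, "%d/%m/%Y")
```

All four calls describe the same day; note that the specifiers are case-sensitive, so %m (month) and %M (minute) are not interchangeable.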
USING THE Sys.Date(), Sys.time()
FUNCTIONS
IN R PROGRAMMING
1. Sys.Date() RETURNS THE SYSTEM DATE.
YOU DON’T NEED TO ADD AN ARGUMENT INSIDE THE
PARENTHESES OF THIS FUNCTION.
2. Sys.timezone() RETURNS THE TIMEZONE
BASED ON THE LOCATION AT WHICH THE USER IS RUNNING
THE CODE ON THE SYSTEM.
3. Sys.time() RETURNS THE
CURRENT DATE AS WELL AS THE TIME OF THE SYSTEM WITH
THE TIMEZONE DETAILS.
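A short sketch of the three calls; the variable names are arbitrary and the printed values depend on the machine that runs the code.

```r
today <- Sys.Date()       # current date, class "Date"
tz <- Sys.timezone()      # timezone name of the system; the value depends on the machine
right_now <- Sys.time()   # current date and time, class "POSIXct"

class(today)
class(right_now)
format(right_now, "%Y-%m-%d %H:%M:%S %Z")   # date, time and timezone abbreviation
as.Date(right_now)                          # drop the time part
```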
USING THE LUBRIDATE
PACKAGE
• now() gives us the current date, current time, and the current
timezone details in a single call
• Install the package "lubridate":
• install.packages("lubridate")
EXTRACTION AND MANIPULATION
OF THE PARTS OF THE DATE
• WITH THE "LUBRIDATE" PACKAGE, IT BECOMES EASIER TO
EXTRACT AND MANIPULATE PARTS OF
THE DATE VALUE.
• THERE ARE VARIOUS FUNCTIONS IN THE PACKAGE THAT
ALLOW US TO EXTRACT THE YEAR, MONTH, WEEK, ETC.
FROM THE DATE.
EXAMPLE CODE FOR EXTRACTION
OF DIFFERENT DATE COMPONENTS
• Create a date variable named “x,” which contains three different date values.
• The year() function allows us to extract the year values for each element of
the vector.
• The month() function takes a single date value or a vector that contains dates
as element and extracts the month from those as numbers.
• What if we wanted the abbreviated names for each month from dates? We
have to add the "label = TRUE" argument to the month() function to
see the month names in abbreviated form.
• If we use the "abbr = FALSE" argument in the month() function along
with "label = TRUE", we will get the full month names.
• To extract the days from the given date values, you can use
the mday() function. You will get the days as numbers.
• The wday() function returns the weekdays as numbers by
default. However, when we use "label = TRUE" and "abbr =
FALSE" as additional arguments, it returns the name of the
weekday on which each date falls.
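The extraction functions described above can be sketched as follows, assuming the lubridate package is installed. The three dates in x are made up for illustration.

```r
library(lubridate)   # assumes install.packages("lubridate") has been run

# A date vector with three made-up values
x <- as.Date(c("2021-03-15", "2022-11-02", "2023-07-29"))

year(x)                                # 2021 2022 2023
month(x)                               # 3 11 7
month(x, label = TRUE)                 # abbreviated month names
month(x, label = TRUE, abbr = FALSE)   # full month names
mday(x)                                # 15 2 29
wday(x)                                # weekday as a number
wday(x, label = TRUE, abbr = FALSE)    # weekday names
```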
EXAMPLE CODE WITH OUTPUT
FOR DATES MANIPULATION IN R
• we are using ymd() function on the given vector. this function converts
the date values from the vector into a format that is suitable for the
manipulation.
• we can add or subtract the year values from each element of the vector.
it is similar to adding or subtracting components from a numeric vector.
the function we have used here is years().
• In the same way, we can use months() to add or subtract month
values for each vector element.
• We can use the mday() function to update the day for each date in
the given vector.
• The update() function combines all of these: it allows
you to set the year, month, and day of each element of the given
vector in a single call.
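The manipulation steps can be sketched as follows, again assuming lubridate is installed and using made-up dates. Note that update() replaces the named components rather than adding to them.

```r
library(lubridate)   # assumes the lubridate package is installed

# Parse made-up date strings into Date values
x <- ymd(c("2021-03-15", "2022-11-02", "2023-07-29"))

later <- x + years(1)      # one year later
earlier <- x - months(2)   # two months earlier

mday(x) <- 1               # set the day of the month to 1 for every element
print(x)

# update() sets several components in a single call
fixed <- update(x, year = 2020, month = 12, mday = 25)
print(fixed)
```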
REFERENCES
• HTTPS://[Link]/TARAGONMD/PHDS/WORKING-WI
[Link]
• R PROGRAMMING FOR DATA SCIENCE BY ROGER D.
PENG
MOVIE RECOMMENDATION
SYSTEM