R-Programming For Data Science

The document discusses using R for data science projects. It gives examples of organizations that use R for tasks such as classifying customer support texts, analyzing tweets, and creating data visualizations: T-Mobile for text classification, text analysis of Twitter data, and the Financial Times and BBC for visualizations. It also covers advantages of R such as being open source, having a large developer community, and offering powerful libraries for data science tasks.

R-PROGRAMMING FOR

DATA SCIENCE

[Link]
ASSOCIATE PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
P.S.R ENGINEERING COLLEGE, SIVAKASI

DATA SCIENCE PROJECTS THAT USE R
Several industries, such as banking, telecommunications, and media, use R for data science.
The following are some real-world examples of R in use.
1. T-Mobile uses R to classify customer support texts so that each client is routed to an appropriate agent.
2. Twitter tweets can be analyzed as text using R. The twitteR package supports text analytics and scraping of Twitter data.
3. Google Analytics can be combined with R to perform statistical data analysis and build meaningful data visualizations. This can be achieved by installing the RGoogleAnalytics package.
4. The Financial Times created data visualizations purely with R and the ggplot2 package for featured articles such as "Is Russia-Saudi Arabia the worst World Cup game ever?"
5. The BBC uses data visualization in R to generate appealing graphics for its publications. The BBC has developed an R package, bbplot, which builds on ggplot2, and an R cookbook to standardize its data visualization workflow.


R FOR DATA SCIENCE

• R is open-source software. 
• R can be used to build machine learning and deep learning models.
• R has a huge capability as a statistical tool.
• R is probably the best visualization tool for depicting insights through
different graphs and charts.


ADVANTAGES OF R
• R is an open-source software platform that helps create interactive graphs
and provides rich visual output, which makes it more user-friendly.
• R has a big development community, various developer forums, and a very
friendly community of R enthusiasts.
• R packages are available from GitHub as well as from CRAN, an enormous catalog of packages for
data analysis and data mining.
• There are many powerful R libraries for data science. For example, the R package
shiny allows developers to build interactive web applications directly in R.
• R Markdown allows R to produce various dynamic and static output formats such as
HTML, MS Word, and PDF.


DISADVANTAGES OF R

• R has a steep learning curve: its syntax is quite different from that of other languages and
hence slightly more challenging to learn than Python.
• R does not offer the basic security measures that are essential for production-grade
web applications.
• R is slower than Python or MATLAB, and its memory management is weak:
R keeps its data in memory, so it requires a lot of memory.


SESSION CONTENT
• ADVANCED DATA HANDLING
• RESHAPING DATA
• APPENDING FRAMES
• MERGING DATA FRAMES
• RESHAPING DATA FRAMES
• TABULAR DATA
• WORKING WITH DATES



ADVANCED DATA HANDLING

• SENTIMENT ANALYSIS.
• UBER DATA ANALYSIS.
• MOVIE RECOMMENDATION SYSTEM.
• CREDIT CARD FRAUD DETECTION.
• WINE QUALITY PREDICTION.
• CUSTOMER SEGMENTATION.
• SPEECH EMOTION RECOGNITION.
• PRODUCT BUNDLE IDENTIFICATION.

RESHAPING DATA

• R - SPLIT, MERGE AND RESHAPE THE DATA FRAME USING VARIOUS FUNCTIONS.
• TRANSPOSE OF A MATRIX
• JOINING ROWS AND COLUMNS
• MERGING OF DATA FRAMES
• MELTING AND CASTING


TRANSPOSE OF A MATRIX

• t() FUNCTION
• TAKES A MATRIX OR DATA FRAME AS INPUT AND GIVES
THE TRANSPOSE OF THAT MATRIX OR DATA FRAME AS ITS
OUTPUT.
• SYNTAX:
t(matrix / data frame)


R PROGRAM TO FIND THE
TRANSPOSE OF A MATRIX

# build a 4 x 3 matrix, filled row by row
first <- matrix(c(1:12), nrow = 4, byrow = TRUE)
print("ORIGINAL MATRIX")
first

# t() swaps rows and columns, giving a 3 x 4 matrix
first <- t(first)
print("TRANSPOSE OF THE MATRIX")
first

OUTPUT

[1] "ORIGINAL MATRIX"
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12

[1] "TRANSPOSE OF THE MATRIX"
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12


APPENDING FRAMES

• JOINING ROWS AND COLUMNS IN DATA FRAME
• IN R, WE CAN JOIN TWO VECTORS OR MERGE TWO DATA
FRAMES USING FUNCTIONS. THERE ARE BASICALLY TWO
FUNCTIONS THAT PERFORM THESE TASKS:
• cbind():
• WE CAN COMBINE VECTORS, MATRICES OR DATA FRAMES BY
COLUMNS USING THE cbind() FUNCTION.
• SYNTAX: cbind(x1, x2, x3)
• WHERE x1, x2 AND x3 CAN BE VECTORS OR MATRICES OR
DATA FRAMES.


RBIND

rbind():
• WE CAN COMBINE VECTORS, MATRICES OR DATA FRAMES BY
ROWS USING THE rbind() FUNCTION.
• SYNTAX: rbind(x1, x2, x3)
• WHERE x1, x2 AND x3 CAN BE VECTORS OR MATRICES OR
DATA FRAMES.


# cbind() AND rbind() FUNCTIONS IN R
name <- c("SHAONI", "ESHA", "SOUMITRA", "SOUMI")
age <- c(24, 53, 62, 29)
address <- c("PUDUCHERRY", "KOLKATA", "DELHI", "BANGALORE")

# cbind(): combine the vectors by columns.
# Binding plain vectors gives a character matrix, not a data frame.
info <- cbind(name, age, address)
print("COMBINING VECTORS USING CBIND")
print(info)

# create a new data frame with the same column names
newd <- data.frame(name = c("SOUNAK", "BHABANI"),
                   age = c("28", "87"),
                   address = c("BANGALORE", "KOLKATA"))

# rbind(): append the rows of newd; because newd is a data frame,
# the result is a data frame
new.info <- rbind(info, newd)
print("COMBINING DATA FRAMES USING RBIND")
print(new.info)

OUTPUT CBIND
[1] "COMBINING VECTORS USING CBIND"
     name       age  address
[1,] "SHAONI"   "24" "PUDUCHERRY"
[2,] "ESHA"     "53" "KOLKATA"
[3,] "SOUMITRA" "62" "DELHI"
[4,] "SOUMI"    "29" "BANGALORE"

OUTPUT RBIND
[1] "COMBINING DATA FRAMES USING RBIND"
      name age    address
1   SHAONI  24 PUDUCHERRY
2     ESHA  53    KOLKATA
3 SOUMITRA  62      DELHI
4    SOUMI  29  BANGALORE
5   SOUNAK  28  BANGALORE
6  BHABANI  87    KOLKATA

MERGING TWO DATA FRAMES

• In R, we can merge two data frames using the merge() function. By default, the
data frames are joined on the columns whose names they have in common; a key
column can also be named explicitly with the by argument.

• Syntax: merge(dfa, dfb, …)
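
Merging on a key value can be sketched as follows. This is a minimal illustration, not part of the original slides: the data frames students and marks and their values are hypothetical, and "id" is assumed to be the shared key column.

```r
# Hypothetical data: two tables that share the key column "id"
students <- data.frame(id = c(111, 112, 113),
                       name = c("shaoni", "soumi", "arjun"))
marks <- data.frame(id = c(112, 113, 114),
                    score = c(81, 67, 90))

# Inner join (default): keep only the ids present in both data frames
inner <- merge(students, marks, by = "id")

# Full outer join: keep every id and fill the gaps with NA
outer <- merge(students, marks, by = "id", all = TRUE)

print(inner)
print(outer)
```

With all = TRUE, rows that have no partner in the other data frame are kept and their missing columns are filled with NA.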


MERGING DATA FRAMES

# merging two data frames in R
d1 <- data.frame(name = c("shaoni", "soumi", "arjun"),
                 id = c("111", "112", "113"))
d2 <- data.frame(name = c("sounak", "esha"),
                 id = c("114", "115"))

# all = TRUE keeps the unmatched rows from both data frames (full outer join)
total <- merge(d1, d2, all = TRUE)
print(total)


OUTPUT

    name  id
1  arjun 113
2 shaoni 111
3  soumi 112
4   esha 115
5 sounak 114


RESHAPING DATA FRAMES

• DATA RESHAPING INVOLVES MANY STEPS IN ORDER TO
OBTAIN THE DESIRED OR REQUIRED FORMAT.
• ONE OF THE POPULAR METHODS IS MELTING THE DATA,
WHICH CONVERTS EACH ROW INTO A UNIQUE ID-VARIABLE
COMBINATION, AND THEN CASTING IT.
• THE TWO FUNCTIONS USED FOR THIS PROCESS ARE melt() AND dcast().


MELTING

• melt():
• IT IS USED TO CONVERT A DATA FRAME INTO A MOLTEN DATA FRAME.
• SYNTAX: melt(data, ..., na.rm = FALSE, value.name = "value")
• WHERE,
• data: DATA TO BE MELTED
• ...: FURTHER ARGUMENTS
• na.rm: CONVERTS EXPLICIT MISSINGS INTO IMPLICIT MISSINGS
• value.name: NAME OF THE COLUMN USED FOR STORING VALUES


CASTING

• dcast():
• IT IS USED TO AGGREGATE THE MOLTEN DATA FRAME INTO A
NEW FORM.
• SYNTAX: dcast(data, formula, fun.aggregate)
• WHERE,
• data: MOLTEN DATA FRAME TO BE CAST
• formula: FORMULA THAT DEFINES HOW TO CAST
• fun.aggregate: AGGREGATION FUNCTION, USED IF THERE IS A DATA AGGREGATION

# MELT AND CAST
# melt() and dcast() come from the reshape2 package
library(reshape2)
a <- data.frame(id = c("1", "1", "2", "2"),
                points = c("1", "2", "1", "2"),
                x1 = c(5, 3, 6, 2),
                x2 = c(6, 5, 1, 4))
print("melting")
# stack x1 and x2 into variable/value pairs, keeping id and points as identifiers
m <- melt(a, id.vars = c("id", "points"))
print(m)
print("casting")
# cast the molten data, averaging value over points for each id
idmn <- dcast(m, id ~ variable, mean)
print(idmn)


MELTING

[1] "melting"
  id points variable value
1  1      1       x1     5
2  1      2       x1     3
3  2      1       x1     6
4  2      2       x1     2
5  1      1       x2     6
6  1      2       x2     5
7  2      1       x2     1
8  2      2       x2     4


CASTING

[1] "casting"
  id x1  x2
1  1  4 5.5
2  2  4 2.5


TABULAR DATA

• LOADING TABULAR DATA
• INSPECTING data.frame OBJECTS
• INDEXING AND SUBSETTING DATA FRAMES
• CATEGORICAL DATA: FACTORS
• CONVERTING FACTORS
• RENAMING FACTORS


R FUNCTION

• download.file() to download the CSV file that contains the traffic stop
data
download.file("[Link]",
              "data/ms_trafficstops_bw.csv")
• read.csv() to load the content of the CSV file into memory as an object of
class data.frame.
trafficstops <- read.csv("data/ms_trafficstops_bw.csv")
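
The loading step can be tried without the download. The sketch below is not from the slides: it writes a small, invented stand-in table (the name stops and all of its values are hypothetical) to a temporary CSV file and reads it back with read.csv().

```r
# Hypothetical stand-in for the traffic stop data (values invented for illustration)
stops <- data.frame(id = c("MS-1", "MS-2", "MS-3"),
                    county_name = c("Webster County", "Jones County", "Webster County"),
                    driver_age = c(34, 51, 27))

# Write the table to a temporary CSV file, then load it back
csv_path <- tempfile(fileext = ".csv")
write.csv(stops, csv_path, row.names = FALSE)
loaded <- read.csv(csv_path)

class(loaded)   # read.csv() returns an object of class data.frame
head(loaded)
```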


DISPLAY THE FIRST 6 LINES

• Check the top (the first 6 lines) of this data frame using the
function head():
head(trafficstops)


INSPECTING data.frame OBJECTS
• INSPECT THE STRUCTURE OF A DATA FRAME WITH THE
FUNCTION str():
str(trafficstops)
• THE FUNCTIONS head() AND str() ARE USEFUL TO CHECK
THE CONTENT AND THE STRUCTURE OF A DATA FRAME.


NON-EXHAUSTIVE LIST OF FUNCTIONS TO
GET A SENSE OF THE CONTENT/STRUCTURE
OF THE DATA.

• size:
• dim(trafficstops) - returns a vector with the number of rows as the first element
and the number of columns as the second element (the dimensions of the
object)
• nrow(trafficstops) - returns the number of rows
• ncol(trafficstops) - returns the number of columns
• length(trafficstops) - returns the number of columns


• content:
• head(trafficstops) - shows the first 6 rows
• tail(trafficstops) - shows the last 6 rows

• names:
• names(trafficstops) - returns the column names (synonym
of colnames() for data.frame objects)
• rownames(trafficstops) - returns the row names


• summary:
• str(trafficstops) - structure of the object and information about the class,
length and content of each column
• summary(trafficstops) - summary statistics for each column

• Most of these functions are generic: they can also be used on other types of
objects, not only on data frames.
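
The inspection functions above can be tried on any data frame. A minimal sketch, assuming a small invented table named stops in place of trafficstops (the column names mirror the slides, the values are hypothetical):

```r
# Hypothetical stand-in for trafficstops
stops <- data.frame(
  county_name = c("Webster County", "Jones County", "Webster County", "Hinds County"),
  driver_race = c("Black", "White", "White", "Black"),
  driver_age = c(34, 51, 27, 45)
)

# size
dim(stops)       # number of rows, then number of columns
nrow(stops)
ncol(stops)
length(stops)    # for a data frame this is the number of columns

# content
head(stops)
tail(stops)

# names
names(stops)
rownames(stops)

# summary
str(stops)
summary(stops)
```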


SPECIFYING THESE COORDINATES
LEADS TO RESULTS WITH
DIFFERENT CLASSES
• trafficstops[1, 1] # first element in the first column of the data frame (as
a vector)
• trafficstops[1, 6] # first element in the 6th column (as a vector)
• trafficstops[, 1] # first column in the data frame (as a vector)
• trafficstops[1] # first column in the data frame (as a data.frame)
• trafficstops[1:3, 7] # first three elements in the 7th column (as a vector)
• trafficstops[3, ] # the 3rd row (as a data.frame)
• trafficstops[1:6, ] # the 1st to 6th rows, equivalent to head(trafficstops)
• trafficstops[, -1] # the whole data frame, excluding the first column
• trafficstops[-c(7:211211),] # equivalent to head(trafficstops)


IN A data.frame (OR MATRIX),
COLUMNS CAN BE CALLED BY
NAME
• trafficstops["violation_raw"] # result is a data.frame
• trafficstops[, "violation_raw"] # result is a vector
• trafficstops[["violation_raw"]] # result is a vector
• trafficstops$violation_raw # result is a vector
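
How the class of the result depends on the indexing style can be checked directly. A minimal sketch on an invented table (stops and its values are hypothetical stand-ins for trafficstops):

```r
# Hypothetical stand-in for trafficstops
stops <- data.frame(
  county_name = c("Webster County", "Jones County", "Webster County", "Hinds County"),
  driver_race = c("Black", "White", "White", "Black"),
  driver_age = c(34, 51, 27, 45)
)

col_vec <- stops[, 1]                   # first column, as a vector
col_df <- stops[1]                      # first column, as a data.frame
row_df <- stops[3, ]                    # third row, as a data.frame
by_name_df <- stops["driver_age"]       # column by name, as a data.frame
by_name_vec <- stops[["driver_age"]]    # column by name, as a vector
dollar_vec <- stops$driver_age          # column by name, as a vector
no_first <- stops[, -1]                 # everything except the first column

class(col_vec)
class(col_df)
class(by_name_vec)
```

Single brackets with one index keep the data frame structure, while double brackets, $, and the [row, column] form with a single column drop down to a plain vector.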


CONDITIONAL SUBSETTING

• To extract a subset of a data frame based on certain conditions:
• # the condition: returns a logical vector of the length of the column
• trafficstops$county_name == "webster county"
• # use this vector to extract rows and all columns
• # note the comma: we want *all* columns
• trafficstops[trafficstops$county_name == "webster county", ]
• # assign the extract to a new data frame
• webster_trafficstops <- trafficstops[trafficstops$county_name == "webster county", ]
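
The same pattern can be run on a small invented table to see the intermediate logical vector. In this sketch stops, its values, and the variable names are hypothetical; note that the comparison is case-sensitive, so the string must match the data exactly.

```r
# Hypothetical stand-in for trafficstops
stops <- data.frame(
  county_name = c("Webster County", "Jones County", "Webster County", "Hinds County"),
  driver_age = c(34, 51, 27, 45)
)

# The condition gives one TRUE/FALSE per row
is_webster <- stops$county_name == "Webster County"
print(is_webster)

# Use the logical vector in the row position; the empty column position keeps all columns
webster_stops <- stops[is_webster, ]

# Conditions can be combined with & (and) or | (or)
young_webster <- stops[is_webster & stops$driver_age < 30, ]
print(young_webster)
```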

CATEGORICAL DATA: FACTORS

• FACTORS ARE USED TO REPRESENT CATEGORICAL DATA.
FACTORS CAN BE ORDERED OR UNORDERED, AND
UNDERSTANDING THEM IS NECESSARY FOR STATISTICAL
ANALYSIS AND FOR PLOTTING.
• FACTORS ARE STORED AS INTEGERS, AND HAVE LABELS
(TEXT) ASSOCIATED WITH THESE UNIQUE INTEGERS.
• WHILE FACTORS LOOK (AND OFTEN BEHAVE) LIKE
CHARACTER VECTORS, THEY ARE ACTUALLY INTEGERS UNDER THE
HOOD, SO TAKE CARE WHEN TREATING THEM LIKE STRINGS.


• Once created, factors can only contain a pre-defined set of values,
known as levels. By default, R always sorts levels in alphabetical order.
For instance, if you have a factor with 2 levels:
• party <- factor(c("republican", "democrat", "democrat", "republican"))

• R will assign 1 to the level "democrat" and 2 to the level "republican"
(because d comes before r, even though the first element in this
vector is "republican").
• Check this by using the function levels(), and check the number of
levels using nlevels():
levels(party)
nlevels(party)


CONVERTING FACTORS

• To convert a factor to a character vector, use as.character(x):
• as.character(party)
• [Link](party)

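
Because factors are stored as integers, conversions deserve a quick check. The sketch below reuses the party factor from the slides; the year_f factor is a hypothetical addition that shows why a factor holding numbers should go through as.character() first.

```r
party <- factor(c("republican", "democrat", "democrat", "republican"))

as.character(party)   # the labels, as a character vector
as.integer(party)     # the underlying integer codes: 2 1 1 2

# Hypothetical factor whose labels look like numbers
year_f <- factor(c("1990", "1983", "1977"))

as.numeric(year_f)                 # level codes (3 2 1), not the years
as.numeric(as.character(year_f))   # the years themselves: 1990 1983 1977
```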


RENAMING FACTORS

• When your data is stored as a factor, you can use the plot() function to
get a quick glance at the number of observations represented by each
factor level. Let’s look at the number of Black and white drivers in the dataset:


• # coerce the column "driver_race" into a factor
• trafficstops$driver_race <- factor(trafficstops$driver_race)
• # bar plot of the number of Black and white drivers stopped:
• plot(trafficstops$driver_race)
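
Renaming the levels themselves is done by assigning to levels(). A minimal sketch, assuming a hypothetical factor whose levels are the short codes "B" and "W" (the codes and the values are invented for illustration):

```r
# Hypothetical factor with short level codes
driver_race <- factor(c("B", "W", "W", "B", "W"))
levels(driver_race)

# Replace each code with a readable label
levels(driver_race)[levels(driver_race) == "B"] <- "Black"
levels(driver_race)[levels(driver_race) == "W"] <- "White"

levels(driver_race)
table(driver_race)    # counts per renamed level
```

Only the labels change; the observations keep their level membership, so plot(driver_race) would show the same bars under the new names.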


WORKING WITH DATES

• THE as.Date() FUNCTION
• THIS FUNCTION ALLOWS US TO CREATE A DATE VALUE
(WITHOUT TIME) IN R PROGRAMMING. IT ALSO ACCEPTS
VARIOUS INPUT FORMATS OF THE DATE VALUE
THROUGH THE format = ARGUMENT.

• THE STANDARD DATE FORMAT IS “YYYY-MM-DD”


as.Date() FUNCTION

• as.Date() takes a date value as its argument.
• The date value is given as an input string, for example in the standard “YYYY-MM-DD” format.


EXAMPLE 2: as.Date() FUNCTION

• INPUT IS NOT IN THE STANDARD FORMAT
1. USE THE format = ARGUMENT TO DESCRIBE HOW THE INPUT IS ARRANGED, SO THAT
THE DATE VALUES CAN BE STORED AND PRESENTED IN THE STANDARD FORM.
• %d - DAY OF THE MONTH IN NUMBER FORMAT
• %m - MONTH IN NUMBER FORMAT
• %Y - YEAR IN THE “YYYY” FORMAT
• FOR A YEAR VALUE IN TWO DIGITS, USE “%y” INSTEAD OF “%Y”.
• To use a month name instead of a month number in the input value, we can
use %b (the abbreviated month name) in the format = argument of the
as.Date() function.
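
The format codes can be sketched as follows. The input strings are invented for illustration; the %b line assumes an English locale, since month names are locale-dependent.

```r
# Standard "YYYY-MM-DD" input needs no format argument
d1 <- as.Date("2023-01-15")

# Day/month/four-digit year
d2 <- as.Date("15/01/2023", format = "%d/%m/%Y")

# Two-digit year: lowercase %y
d3 <- as.Date("15-01-23", format = "%d-%m-%y")

# Abbreviated month name: %b (assumes an English locale)
d4 <- as.Date("15 Jan 2023", format = "%d %b %Y")

# All of them are stored and printed in the standard form
print(d1)
print(d2)
print(d3)

# format() goes the other way, from a Date to a string
format(d1, "%d/%m/%Y")
```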


USING THE Sys.Date(), Sys.time()
FUNCTIONS
IN R PROGRAMMING
1. Sys.Date() GIVES YOU THE SYSTEM DATE.
YOU DON’T NEED TO ADD AN ARGUMENT INSIDE THE
PARENTHESES OF THIS FUNCTION.
2. Sys.timezone() RETURNS THE TIMEZONE
BASED ON THE LOCATION AT WHICH THE USER IS RUNNING
THE CODE ON THE SYSTEM.
3. Sys.time() RETURNS THE
CURRENT DATE AS WELL AS THE TIME OF THE SYSTEM, WITH
THE TIMEZONE DETAILS.
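
A minimal sketch of the three calls; the variable names are illustrative and the printed values depend on the machine running the code.

```r
today <- Sys.Date()        # current date, class "Date"
zone <- Sys.timezone()     # name of the system time zone, e.g. "Asia/Kolkata"
right_now <- Sys.time()    # current date and time, class "POSIXct"

print(today)
print(zone)
print(right_now)

class(today)
class(right_now)
```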


USING THE LUBRIDATE
PACKAGE
• Install the package “lubridate”:
• install.packages("lubridate")
• now() gives us the current date, the current time, and the current
timezone details in a single call.
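
A minimal sketch, assuming lubridate is already installed (the install line is left as a comment so that the example does not reinstall the package on every run):

```r
# install.packages("lubridate")   # run once if the package is missing
library(lubridate)

current <- now()      # date, time and time zone in one call
print(current)

today()               # lubridate's counterpart for the date only
```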


EXTRACTION AND MANIPULATION
OF THE PARTS OF THE DATE

• ONCE THE “LUBRIDATE” PACKAGE IS LOADED, IT BECOMES EASIER TO
EXTRACT AND MANIPULATE PARTS OF
THE DATE VALUE.
• THERE ARE VARIOUS FUNCTIONS IN THE PACKAGE THAT
ALLOW US TO EXTRACT THE YEAR, MONTH, WEEK, ETC.
FROM THE DATE.


EXAMPLE CODE FOR EXTRACTION
OF DIFFERENT DATE COMPONENTS

• Create a date variable named “x”, which contains three different date values.
• The year() function extracts the year value of each element of
the vector.
• The month() function takes a single date value or a vector of dates
and extracts the months as numbers.
• What if we want the abbreviated name of each month? We
add the “label = TRUE” argument to the month() function and
get the month names in abbreviated form.
• If we use the “abbr = FALSE” argument in the month() function along
with “label = TRUE”, we get the full month names.
• To extract the days from the given date values, use
the mday() function. It returns the days of the month as numbers.
• The wday() function returns the weekdays as numbers by
default. When we add “label = TRUE” and “abbr =
FALSE” as arguments, it returns the full name of the weekday
on which each date falls.
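
The extraction steps above can be sketched as follows, assuming lubridate is installed. The three dates in x are invented for illustration, and the month and weekday names are printed in the language of the current locale.

```r
library(lubridate)

# Hypothetical vector with three date values
x <- as.Date(c("2020-03-15", "2021-07-04", "2022-12-25"))

year(x)                                  # 2020 2021 2022
month(x)                                 # 3 7 12
month(x, label = TRUE)                   # abbreviated month names
month(x, label = TRUE, abbr = FALSE)     # full month names
mday(x)                                  # 15 4 25
wday(x)                                  # weekdays as numbers
wday(x, label = TRUE, abbr = FALSE)      # full weekday names
```

With label = TRUE, month() and wday() return ordered factors rather than numbers, which keeps the months and weekdays in calendar order when plotting.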


EXAMPLE CODE WITH OUTPUT
FOR DATES MANIPULATION IN R
• We use the ymd() function on the given vector. This function converts
the date values from the vector into a Date format that is suitable for
manipulation.
• We can add or subtract years from each element of the vector,
much like adding or subtracting from a numeric vector.
The function used here is years().
• In the same way, we can use months() to add or subtract months
from each vector element.
• We can use the mday() function to update the day of the month for each date in
the given vector.
• The update() function combines all of these: it lets
you set the year, month, and day of each element of the given
vector in one call.
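
The manipulation steps above can be sketched as follows, assuming lubridate is installed. The dates and the new values are invented for illustration.

```r
library(lubridate)

# Parse year-month-day strings into Date values
x <- ymd(c("2020-03-15", "2021-07-04", "2022-12-25"))

# Add one year / subtract two months from every element
plus_year <- x + years(1)
minus_months <- x - months(2)

# Set the day of the month of every element to 1
first_days <- x
mday(first_days) <- 1

# Set several components in one call
moved <- update(x, year = 2025, month = 1, mday = 10)

print(plus_year)
print(minus_months)
print(first_days)
print(moved)
```

Adding months() to a day that does not exist in the target month (for example 31 January plus one month) gives NA, so such dates need separate handling.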




MOVIE RECOMMENDATION
SYSTEM
