BADSIS Lec 19-20 Sep 9 SR R Programming

INDIAN INSTITUTE OF TECHNOLOGY ROORKEE

Module 5
Lecture 19-20: Introduction to R Programming

Dr. Sudip Roy


Associate Professor, Department of Computer Science and Engineering
Indian Institute of Technology (IIT) Roorkee
Roorkee - 247667, Uttarakhand, India

Date: September 9, 2025


Introduction

R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing.

R is widely used by statisticians, data analysts and researchers for developing statistical software and data analysis.

The copyright for the primary source code for R is held by the R Foundation and is published under the GNU General Public License version 2.0.

It compiles and runs on a wide variety of UNIX platforms, Windows and Mac OS.

History of R

• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.

• Currently R is developed and maintained by the R Core Team.

• The applications of the R programming language include:
1. Statistical Computing
2. Machine Learning
3. Data Science

• R can be downloaded and installed from the CRAN (Comprehensive R Archive Network) website.

https://www.researchgate.net/publication/360246719_The_R_Language_An_Engine_for_Bioinformatics_and_Data_Science/figures?lo=1&utm_source=google&utm_medium=organic

Why R?

Research & Academic Adoption – New statistical methods and models often appear in R
first, due to its dominance in academia.
https://www.geeksforgeeks.org/r-language/r-programming-language-introduction/

Features of R

Fast Calculation – R can be used to perform complex mathematical and statistical calculations on data objects of a wide variety.

Extreme Compatibility – R is an interpreted language, which means it does not need a compiler to make a program from the code.

Open Source – R is an open-source software environment. You can make improvements and add packages for additional functionalities.

Cross Platform Support – R is machine-independent. It supports cross-platform operation and can be used on many different operating systems.

Wide Packages – CRAN houses more than 10,000 different packages and extensions that help solve all sorts of problems in data science.

Large Standard Library – R can produce static graphics with production-quality visualizations and has extended libraries providing interactive graphic capabilities.

Installation of R
Download RStudio and R from: https://posit.co/download/rstudio-desktop/

Installation of RStudio

Getting to know RStudio
RStudio
• RStudio is available in two editions:
• RStudio Desktop, where the program is run locally as a regular desktop application. Prepackaged distributions of RStudio Desktop are available for Windows, macOS, and Linux.
• RStudio Server, which runs on a remote server and is accessed through a web browser.

Interacting with RStudio –

• RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics.

• RStudio was founded by JJ Allaire, creator of the programming language ColdFusion.

• There are 4 main sections in the RStudio IDE:
• Code Editor
• Workspace and History
• R console
• Plots and Files

Components of RStudio

The Source pane is where you can edit and save R or Python scripts or author computational documents like Quarto and R Markdown.

The Environment pane displays temporary R objects as created during that R session.

The Console pane is used to write short interactive R commands.

The Output pane displays the plots, tables, or HTML outputs of executed code along with files saved to disk.

Basics of R

1 Data Types

2 Variables

3 Keywords

4 Operators

5 Data Structures

R Syntax Basics

A program in R is made up of three things: Variables, Comments, and Keywords.

• Variables are used to store the data.
• Comments are used to improve code readability.
• Keywords are reserved words that hold a specific meaning to the interpreter.

Variables
Variables, as in any other programming language, are names given to reserved memory locations that can store any type of data.
In R, the assignment can be denoted in three ways:

• = (Simple Assignment)
• <- (Leftward Assignment)
• -> (Rightward Assignment)
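
The three assignment forms listed above can be tried side by side. This is a minimal sketch; the variable names are illustrative only.

```r
# Three ways to assign a value in R
x = 10        # simple assignment
y <- 20       # leftward assignment (the form most R style guides prefer)
30 -> z       # rightward assignment

total <- x + y + z
print(total)
```

All three forms create an ordinary variable in the current environment, so they can be mixed freely, although sticking to one form keeps scripts easier to read.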

https://www.scaler.com/topics/images/variables-in-r-thumbnail.webp

R Syntax Basics cont’d.

Comments in R

• Comments are a way to improve your code's readability and are only meant for the user, so the interpreter ignores them.
• Only single-line comments are available in R. Single-line comments can be written by using # at the beginning of the statement.

Keywords in R

• Keywords are the words reserved by a program because they have a special meaning; thus a keyword can't be used as a variable name, function name, etc.
• Keywords can be viewed by using either help(reserved) or ?reserved.
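
A short sketch of both ideas follows. The names radius, area and outcome are illustrative, and the attempt to assign to a reserved word is wrapped in tryCatch() only so that the script keeps running after the error.

```r
# A comment on its own line: everything after '#' is ignored.
radius <- 2              # a comment can also follow code on the same line
area <- pi * radius^2

# There is no block-comment syntax, so longer notes
# simply repeat '#' on every line.

# Reserved words cannot be used as variable names.
outcome <- tryCatch(
  {
    eval(parse(text = "TRUE <- 0"))   # attempt to overwrite a keyword
    "assignment succeeded"
  },
  error = function(e) "assignment rejected"
)
print(outcome)

# In an interactive session, help(reserved) or ?reserved lists all keywords.
```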

https://i.sstatic.net/Dw6Ln.png https://cdn.discuss.boardinfinity.com/original/2X/a/a8973dae9446c9a77f12edcf2f3c7c2e74b4534f.png

Datatypes

Basic data types in R can be divided into the following types:


•numeric - (10.5, 55, 787)
•integer - (1L, 55L, 100L, where the letter "L" declares this as an integer)
•complex - (9 + 3i, where "i" is the imaginary part)
•character (a.k.a. string) - ("k", "R is exciting", "FALSE", "11.5")
•logical (a.k.a. boolean) - (TRUE or FALSE)

We can use the class() function to check the data type of a variable:
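
As a minimal sketch, class() can be applied to one value of each basic type from the list above (the values are the ones used on the slide):

```r
# One example value per basic data type
values <- list(10.5, 55L, 9 + 3i, "R is exciting", TRUE)

# class() reports the data type of each value
types <- sapply(values, class)
print(types)

# Without the L suffix a whole number is stored as numeric, not integer
print(class(55))
```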

Variables and Keywords

Nomenclature for R variable naming

• Valid characters: A variable name can include letters (a-z, A-Z), numbers (0-9), dots (.), and underscores (_).
• No Special Characters: Only dots (.) and underscores (_) are allowed. Other special characters like $ or # are not permitted.
• Starting Characters: A variable name can start with a letter or a dot (.).
• Start without Number or Underscore: A variable name cannot begin with a number or an underscore.
• Dot Before Number: If a variable name starts with a dot (.), the character following the dot cannot be a number.
• Avoid Reserved Keywords: A variable name cannot be the same as a reserved keyword in R, such as TRUE, FALSE, NA, etc.
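
One way to check these rules, sketched here under the assumption that base R's make.names() is an acceptable test: a name is syntactically valid when make.names() returns it unchanged. The helper is_syntactic() and the candidate names are hypothetical and not part of the lecture.

```r
# Hypothetical helper: TRUE when the name already follows R's naming rules
is_syntactic <- function(name) make.names(name) == name

candidates <- c("my.var_1", ".hidden", "2fast", "_tmp", ".2way", "TRUE", "cost$")
validity <- setNames(is_syntactic(candidates), candidates)
print(validity)
```

Only the first two candidates pass. The others break the rules on starting characters, the dot-before-number rule, reserved keywords, and special characters respectively.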

Operators

R divides the operators in the following groups:

Arithmetic Operators
• + Addition
• - Subtraction
• * Multiplication
• / Division
• ^ Exponent
• %% Modulus
• %/% Integer Division

Assignment Operators
• <-
• <<-
• ->
• ->>

Comparison Operators
• == Equal
• != Not equal
• > Greater than
• < Less than
• >= Greater than or equal to
• <= Less than or equal to

Logical Operators
• & Element-wise Logical AND operator
• && Logical AND operator
• | Element-wise Logical OR operator
• || Logical OR operator
• ! Logical NOT

Miscellaneous Operators
• : Creates a series of numbers in a sequence
• %in% Find out if an element belongs to a vector
• %*% Matrix Multiplication

Operators cont’d.
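
The original code screenshots for this slide are not available, so the following is a stand-in sketch with illustrative values (a = 17, b = 5) that exercises one operator from each group.

```r
a <- 17
b <- 5

# Arithmetic operators
arithmetic <- c(add = a + b, sub = a - b, mul = a * b, div = a / b,
                pow = a^2, mod = a %% b, intdiv = a %/% b)
print(arithmetic)

# Comparison operators return logical values
comparison <- c(equal = a == b, not_equal = a != b, greater = a > b, less_eq = a <= b)
print(comparison)

# Logical operators: & works element-wise, && on single values
elementwise <- c(TRUE, FALSE) & c(TRUE, TRUE)
single <- (a > 10) && (b > 10)

# Miscellaneous operators
sequence <- 1:5                     # series of numbers
is_member <- 3 %in% sequence        # membership test
m <- matrix(1:4, nrow = 2)
product <- m %*% diag(2)            # matrix multiplication with the identity
```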
Conditional Statements
If Statement
An "if statement" is written with the if keyword, and it is used to specify a block of code to be executed if
a condition is TRUE

Else If
The else if keyword is R's way of saying "if the previous conditions were not true, then try this
condition":

Else
The else keyword catches anything which isn't caught by the preceding conditions
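
A minimal sketch combining the three keywords; the temperature thresholds and labels are illustrative assumptions.

```r
temp_c <- 22

if (temp_c > 30) {
  label <- "hot"          # runs only when the first condition is TRUE
} else if (temp_c > 15) {
  label <- "mild"         # tried only when the previous condition was FALSE
} else {
  label <- "cold"         # catches everything not caught above
}

print(label)
```

Note that else must appear on the same line as the closing brace of the preceding block, otherwise R treats the if statement as already complete.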

Loops

For loop
The for loop is used when we know the exact number of iterations required. It iterates over a sequence such as a vector, list or numeric range.
Syntax
for (value in sequence)
{
  statement
}

While loop
The while loop runs as long as a specified condition holds TRUE. It is useful when the number of iterations is unknown beforehand.
Syntax
while ( condition )
{
  statement
}

Repeat loop
The repeat loop executes indefinitely until explicitly stopped using the break statement. To terminate the repeat loop, we use a jump statement, that is, the break keyword.
Syntax
repeat
{
  statement
  if( condition )
  {
    break
  }
}

https://www.geeksforgeeks.org/r-language/loops-in-r-for-while-repeat/

Loops cont’d.
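
The slide's code screenshots are not available; the sketch below is a stand-in showing the three loop types with illustrative variable names.

```r
# for loop: the number of iterations is known in advance
squares <- numeric(0)
for (i in 1:5) {
  squares <- c(squares, i^2)
}
print(squares)

# while loop: runs as long as the condition holds
n <- 1
while (n < 100) {
  n <- n * 2
}
print(n)

# repeat loop: runs until break is reached
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) {
    break
  }
}
print(count)
```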

Functions in R
A function is a block of code which only runs when it is called. We can pass data, known as parameters, into a function.
• A function can return data as a result.
• To create a function, use the function keyword.
There are several ways we can pass the arguments to the function:

• Case 1: Arguments are passed to the function in the same order as in the function definition.

• Case 2: We can pass the arguments using the names of the arguments in any order.

• Case 3: If the arguments are not passed, the default values are used to execute the function.
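
The three cases can be sketched with one small function. The function power() and its default values are hypothetical, chosen only so that argument order visibly matters.

```r
# Hypothetical function with default values for both arguments
power <- function(base = 2, exponent = 2) {
  base^exponent
}

case1 <- power(3, 2)                    # Case 1: positional, same order as the definition
case2 <- power(exponent = 3, base = 2)  # Case 2: named, in any order
case3 <- power()                        # Case 3: defaults are used

print(c(case1, case2, case3))
```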
Built-in functions

Built-in Function: Built-in functions in R are pre-defined functions that are available in the R programming language to perform common tasks or operations.

Numeric functions
• abs(x) – absolute value
• sqrt(x) – square root
• log(x) – logarithm
• round(x, digits = n) – rounding off

Character functions
• strsplit(x, split) – split the elements of character vector x
• toupper(x) – upper case
• paste(..., sep = "") – concatenates strings after using sep string to separate them

Statistical Functions
• mean(x, trim = 0, na.rm = FALSE) – mean of object x
• scale(x, center = TRUE, scale = TRUE) – column center or standardize a matrix
• dnorm(x) – normal density function (by default m=0 sd=1)

Others
• seq(from, to, by) – generate a sequence
• rep(x, ntimes) – repeat x n times
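
A brief sketch calling each of the functions listed above on illustrative inputs. Note that R is case-sensitive, so the rounding function is round(), in lower case.

```r
# Numeric functions
abs_value <- abs(-3.5)
root <- sqrt(16)
log10_value <- log(100, base = 10)
rounded <- round(3.14159, digits = 2)

# Character functions
words <- strsplit("data science with R", split = " ")[[1]]
upper <- toupper("r")
joined <- paste("Lecture", "19-20", sep = " ")

# Statistical functions
scores <- c(10, 15, NA, 20)
avg <- mean(scores, na.rm = TRUE)          # NA is dropped before averaging
standardized <- scale(matrix(1:6, ncol = 2))
density_at_zero <- dnorm(0)

# Others
steps <- seq(from = 1, to = 10, by = 3)
repeated <- rep("x", times = 3)

print(list(rounded, words, joined, avg, steps, repeated))
```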

User defined functions
• User-defined functions are the functions that are created by the user.
• The user defines the working, parameters, default parameters, etc. of that user-defined function.
• They can be used only in the code (script or session) in which they are defined.
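
As one possible illustration (the function name, its digits parameter and the test values are assumptions, not taken from the lecture), a user-defined function might look like this:

```r
# Hypothetical user-defined function: converts Fahrenheit to Celsius.
# 'digits' is a default parameter that the caller may override.
fahrenheit_to_celsius <- function(temp_f, digits = 1) {
  celsius <- (temp_f - 32) * 5 / 9
  round(celsius, digits)     # the last evaluated expression is returned
}

boiling <- fahrenheit_to_celsius(212)
body_temp <- fahrenheit_to_celsius(98.6, digits = 0)
print(c(boiling, body_temp))
```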
Types of Data Structures in R

A data structure is a particular way of organizing data in a computer so that it can be used effectively.
• R's base data structures are often organized by their dimensionality (1D, 2D or nD) and by whether they are:
• homogeneous (all elements must be of the identical type), or
• heterogeneous (the elements are often of various types).

There are six data structures:

• Vectors
• Lists
• Data Frames
• Matrices
• Arrays
• Factors

Vectors

• A vector is an ordered collection of basic data types of a given length.

• The key point is that all the elements of a vector must be of the identical data type, i.e., vectors are homogeneous data structures.

• Vectors are one-dimensional data structures.
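
A minimal sketch of vector creation, indexing and homogeneity, using illustrative values:

```r
# Create a numeric vector with c()
marks <- c(72, 85, 90)

n_items <- length(marks)   # number of elements
second <- marks[2]         # indexing starts at 1
doubled <- marks * 2       # arithmetic is applied element-wise

# Mixing types forces coercion to a single common type
mixed <- c(1, "a", TRUE)
mixed_type <- class(mixed)

print(doubled)
print(mixed_type)
```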

Lists

• A list is a generic object consisting of an ordered collection of objects.

• Lists are heterogeneous data structures. These are also one-dimensional data structures.

• A list can hold vectors, matrices, characters, functions and so on.
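
A short sketch of a heterogeneous list; the element names and values are hypothetical.

```r
# A list can mix a string, a numeric vector, a logical and even a function
student <- list(
  name = "Asha",
  marks = c(72, 85, 90),
  passed = TRUE,
  average = function(m) mean(m)
)

student_name <- student$name               # access by name
second_mark <- student[["marks"]][2]       # access an element inside a component
avg_mark <- student$average(student$marks) # call the stored function

print(sapply(student, class))
```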

Data Frames

Data frames are used to store tabular data. They are two-dimensional, heterogeneous data structures. These are lists of vectors of equal lengths.
Data frames have the following constraints placed upon them:
• A data frame must have column names, and every row should have a unique name.

• Each column must have the identical number of items.

• Each item in a single column must be of the same data type.

• Different columns may have different data types.
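
A minimal sketch of these constraints with made-up data; stringsAsFactors = FALSE is passed explicitly so the result does not depend on the R version.

```r
# Three columns of equal length, each with its own data type
students <- data.frame(
  name = c("Asha", "Ravi", "Meera"),
  age = c(21L, 22L, 20L),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

column_types <- sapply(students, class)
passed_names <- students[students$passed, "name"]   # filter rows, pick a column

print(students)
print(column_types)
```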

Matrices

• A matrix is a rectangular arrangement of numbers in rows and columns.
• Matrices are two-dimensional, homogeneous data structures.
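
A brief sketch with illustrative values. By default matrix() fills column by column; byrow = TRUE fills row by row instead.

```r
# 2 x 3 matrix filled column by column (the default)
m <- matrix(1:6, nrow = 2, ncol = 3)

# Same values filled row by row
m_byrow <- matrix(1:6, nrow = 2, byrow = TRUE)

shape <- dim(m)          # rows, columns
corner <- m[2, 3]        # element in row 2, column 3
transposed <- t(m)       # 3 x 2 matrix

print(m)
print(m_byrow)
```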

Arrays

• Arrays are the R data objects which store the data in more than two dimensions.
• Arrays are n-dimensional data structures. An array of dimensions (2, 3, 3) creates 3
rectangular matrices each with 2 rows and 3 columns.
• They are homogeneous data structures.
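
The (2, 3, 3) example from the bullet above can be sketched as follows, with the values 1 to 18 used as filler:

```r
# Array with 2 rows, 3 columns and 3 layers
arr <- array(1:18, dim = c(2, 3, 3))

first_layer <- arr[, , 1]     # a 2 x 3 matrix
last_value <- arr[2, 3, 3]    # row 2, column 3, layer 3

print(dim(arr))
print(first_layer)
```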

Factors

• Factors are the data objects which are used to categorize the data and store it as levels. They are useful for storing categorical data.
• They can store both strings and integers.
• They are useful to categorize unique values in columns like ("TRUE" or "FALSE") or ("MALE" or "FEMALE"), etc.
• They are useful in data analysis for statistical modeling.
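
A minimal sketch of factors using the MALE/FEMALE example from the slide; the ordered factor at the end is an added illustration with assumed level names.

```r
gender <- factor(c("MALE", "FEMALE", "FEMALE", "MALE", "FEMALE"))

gender_levels <- levels(gender)   # the distinct categories, sorted alphabetically
counts <- table(gender)           # frequency of each level
codes <- as.integer(gender)       # the integer codes stored internally

# An ordered factor keeps a meaningful ranking between levels
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"), ordered = TRUE)
low_below_high <- sizes[1] < sizes[2]

print(counts)
```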

Installing Packages

Reading Excel files in R
To read Excel files in R we use the readxl package.
The openxlsx package includes the ability to write to XLSX files.

• Another method for installing packages in R
• Calling the library after installation
• Setting the path to the Excel file
• Read the Excel file from the path and assign it to a variable as a data frame
• Read a particular sheet from the Excel workbook
• Display the data frame

Reading Excel files in R
• Reading the Excel file by skipping the first two rows
• Reading the Excel file up to 1000 rows
• Reading the Excel file from 3rd row column A to 10th row column E
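
The original code screenshots are not available, so the following is a sketch of how these steps might look with readxl. It assumes the readxl package is installed and, in place of the lecture's own workbook, uses the sample file datasets.xlsx that ships with readxl (located through readxl_example()).

```r
# install.packages("readxl")     # one-time installation from CRAN
library(readxl)

# Path to the Excel file (a bundled sample stands in for your own file)
path <- readxl_example("datasets.xlsx")

df <- read_excel(path)                          # first sheet, as a data frame (tibble)
cars <- read_excel(path, sheet = "mtcars")      # a particular sheet, selected by name
skipped <- read_excel(path, skip = 2)           # skip the first two rows
limited <- read_excel(path, n_max = 1000)       # read at most 1000 data rows
block <- read_excel(path, range = "A3:E10",     # cells A3 to E10 only
                    col_names = FALSE)

print(head(df))
```

With col_names = FALSE the selected block is returned as data only, so readxl generates placeholder column names for it.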

Reading Excel files in R

• Assigning column names to a list
• Reading data that has no header row by setting the col_names to a character vector. Skip is used to remove the header row
• Obtaining the column types
• Overriding the column types
• Turning space separated header names into syntactic R variables with the .name_repair = "universal" argument.
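
A possible sketch of these options, again assuming readxl is installed and using its bundled sample workbooks (datasets.xlsx and deaths.xlsx) rather than the lecture's file. The replacement column names are hypothetical.

```r
library(readxl)

path <- readxl_example("datasets.xlsx")

# Supply our own column names and skip the original header row
new_names <- c("sepal_len", "sepal_wid", "petal_len", "petal_wid", "species")
renamed <- read_excel(path, sheet = "iris", col_names = new_names, skip = 1)

# Inspect the column types that readxl guessed
guessed_types <- sapply(renamed, class)
print(guessed_types)

# Override the guessed types explicitly
typed <- read_excel(path, sheet = "iris",
                    col_types = c("numeric", "numeric", "numeric", "numeric", "text"))

# Headers containing spaces are turned into syntactic names
repaired <- read_excel(readxl_example("deaths.xlsx"), range = "arts!A5:F15",
                       .name_repair = "universal")
print(names(repaired))
```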

Reading csv files in R
• read.csv() function is used to read "comma separated value" files.
• It imports data in the form of a data frame.
• The read.csv() function also accepts a number of optional arguments that we can use to modify the import
procedure.
• We can choose to treat the first row as column names, select the delimiter character, and more.

Displays the first 5 rows of the dataset
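
A self-contained sketch: because the lecture's CSV file is not available, the built-in iris data set is first written to a temporary CSV file and then read back with read.csv().

```r
# Create a sample CSV file to read (stands in for your own file)
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# header = TRUE treats the first row as column names; sep sets the delimiter
flowers <- read.csv(csv_path, header = TRUE, sep = ",")

# Display the first 5 rows of the dataset
first_rows <- head(flowers, 5)
print(first_rows)
```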

Data Manipulation in R

• Data manipulation in R involves cleaning, transforming, and organizing data to make it suitable for analysis.
• It includes tasks like selecting, filtering, sorting, and creating new variables.

Function Name – Description

filter() – Produces a subset of the rows of a Data Frame.

distinct() – Removes duplicate rows in a Data Frame.

arrange() – Reorders the rows of a Data Frame.

select() – Keeps only the required columns of a Data Frame.

rename() – Renames the variables (columns).

mutate() – Creates new variables without dropping old ones.

transmute() – Creates new variables and drops the old ones.

summarize() – Gives summarized data like Average, Sum, etc.
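
These verbs come from the dplyr package (see the source link on this slide). The sketch below assumes dplyr is installed and chains several verbs on the built-in mtcars data; group_by(), which is not in the table, is added so that summarize() produces one row per group.

```r
library(dplyr)

# Four-cylinder cars, a few columns, one renamed, one derived, sorted by mpg
result <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, hp, wt) %>%
  rename(horsepower = hp) %>%
  mutate(power_to_weight = horsepower / wt) %>%
  arrange(desc(mpg))

# Average mpg and number of cars per cylinder class
summary_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg), n = n())

# Unique gear values, and a new variable that replaces the old columns
gears <- distinct(mtcars, gear)
kpl_only <- transmute(mtcars, kpl = mpg * 0.425)

print(head(result))
print(summary_by_cyl)
```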

https://www.geeksforgeeks.org/r-language/data-manipulation-in-r-with-dplyr-package/
tidyr in R

• The tidyr package simplifies the process of creating tidy data.
• Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse.
• gather() function: It takes multiple columns and gathers them into key-value pairs.

# using gather() function on tidy_dataframe
long <- tidy_dataframe %>%
  gather(Group, Frequency, Group.1:Group.3)
# print the data frame in a long format
long
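
The slide's tidy_dataframe is not shown in the text, so the sketch below builds a hypothetical one with the same column names (Group.1 to Group.3) and made-up values. It assumes tidyr is installed. The tidyr documentation marks gather() as superseded by pivot_longer(), so the equivalent call is included for comparison.

```r
library(tidyr)

# Hypothetical wide data frame: one column per group
tidy_dataframe <- data.frame(
  S.No = 1:3,
  Group.1 = c(23, 45, 56),
  Group.2 = c(17, 30, 41),
  Group.3 = c(9, 12, 20)
)

# gather(): collapse the three group columns into key-value pairs
long <- gather(tidy_dataframe, Group, Frequency, Group.1:Group.3)

# pivot_longer(): the newer interface for the same reshaping
long_alt <- pivot_longer(tidy_dataframe, cols = Group.1:Group.3,
                         names_to = "Group", values_to = "Frequency")

print(long)
```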

tidyr in R
• full_seq() function: It fills in the missing values in a vector which should have been observed but weren't.
• The vector should be numeric.

# import the tidyr package
library(tidyr)
# creating a numeric vector
num_vec <- c(1, 7, 9, 14, 19, 20)
# use full_seq() to fill missing values in num_vec
full_seq(num_vec, 1)

• fill() function: Used to fill in the missing values in selected columns using the previous entry.
• Missing values are replaced in atomic vectors; NULLs are replaced in lists.

# import the tidyr package
library(tidyr)
df <- data.frame(Month = 1:6, Year = c(2000, rep(NA, 5)))
# print the df data frame
df # Output (i)
# use fill() to fill missing values in the Year column of the df data frame
df %>% fill(Year) # Output (ii)

tidyr in R
• drop_na() function: This function drops rows containing missing values.

# import the tidyr and tibble packages
library(tidyr)
library(tibble)
# create a tibble df with missing values
df <- tibble(S.No = c(1:10), Name = c('John', 'Smith', 'Peter', 'Luke', 'King', rep(NA, 5)))
# print df tibble
df # Output (i)
# use drop_na() to drop the rows in which Name is missing
df %>% drop_na(Name) # Output (ii)

• replace_na() function: It replaces missing values.

df <- data.frame(S.No = c(1:10), Name = c('John', 'Smith', 'Peter', 'Luke', 'King', rep(NA, 5)))
df # Output (i)
# use replace_na() to replace missing values (NA)
df %>% replace_na(list(Name = 'Henry')) # Output (ii)

Descriptive statistics in R
Descriptive statistics are techniques used to summarize and find the key characteristics of a dataset.
A dataset can be summarized in two ways:
1. Quantitative Descriptive Statistics
2. Graphical Descriptive Statistics

Quantitative Analysis
• Mean
• Median
• Mode
• Variance
• Standard Deviation
• Quartiles
• Interquartile Range

Graphical Analysis
• Histogram
• Boxplot
• Scatter Plot
• Q-Q Plot
• Line Plot
• Correlation Plot
• Density Plot

Quantitative Descriptive Statistics
• Minimum value in Sepal Length column
• Maximum value in Sepal Length column
• Mean value of Sepal Length column
• Median value of Sepal Length column
• 1st Quartile of Sepal Length column
• 2nd Quartile of Sepal Length column
• 3rd Quartile of Sepal Length column
• Standard deviation of Sepal Length column
• Variance of Sepal Length column
• Quick summary of the data, including the minimum, 1st quartile, median, 3rd quartile and maximum.
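
The code screenshots are not available; the sketch below computes the listed statistics with base R on the Sepal.Length column of the built-in iris data set, which is assumed to be the data the slide refers to.

```r
sl <- iris$Sepal.Length

stats <- c(
  minimum = min(sl),
  maximum = max(sl),
  mean = mean(sl),
  median = median(sl),
  std_dev = sd(sl),
  variance = var(sl)
)

# 1st, 2nd and 3rd quartiles, and the interquartile range
quartiles <- quantile(sl, probs = c(0.25, 0.50, 0.75))
iqr <- IQR(sl)

# Base R has no function for the statistical mode; one common workaround
# is to take the most frequent value from a frequency table
most_frequent <- names(which.max(table(sl)))

# Quick summary of the column
print(summary(sl))
print(stats)
print(quartiles)
```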

Quantitative Descriptive Statistics cont’d.

• Descriptive summary of all the columns of the dataset
• Descriptive summary of each species of the dataset
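
One way to produce both summaries with base R, assuming the dataset is the built-in iris data. The use of by() and aggregate() is a choice made for this sketch; the lecture's own code is not shown in the text.

```r
# Summary of every column
overall <- summary(iris)
print(overall)

# Summary of the four measurements, computed separately for each species
by_species <- by(iris[, 1:4], iris$Species, summary)
print(by_species)

# Compact alternative: mean of each measurement per species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```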

Graphical Descriptive Statistics

1. Histogram
• Visualize data distribution, check for skewness and outliers.

2. Boxplot
• Compare distributions, detect outliers and analyze spread.

Graphical Descriptive Statistics cont'd.

3. Scatter Plot
• Explore relationships or correlations between two variables.

4. Q-Q Plot
• Check if data follows a specific distribution (e.g., normality).

Graphical Descriptive Statistics cont'd.

5. Line Plot
• Visualize trends or changes over time or sequence.

6. Correlation Plot
• Assess pairwise correlations between multiple variables.
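
The plots themselves are not reproduced in the text. As a sketch, each plot type can be drawn with base R graphics on the built-in iris data; the choice of columns and titles is an assumption made for illustration.

```r
sl <- iris$Sepal.Length

# 1. Histogram: distribution of one variable
h <- hist(sl, main = "Histogram of Sepal Length", xlab = "Sepal Length")

# 2. Boxplot: compare the distribution across species
boxplot(Sepal.Length ~ Species, data = iris, main = "Sepal Length by Species")

# 3. Scatter plot: relationship between two variables
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal Length", ylab = "Petal Length", main = "Scatter Plot")

# 4. Q-Q plot: compare the data with a normal distribution
qqnorm(sl)
qqline(sl)

# 5. Line plot: values in sequence (row order)
plot(sl, type = "l", xlab = "Observation", ylab = "Sepal Length", main = "Line Plot")

# 6. Correlation plot: pairwise relationships between the numeric columns
cor_matrix <- cor(iris[, 1:4])
pairs(iris[, 1:4], main = "Pairwise Scatterplots")
print(round(cor_matrix, 2))
```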

ggplot2 in R
• ggplot2 is a popular data visualization package in the R programming language.
• ggplot2 is a versatile R graphics library that allows for customization of graphics by adding layers.
• It simplifies creating ready-to-publish charts and includes themes for personalizing charts, allowing for changes in colors, line types, typefaces, and alignment.

Data: The raw data that you want to plot.

Geometries geom_: The geometric shapes used to visualize the data.

Aesthetics aes(): Map variables to visual properties such as position, size, shape and color.

Scales scale_: Control how data values are mapped to the values of each aesthetic in the plot.

Statistical transformations stat_: Compute summaries of the data (e.g., counts or means) to be displayed in the plot.

Coordinate system coord_: Defines how positions are mapped onto the plane of the plot.

Facets facet_: A grid of plots is displayed for groups of data.

Visual themes theme(): Control the non-data visual elements of a plot.
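
A sketch that uses one function from each component family in a single plot. It assumes ggplot2 is installed; the particular palette, axis limits and theme are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +  # data + aesthetics
  geom_point() +                                               # geometry
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +    # statistical transformation
  scale_colour_brewer(palette = "Dark2") +                     # scale
  coord_cartesian(xlim = c(4, 8)) +                            # coordinate system
  facet_wrap(~ Species) +                                      # facets
  theme_minimal()                                              # theme

print(p)
```

Each + adds one layer or setting to the plot object, which is only drawn when it is printed.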

ggplot2 in R
• Plotting a scatter plot using ggplot(). To add the geom layer, the addition (+) operator is used.

library(ggplot2)
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length)) + geom_point()

• Plotting a scatter plot using ggplot(). A colour parameter is applied to the aesthetics. The colour is set to Species.

ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species)) + geom_point()

ggplot2 in R
• Plotting a scatter plot using ggplot(). Different species are denoted by different shapes.

ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species, shape=Species)) + geom_point()

• Plotting a scatter plot using ggplot(). A smoothed curve is made to pass through each set.

ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, col=Species)) + geom_point() + geom_smooth(se = FALSE)

ggplot2 in R
• Plotting a bar plot using ggplot().

ggplot(mtcars, aes(x = gear)) + geom_bar()

• A scatter plot of horsepower vs mpg is made, and then stat_summary() is used to draw the mean.

# fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
ggplot(mtcars, aes(hp, mpg)) + geom_point(color = "red") +
  stat_summary(fun = "mean", geom = "line", linetype = "dotted")

ggplot2 in R
• Plotting a histogram using ggplot() with mpg as the x axis.

ggplot(mtcars, aes(x=mpg)) + geom_histogram()

• Plotting a boxplot using ggplot(). cyl is a categorical variable, and one boxplot is generated for each of its 3 classes.

ggplot(mtcars, aes(x=as.factor(cyl), y=mpg)) + geom_boxplot()

ggplot2 in R
• Plotting a boxplot with colours.

mtcars$cyl <- as.factor(mtcars$cyl)
ggplot(mtcars, aes(x = cyl, y = mpg, fill = cyl)) + geom_boxplot() +
  scale_fill_manual(values = c("lightblue", "lightpink", "lightgreen"))

• Plotting a violin plot using ggplot().

# cyl is already a factor here because of the conversion above
ggplot(mtcars, aes(factor(cyl), mpg)) + geom_violin(aes(fill = cyl)) +
  scale_fill_manual(values = c("lightblue", "lightyellow", "lightgreen"))

ggplot2 in R
• Plotting a pie chart using ggplot() with the 3 classes of 'cyl'.

# assumes mtcars$cyl was converted to a factor, as on the boxplot slide
ggplot(mtcars, aes(x="", y=mpg, fill=cyl)) +
  geom_bar(stat="identity", width=1) + coord_polar("y", start=0)

• 2D density contour plot between horsepower and mpg generated using ggplot().

ggplot(mtcars, aes(mpg, hp)) +
  geom_density_2d_filled(show.legend = FALSE) +
  coord_cartesian(expand = FALSE) + labs(x = "mpg")

ggplot2 in R

• 'GGally' extends 'ggplot2' to reduce the complexity of combining geometric objects with transformed data. 'ggpairs' is used to build a scatterplot matrix.

• Scatterplots of each pair are visualized on the left side of the plot.

• Pearson correlation value and significance are displayed on the right side.

library(GGally)
# assumes mtcars$cyl was converted to a factor, as on the boxplot slide
ggpairs(mtcars, columns = 1:4, aes(color = cyl, alpha = 0.5))

ggplot2 in R

• A correlogram, or a correlation matrix, is used to find the relationship between each pair of numeric variables in a dataset.

• It provides a high-level summary of the entire dataset, offering insights into the strength and direction of the relationships.

• This visual representation is particularly useful for exploratory data analysis, aiding in the identification of potential patterns or trends among variables.

library(ggcorrplot)
# reload the original mtcars so that all columns are numeric
data(mtcars)
corr <- round(cor(mtcars), 1)
ggcorrplot(corr, hc.order = TRUE,
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           method = "circle",
           colors = c("blue", "white", "tomato2"),
           title = "Correlogram of mtcars",
           ggtheme = theme_bw)

Conclusions:
R provides a simple and consistent syntax for data manipulation, statistical analysis, and model
building.

Packages like ggplot2 and plotly enable powerful graphical visualizations, making data exploration
intuitive and visually appealing.

Vectorization and in-built functions in R make coding efficient and concise.

Data handling is seamless with tools like dplyr, tidyr, and data.table for preprocessing and
wrangling.

Machine learning workflows in R are supported through packages such as caret, mlr3, and
tidymodels.

R integrates statistical rigor with ML algorithms, offering regression, classification, clustering, and
ensemble methods.

Visualization and ML models can be combined for effective interpretation and decision-making.

R remains a versatile environment for both beginners and advanced users, bridging the gap
between data science, statistics, and AI/ML.

Thank You!
