R Language Notes

R is a programming language primarily used for statistical computing and data analysis, developed by Ross Ihaka and Robert Gentleman. It is open-source, offering extensive data handling capabilities, statistical functions, and strong graphical capabilities, making it popular among statisticians and data scientists. The document also outlines the installation process, data types, data structures, package management, and various operators in R.


UNIT – 1

Meaning of R Language
R is a programming language and software environment used mainly for statistical computing, data
analysis, and graphical representation. It was developed by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand. R is widely used by statisticians, data scientists, and researchers for
analyzing data, creating visualizations, and performing machine learning and predictive modeling tasks.
It is an open-source language, meaning anyone can freely use and modify it. R provides a wide range of
built-in functions and packages that simplify complex data operations.
Features of R Language
1. Open Source and Free
R is completely free to download and use. It is open-source software, so users can modify and enhance its
code according to their needs without any licensing fees.
2. Data Handling and Storage
R offers excellent data handling capabilities. It can easily manage vectors, lists, matrices, and data frames,
making it ideal for structured and unstructured data analysis.
3. Statistical and Mathematical Functions
R provides a vast library of statistical tools such as mean, median, standard deviation, regression models,
hypothesis testing, and time-series analysis. This makes it a powerful tool for research and analytics.
4. Graphical Capabilities
R has strong visualization features. It can create a wide variety of graphs like histograms, bar charts, scatter
plots, and advanced 3D visualizations using libraries like ggplot2.
5. Extensible with Packages
Thousands of packages are available in CRAN (Comprehensive R Archive Network) that extend R’s
capabilities for specialized tasks such as data mining, machine learning, bioinformatics, and finance.
6. Cross-Platform Compatibility
R can run on different operating systems like Windows, macOS, and Linux, ensuring flexibility and
portability for users.
7. Integration with Other Languages
R can integrate with other programming languages like C, C++, Java, and Python, allowing developers to
use R alongside other tools.
8. Active Community Support
R has a large and active community of developers, researchers, and data analysts who contribute packages,
provide tutorials, and assist users through online forums and documentation.
Advantages of R Language (5 Points)
1. Open Source and Free – R is freely available to everyone, making it cost-effective for students,
researchers, and professionals.
2. Strong Statistical and Analytical Support – It offers powerful tools for statistical analysis, data
manipulation, and advanced modeling.
3. Excellent Data Visualization – R provides high-quality graphical tools to create charts, plots, and
interactive visualizations.
4. Extensive Package Collection – Thousands of packages are available in CRAN for various fields
like finance, machine learning, and bioinformatics.
5. Cross-Platform and Community Support – R works on all major operating systems and has a
large, active global community for support and updates.
Purpose of R Language
The main purpose of R Language is to perform data analysis, statistical computing, and graphical
visualization. It is designed to help researchers, data analysts, and statisticians:
• Collect, clean, and organize data efficiently.
• Analyze data using statistical and mathematical methods.
• Visualize results through graphs and charts.
• Build predictive models using machine learning algorithms.
• Support research and academic work in data-driven fields.
Step-by-Step Installation Process of R and RStudio
Step 1: Download R
1. Go to the official R website: https://cran.r-project.org
2. Select your operating system (Windows / macOS / Linux).
3. For Windows, click "Download R for Windows" → "base" → then click "Download R-x.x.x for
Windows" (the latest version).
Step 2: Install R
1. Once the file is downloaded, open it.
2. Click Next to proceed through the setup wizard.
3. Choose your installation path (default is fine).
4. Select components (keep all selected).
5. Click Next until installation completes.
6. Click Finish to close the setup window.
Step 3: Download RStudio
1. Visit https://posit.co/download/rstudio/ (the Posit website; the company was formerly called RStudio).
2. Click “Download RStudio Desktop” (Free version).
3. Choose your operating system and download the installer file.
Step 4: Install RStudio
1. Open the downloaded RStudio installer file.
2. Follow the on-screen instructions (Next → Install → Finish).
3. Once installed, open RStudio. It will automatically detect your R installation.
Step 5: Verify Installation
1. Open RStudio.
2. In the Console window, type the following command and press Enter:
version
This will display your installed R version details, confirming successful installation.
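Two other built-in commands report the same information and are convenient inside scripts as well:

```r
# Alternative ways to check the installed R version from the console
getRversion()      # a version object, e.g. 4.4.1
R.version.string   # a one-line character summary, e.g. "R version 4.4.1 (...)"
```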
Different Data Types in R Language
R supports several basic data types that define the kind of data a variable can store.
These are: Numeric, Integer, Character, Logical, Complex, and Raw.
1. Numeric Data Type
Numeric type represents decimal or real numbers. It is used for most mathematical operations.
Example Code:
# Numeric data type example
x <- 10.5
y <- 20
class(x) # Output: "numeric"
class(y) # Output: "numeric"
2. Integer Data Type
Integer type represents whole numbers without decimals.
You can specify integers by adding “L” after the number.
Example Code:
# Integer data type example
a <- 25L
b <- 100L
class(a) # Output: "integer"
class(b) # Output: "integer"
3. Character Data Type
Character type stores text or string values.
Strings are enclosed within single (‘ ’) or double (“ ”) quotes.
Example Code:
# Character data type example
name <- "Dharun"
city <- 'Chennai'
class(name) # Output: "character"
class(city) # Output: "character"
4. Logical Data Type
Logical type stores Boolean values — either TRUE or FALSE.
It is used in conditions and comparisons.
Example Code:
# Logical data type example
x <- 5 > 3
y <- 10 < 8
class(x) # Output: "logical"
class(y) # Output: "logical"
5. Complex Data Type
Complex type is used to store numbers with both real and imaginary parts.
Example Code:
# Complex data type example
z1 <- 3 + 2i
z2 <- 5 - 4i
class(z1) # Output: "complex"
class(z2) # Output: "complex"
6. Raw Data Type
Raw type is used to store data in its raw byte form.
It is useful for low-level operations like file or binary data handling.
Example Code:
# Raw data type example
r <- charToRaw("ABC")
r        # Output: 41 42 43
class(r) # Output: "raw"
DATA STRUCTURES IN R LANGUAGE
A data structure in R is a way to store and organize data efficiently for analysis and computation.
R provides different types of data structures depending on how data elements are arranged and accessed.
1. VECTOR
A vector is the simplest data structure in R.
It is a sequence of elements that are of the same data type (numeric, character, or logical).
Example Code:
# Vector examples
numeric_vector <- c(10, 20, 30, 40)
character_vector <- c("R", "Language", "Learning")
logical_vector <- c(TRUE, FALSE, TRUE)
# Display class
class(numeric_vector) # Output: "numeric"
print(character_vector)
Note: All elements in a vector must be of the same type.
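When elements of different types are combined, R silently coerces the whole vector to the most flexible type present (logical → integer → numeric → character). A quick sketch:

```r
# Type coercion in vectors: mixed inputs are promoted to one common type
mixed1 <- c(1, "R", TRUE)    # everything becomes character
class(mixed1)                # "character"
mixed2 <- c(1L, 2.5, TRUE)   # logical and integer are promoted to numeric
class(mixed2)                # "numeric"
```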
2. LIST
A list can hold elements of different data types such as numbers, strings, vectors, or even other lists.
Lists are useful for storing complex or mixed data.
Example Code:
# List example
my_list <- list(Name = "Dharun", Age = 22, Marks = c(80, 85, 90))
print(my_list)
# Access elements
my_list$Name
Note: Lists are flexible because they can store varied data under one object.
3. MATRIX
A matrix is a two-dimensional data structure that holds elements of the same data type arranged in rows
and columns.
Example Code:
# Matrix example
matrix_data <- matrix(1:9, nrow = 3, ncol = 3)
print(matrix_data)
# Access elements
matrix_data[2, 3] # Element in 2nd row, 3rd column
4. ARRAY
An array is similar to a matrix but can have more than two dimensions.
It can store data in multiple layers.
Example Code:
# Array example
array_data <- array(1:12, dim = c(3, 2, 2))
print(array_data)
5. DATA FRAME
A data frame is a table-like data structure where each column can contain different data types (numeric,
character, logical).
It is one of the most commonly used structures for datasets.
Example Code:
# Data frame example
student_data <- data.frame(
Name = c("Dharun", "Ravi", "Kumar"),
Age = c(22, 21, 23),
Marks = c(85, 90, 88)
)
print(student_data)
# Access specific column
student_data$Name
Note: Data frames are widely used for importing and analyzing datasets in R.
6. FACTOR
A factor is used to store categorical data such as gender, grade, or region.
It assigns integer values to represent text labels internally.
Example Code:
# Factor example
gender <- factor(c("Male", "Female", "Female", "Male"))
print(gender)
levels(gender)
Note: Factors are important for statistical modeling and categorical data analysis.
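The internal integer codes mentioned above can be inspected directly:

```r
# Factors store integer codes plus a levels attribute
gender <- factor(c("Male", "Female", "Female", "Male"))
levels(gender)      # "Female" "Male"  (alphabetical by default)
as.integer(gender)  # 2 1 1 2 — each code indexes into the levels
```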
SUMMARY TABLE
Structure    Dimension  Data Type    Description                                Example
Vector       1D         Same         Sequence of elements of the same type      c(1,2,3,4)
List         1D         Different    Collection of elements of different types  list("R", 10, TRUE)
Matrix       2D         Same         Elements arranged in rows and columns      matrix(1:9, 3, 3)
Array        nD         Same         Multi-dimensional extension of a matrix    array(1:12, c(3,2,2))
Data Frame   2D         Different    Table-like structure with mixed columns    data.frame(Name, Age)
Factor       1D         Categorical  Stores categorical variables               factor(c("Male","Female"))
STEPS IN LOADING PACKAGES IN R
Meaning
✓ In R, a package is a collection of functions, data, and documentation that extends the capabilities
of R.
For example, packages like ggplot2, dplyr, and readxl are used for data visualization, manipulation,
and Excel file handling.
✓ Before using a package, it must be installed and then loaded into the R environment.
Steps to Load Packages in R
Step 1: Install the Package
✓ Before loading a package for the first time, you must install it from CRAN (Comprehensive R
Archive Network).
Use the install.packages() function to do this.
Example Code:
# Step 1: Installing a package
install.packages("ggplot2")
✓ Explanation:
This command downloads and installs the package ggplot2 from CRAN into your system library.
Step 2: Load the Package into R
After installation, load the package into the current R session using the library() function.
Example Code:
# Step 2: Loading the package
library(ggplot2)
Explanation:
This command loads ggplot2 so that its functions can be used in your R program.
Step 3: Check if the Package is Loaded
You can verify whether a package is loaded using the search() or sessionInfo() functions.
Example Code:
# Step 3: Checking loaded packages
search()       # Shows currently loaded packages
sessionInfo()  # Displays R version and attached packages
Step 4: Use Functions from the Package
Once the package is loaded, you can start using its functions directly.
Example Code:
# Step 4: Using a function from ggplot2 package
data(mpg)
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
Step 5: Update or Remove a Package (Optional)
If needed, you can update or uninstall a package using these commands:
Example Code:
# Updating a package (the first argument of update.packages() is a library
# path, so the package name must be passed as oldPkgs)
update.packages(oldPkgs = "ggplot2")
# Removing a package
remove.packages("ggplot2")
Summary Table
Step Command Purpose
Step 1 install.packages("packagename") Installs the package from CRAN
Step 2 library(packagename) Loads the package into the R session
Step 3 search() or sessionInfo() Checks loaded packages
Step 4 Use package functions Perform tasks using the package
Step 5 update.packages() / remove.packages() Update or uninstall packages
DIFFERENT TYPES OF OPERATORS IN R STUDIO
Meaning:
Operators in R are symbols that tell the R interpreter to perform specific mathematical or logical
computations.
They are used to manipulate variables and values in expressions.
R supports several types of operators such as arithmetic, relational, logical, assignment, and
miscellaneous.
1. Arithmetic Operators
These operators are used to perform basic mathematical calculations like addition, subtraction,
multiplication, and division.
Operator Description Example Result
+ Addition 10 + 5 15
- Subtraction 10 - 5 5
* Multiplication 10 * 5 50
/ Division 10 / 5 2
^ or ** Exponentiation 2^3 8
%% Modulus (remainder) 10 %% 3 1
%/% Integer division 10 %/% 3 3
Example Code:
# Arithmetic Operators
x <- 10
y <- 3
add <- x + y
sub <- x - y
mul <- x * y
div <- x / y
exp <- x ^ y
mod <- x %% y
int_div <- x %/% y
print(add)
print(mod)
2. Relational Operators
These operators are used to compare two values.
The result of a relational operation is always TRUE or FALSE.
Operator Description Example Result
> Greater than 5>3 TRUE
< Less than 5<3 FALSE
== Equal to 5 == 5 TRUE
!= Not equal to 5 != 3 TRUE
>= Greater than or equal to 5 >= 3 TRUE
<= Less than or equal to 5 <= 3 FALSE
Example Code:
# Relational Operators
a <- 5
b <- 3
print(a > b)   # TRUE
print(a == b)  # FALSE
print(a != b)  # TRUE
3. Logical Operators
Logical operators are used to combine or test logical (TRUE/FALSE) values.
Operator  Description                              Example          Result
&         Element-wise AND                         TRUE & FALSE     FALSE
|         Element-wise OR                          TRUE | FALSE     TRUE
!         NOT operator                             !TRUE            FALSE
&&        Logical AND (checks first element only)  (5>3) && (2>1)   TRUE
||        Logical OR (checks first element only)   (5>3) || (2>1)   TRUE
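A short example of these operators (note that && and || expect single values, not whole vectors):

```r
# Logical Operators
v1 <- c(TRUE, FALSE)
v2 <- c(TRUE, TRUE)
print(v1 & v2)             # element-wise AND: TRUE FALSE
print(v1 | v2)             # element-wise OR:  TRUE TRUE
print(!v1)                 # NOT:              FALSE TRUE
print((5 > 3) && (2 > 1))  # single-value AND: TRUE
print((5 > 3) || (2 > 1))  # single-value OR:  TRUE
```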
4. Assignment Operators
Assignment operators are used to assign values to variables.
Operator  Description                                       Example
<-        Assigns value to a variable (most common in R)    x <- 10
->        Assigns value to a variable (rightward)           10 -> x
=         Also used to assign a value                       x = 10
Example Code:
# Assignment Operators
x <- 20
y = 30
40 -> z
print(x)
print(y)
print(z)
5. Miscellaneous Operators
These are special-purpose operators used for specific data handling tasks.
Operator  Description            Example                              Result
:         Sequence operator      1:5                                  1 2 3 4 5
%in%      Membership operator    2 %in% c(1,2,3)                      TRUE
%*%       Matrix multiplication  matrix(1:4,2,2) %*% matrix(1:4,2,2)  Matrix result
Example Code:
# Miscellaneous Operators m1 <- matrix(1:4, nrow = 2)
print(1:5) m2 <- matrix(1:4, nrow = 2)
print(2 %in% c(1, 2, 3)) print(m1 %*% m2)

SUMMARY TABLE
Type of Operator Purpose Example
Arithmetic Operators Perform mathematical operations 10 + 5, 10 %% 3
Relational Operators Compare values 5 > 3, 5 == 5
Logical Operators Combine or test conditions TRUE & FALSE, !TRUE
Assignment Operators Assign values to variables x <- 10, x = 20
Miscellaneous Operators Special tasks like sequence or matrix operations 1:5, %in%, %*%
MAIN FUNCTIONS IN R STUDIO
Meaning
Functions in R are predefined sets of instructions that perform specific tasks such as mathematical
operations, data manipulation, or analysis.
A function in R usually has the format:
function_name(arguments)
For example:
sum(10, 20)
R also allows users to create their own custom functions using the function() keyword.
1. Mathematical Functions
These functions perform mathematical and arithmetic operations on numeric data.
Common Mathematical Functions:
Function Description Example Code Result
sum() Adds all values sum(10, 20, 30) 60
mean() Calculates average mean(c(10, 20, 30)) 20
max() Returns maximum value max(c(5, 10, 15)) 15
min() Returns minimum value min(c(5, 10, 15)) 5
sqrt() Square root sqrt(16) 4
abs() Absolute value abs(-10) 10
2. Statistical Functions
These functions are used to perform basic statistical analysis.
Function    Description                 Example Code                Result
median()    Finds the middle value      median(c(2, 4, 6, 8, 10))   6
sd()        Standard deviation          sd(c(2, 4, 6, 8, 10))       3.162
var()       Variance                    var(c(2, 4, 6, 8, 10))      10
range()     Minimum and maximum values  range(c(1, 3, 5, 7))        1 7
quantile()  Returns quantiles           quantile(c(1, 2, 3, 4, 5))  1 2 3 4 5 (at 0%, 25%, 50%, 75%, 100%)

Example Code:
# Statistical Functions
data <- c(2, 4, 6, 8, 10)
median(data)
sd(data)
var(data)
range(data)
quantile(data)
3. Character Functions
These functions are used to handle and manipulate text (character) data.
Function Description Example Code Result
nchar() Counts number of characters nchar("R Language") 10
toupper() Converts to uppercase toupper("r language") "R LANGUAGE"
tolower() Converts to lowercase tolower("R LANGUAGE") "r language"
substr() Extracts substring substr("RStudio", 1, 3) "RSt"
paste() Combines strings paste("R", "Studio") "R Studio"
Example Code:
# Character Functions
name <- "R Language"
nchar(name)
toupper(name)
tolower(name)
substr(name, 1, 4)
paste("R", "Studio")
4. Sequence and Repetition Functions
Used to generate numeric sequences or repeat elements.
Function  Description         Example Code      Result
seq()     Creates a sequence  seq(1, 10, by=2)  1 3 5 7 9
rep()     Repeats values      rep(5, times=4)   5 5 5 5
Example Code:
# Sequence and Repetition Functions
seq(1, 10, by = 2)
rep(5, times = 4)
5. Data Manipulation Functions
These functions are used to handle data objects like vectors, lists, and data frames.
Function Description Example Code
length() Returns number of elements length(c(1, 2, 3, 4))
sort() Sorts data in order sort(c(5, 2, 8, 1))
unique() Removes duplicates unique(c(1, 2, 2, 3, 3, 4))
append() Adds elements to vector append(c(1,2,3), 4)
rev() Reverses order rev(c(1,2,3,4))

Example Code:
# Data Manipulation Functions
x <- c(5, 2, 8, 2, 1)
length(x)
sort(x)
unique(x)
append(x, 10)
rev(x)
6. User-Defined Function
R allows you to create your own custom functions using the function() keyword.
Example Code:
# User-defined function
add_numbers <- function(a, b) {
result <- a + b
return(result)
}
add_numbers(10, 20) # Output: 30
Explanation:
This function named add_numbers() takes two inputs, adds them, and returns the sum.
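Function parameters can also carry default values, which callers may override. A small sketch (the greet() function here is hypothetical, not from the notes above):

```r
# User-defined function with a default argument
greet <- function(name, greeting = "Hello") {
  paste(greeting, name)   # paste() joins the two strings with a space
}
greet("Dharun")             # uses the default  -> "Hello Dharun"
greet("Dharun", "Welcome")  # overrides it      -> "Welcome Dharun"
```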
SUMMARY TABLE
Category Examples of Functions Purpose
Mathematical sum(), mean(), sqrt(), abs() Perform calculations
Statistical median(), sd(), var(), range() Analyze data
Character nchar(), toupper(), paste() Handle text data
Sequence/Repetition seq(), rep() Create series of numbers
Data Manipulation length(), sort(), unique() Modify data objects
User-Defined function() Create custom functions
DATA IMPORT AND EXPORT IN R
Meaning
Data import and export are processes of bringing external data into R for analysis and saving
processed data from R into external files.
R can handle multiple file formats like CSV, Excel, TXT, and RData.
1. IMPORTING DATA INTO R
a) Importing CSV Files
CSV (Comma Separated Values) is one of the most common formats.
R Code:
# Importing CSV file
data <- read.csv("C:/Users/Dharun/Documents/data.csv", header = TRUE)
head(data) # Displays first 6 rows
Explanation:
• "header = TRUE" means the first row contains column names.
• head() shows the first few rows to verify data.
b) Importing TXT Files
TXT files can also be imported as a table.
R Code:
# Importing TXT file
data_txt <- read.table("C:/Users/Dharun/Documents/data.txt", header = TRUE, sep = "\t")
head(data_txt)
Explanation:
• sep = "\t" specifies tab-separated values.
• Can be changed to "," for CSV-like text files.
c) Importing Excel Files
R can import Excel files using packages like readxl.
R Code:
# Install and load package
install.packages("readxl")
library(readxl)
# Import Excel file
data_excel <- read_excel("C:/Users/Dharun/Documents/data.xlsx", sheet = 1)
head(data_excel)
Explanation:
• sheet = 1 selects the first sheet in the Excel file.
d) Importing RData or RDS Files
R native file formats for saving R objects.
R Code:
# Load RData file
load("C:/Users/Dharun/Documents/data.RData")
# Load RDS file
my_data <- readRDS("C:/Users/Dharun/Documents/data.rds")
2. EXPORTING DATA FROM R
a) Export to CSV
You can save a data frame into a CSV file.
R Code:
# Export data frame to CSV
write.csv(data, "C:/Users/Dharun/Documents/output.csv", row.names = FALSE)
Explanation:
• row.names = FALSE prevents adding row numbers as a separate column.
b) Export to TXT
Data can be saved as a TXT file.
R Code:
# Export data frame to TXT
write.table(data, "C:/Users/Dharun/Documents/output.txt", sep = "\t", row.names = FALSE)
c) Export to Excel
R can save data to Excel using the writexl package.
R Code:
# Install and load package
install.packages("writexl")
library(writexl)
# Export to Excel
write_xlsx(data, "C:/Users/Dharun/Documents/output.xlsx")
d) Save R Objects
Save R objects in native format for later use.
R Code:
# Save data frame in RData format
save(data, file = "C:/Users/Dharun/Documents/data.RData")
# Save as RDS file
saveRDS(data, file = "C:/Users/Dharun/Documents/data.rds")
Explanation:
• save() can store multiple objects together.
• saveRDS() saves a single object that can be read using readRDS().
SUMMARY TABLE
Operation Function File Type Example
Import CSV read.csv() CSV data <- read.csv("file.csv")
Import TXT read.table() TXT data <- read.table("file.txt", sep="\t")
Import Excel read_excel() XLS/XLSX data_excel <- read_excel("file.xlsx")
Export CSV write.csv() CSV write.csv(data, "output.csv")
Export TXT write.table() TXT write.table(data, "output.txt")
Export Excel write_xlsx() XLSX write_xlsx(data, "output.xlsx")
Save R Object save() / saveRDS() RData / RDS save(data, file="data.RData")
DATA PREPROCESSING IN R
Meaning
✓ Data preprocessing is the process of cleaning, transforming, and preparing raw data for
analysis.
It ensures that the data is accurate, consistent, and suitable for statistical analysis, machine
learning, or visualization.
✓ Raw data often contains missing values, duplicates, noise, or inconsistent formats, which can
affect the quality of analysis.
Preprocessing is therefore a critical step in data analytics.
Main Steps in Data Preprocessing
1. Handling Missing Values
Missing values can cause errors in analysis and must be addressed.
Methods include removing, replacing, or imputing missing values.
R Code:
# Detect missing values
sum(is.na(data))
# Remove rows with missing values
data_clean <- na.omit(data)
# Replace missing values with mean (for numeric columns)
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
2. Removing Duplicates
Duplicate records can skew analysis and need to be removed.
R Code:
# Remove duplicate rows
data_unique <- data[!duplicated(data), ]
3. Data Transformation
Data may need to be transformed to improve analysis.
Common transformations include normalization, scaling, and converting data types.
R Code:
# Normalize numeric data (Min-Max scaling)
data$Salary <- (data$Salary - min(data$Salary)) / (max(data$Salary) - min(data$Salary))
# Convert character to factor
data$Gender <- as.factor(data$Gender)
4. Handling Outliers
Outliers can distort statistical analysis.
They can be detected using boxplots or statistical methods and handled by removal or transformation.
R Code:
# Detect outliers using boxplot
boxplot(data$Salary)
# Remove outliers outside 1.5*IQR
Q1 <- quantile(data$Salary, 0.25)
Q3 <- quantile(data$Salary, 0.75)
IQR <- Q3 - Q1
data_no_outliers <- data[data$Salary >= (Q1 - 1.5*IQR) & data$Salary <= (Q3 + 1.5*IQR), ]
5. Encoding Categorical Variables
Machine learning algorithms require numeric inputs.
Categorical variables can be encoded as numbers.
R Code:
# One-hot encoding using model.matrix
encoded_data <- model.matrix(~ Gender - 1, data = data)
6. Feature Selection
Choose the most important variables for analysis to improve model performance.
R Code:
# Select specific columns
data_selected <- data[, c("Age", "Salary", "Department")]
Summary Table
Step                          Purpose                        R Function / Code
Handle Missing Values         Remove or impute missing data  is.na(), na.omit(), replace with mean/median
Remove Duplicates             Avoid redundancy               !duplicated()
Data Transformation           Scale or normalize data        (x - min(x)) / (max(x) - min(x)), as.factor()
Handle Outliers               Remove extreme values          boxplot(), IQR method
Encode Categorical Variables  Convert text to numeric        model.matrix(), as.factor()
Feature Selection             Select important variables     Subset columns data[, c(...)]
Purpose of Data Preprocessing
• Ensures accuracy and consistency of data.
• Improves quality of analysis and models.
• Reduces errors and biases caused by missing or noisy data.
• Converts raw data into a usable format for R analysis or machine learning.
UNIT – 2
Meaning of Data Wrangling
Data wrangling (also called data munging) is the process of cleaning, transforming, and organizing raw
data into a structured and usable format for analysis.
Raw data is often messy, incomplete, inconsistent, or unstructured, which makes it difficult to analyze
directly. Data wrangling ensures that the data is accurate, consistent, and ready for statistical analysis or
machine learning.
Key Steps in Data Wrangling
1. Data Cleaning – Handling missing values, duplicates, and errors.
2. Data Transformation – Converting data types, normalizing, or scaling values.
3. Data Integration – Combining multiple datasets into one consistent dataset.
4. Data Enrichment – Adding derived variables or new features for analysis.
5. Data Validation – Ensuring data quality and correctness.
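The steps above can be sketched as one small base-R pipeline (the Age and Score columns, and the Passed flag, are hypothetical examples):

```r
# A minimal data-wrangling sketch in base R
raw <- data.frame(Age = c(22, NA, 22, 30), Score = c("85", "90", "85", "88"))

raw <- raw[!duplicated(raw), ]                         # cleaning: drop duplicate rows
raw$Age[is.na(raw$Age)] <- mean(raw$Age, na.rm = TRUE) # cleaning: impute missing Age
raw$Score <- as.numeric(raw$Score)                     # transformation: fix the data type
raw$Passed <- raw$Score >= 50                          # enrichment: derived variable
stopifnot(!anyNA(raw))                                 # validation: no missing values remain
```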
Purpose of Data Wrangling
• Makes raw data usable for analysis.
• Reduces errors and inconsistencies in datasets.
• Prepares data for visualization, reporting, or machine learning models.
MEANING OF dplyr AND tidyr
1. dplyr
• dplyr is an R package designed for data manipulation and transformation.
• It provides simple and readable functions to filter, select, arrange, summarize, and mutate data
frames.
• Main use: cleaning, summarizing, and reshaping datasets for analysis.
Key Features:
• Filter rows (filter())
• Select columns (select())
• Arrange/sort data (arrange())
• Add new variables (mutate())
• Summarize data (summarise() + group_by())
2. tidyr
• tidyr is an R package designed for tidying and reshaping data.
• It helps convert messy data into tidy format, where each variable is a column and each observation
is a row.
Key Features:
• Pivot longer/wider (pivot_longer(), pivot_wider())
• Separate or unite columns (separate(), unite())
• Fill missing values (fill())
HOW TO USE dplyr AND tidyr IN R STUDIO
Step 1: Install and Load Packages
# Install packages (run only once)
install.packages("dplyr")
install.packages("tidyr")
# Load packages
library(dplyr)
library(tidyr)
Step 2: Using dplyr for Data Cleaning
a) Filter Rows
# Filter rows where Age > 25
clean_data <- filter(data, Age > 25)
b) Select Columns
# Select only Name and Salary columns
clean_data <- select(data, Name, Salary)
c) Arrange / Sort Data
# Sort data by Salary in descending order
clean_data <- arrange(data, desc(Salary))
d) Create or Modify Columns
# Add a new column Bonus = 10% of Salary
clean_data <- mutate(data, Bonus = Salary * 0.1)
e) Summarize Data
# Calculate average Salary by Department
summary_data <- data %>%
group_by(Department) %>%
summarise(Average_Salary = mean(Salary, na.rm = TRUE))
Step 3: Using tidyr for Data Cleaning
a) Pivot Longer (Convert wide to long format)
# Convert wide data to long format
long_data <- pivot_longer(data, cols = c(Month1, Month2, Month3),
                          names_to = "Month", values_to = "Sales")
b) Pivot Wider (Convert long to wide format)
# Convert long data to wide format
wide_data <- pivot_wider(long_data, names_from = Month, values_from = Sales)
c) Separate a Column
# Split 'FullName' into 'FirstName' and 'LastName'
data <- separate(data, col = FullName, into = c("FirstName", "LastName"), sep = " ")
d) Unite Columns
# Combine 'City' and 'State' into 'Location'
data <- unite(data, col = "Location", City, State, sep = ", ")
e) Fill Missing Values
# Fill missing values in Sales column with previous value
data <- fill(data, Sales, .direction = "down")
Summary Table
Package Function Purpose
dplyr filter(), select(), arrange(), mutate(), summarise() Manipulate and clean data frames
tidyr pivot_longer(), pivot_wider(), separate(), unite(), fill() Tidy and reshape messy data
DATA TRANSFORMATION USING dplyr AND tidyr
Meaning
Data transformation involves modifying or reshaping your dataset to make it suitable for analysis.
Using dplyr and tidyr, you can:
• Create new variables
• Summarize data
• Reshape data from wide to long (or vice versa)
• Handle missing or inconsistent values
1. Transformation Using dplyr
a) Creating or Modifying Columns (mutate)
# Add a new column Total_Score as sum of Math and Science
data <- data %>%
mutate(Total_Score = Math + Science,
Average_Score = (Math + Science)/2)
b) Filtering Rows (filter)
# Keep only students with Total_Score greater than 150
data_transformed <- data %>%
filter(Total_Score > 150)
c) Summarizing Data (summarise)
# Average score by class
summary_data <- data %>%
group_by(Class) %>%
summarise(Average_Math = mean(Math, na.rm = TRUE),
Average_Science = mean(Science, na.rm = TRUE))
d) Selecting and Renaming Columns (select)
# Select only Name, Total_Score and rename columns
data_transformed <- data %>%
select(Student_Name = Name, Total_Score, Average_Score)
2. Transformation Using tidyr
a) Pivot Longer (Wide → Long)
# Convert wide scores to long format
long_data <- data %>%
pivot_longer(cols = c(Math, Science),
names_to = "Subject",
values_to = "Score")
Explanation:
• Each subject now becomes a row under the column Subject.
• Scores are stored under the column Score.
b) Pivot Wider (Long → Wide)
# Convert long data back to wide format
wide_data <- long_data %>%
pivot_wider(names_from = Subject, values_from = Score)
Explanation:
• Reverts long format to original wide format, with one column per subject.
c) Separate Columns
# Split Student_Name into First_Name and Last_Name
data <- data %>%
separate(col = Student_Name, into = c("First_Name", "Last_Name"), sep = " ")
d) Unite Columns
# Combine Class and Section into Class_Section
data <- data %>%
unite(col = "Class_Section", Class, Section, sep = "-")
e) Fill Missing Values
# Fill missing Score values downward
long_data <- long_data %>%
fill(Score, .direction = "down")
Summary Table of Transformations
Package Function Purpose
dplyr mutate() Add or modify columns
dplyr filter() Keep rows that meet conditions
dplyr summarise() Aggregate data by groups
dplyr select() Select or rename columns
tidyr pivot_longer() Convert wide to long format
tidyr pivot_wider() Convert long to wide format
tidyr separate() Split a column into multiple columns
tidyr unite() Combine multiple columns into one
tidyr fill() Fill missing values
STRING MANIPULATION USING stringr
Meaning
The stringr package in R is designed for handling and manipulating strings (text data) in a simple
and consistent way.
It provides a set of easy-to-use functions to clean, extract, modify, and analyze text.
Step 1: Install and Load stringr
# Install stringr package (if not already installed)
install.packages("stringr")
# Load package
library(stringr)
Step 2: Common String Manipulation Functions
1. Detecting Patterns (str_detect)
# Check if "R" is present in each string
text <- c("I love R", "Python is great", "Data Science")
str_detect(text, "R")
# Output: TRUE FALSE FALSE
2. Counting Occurrences (str_count)
# Count number of "a" in each string
str_count(text, "a")
3. Extracting Substrings (str_sub)
# Extract first 5 characters from each string
str_sub(text, 1, 5)
# Output: "I lov" "Pytho" "Data "
4. Replacing Text (str_replace / str_replace_all)
# Replace first occurrence of "R" with "Python"
str_replace(text, "R", "Python")
# Replace all occurrences of "a" with "@"
str_replace_all(text, "a", "@")
5. Splitting Strings (str_split)
# Split strings by space
str_split(text, " ")
# Output: List of words in each sentence
6. Combining Strings (str_c)
# Combine two strings with a separator
str_c("Hello", "World", sep = " - ")
# Output: "Hello - World"
7. Trimming Whitespace (str_trim)
# Remove leading and trailing spaces
text2 <- c(" R Programming ", " Data Science ")
str_trim(text2)
# Output: "R Programming" "Data Science"
8. Changing Case
# Convert to uppercase
str_to_upper(text)
# Convert to lowercase
str_to_lower(text)
# Convert to title case
str_to_title(text)
Summary Table of Useful stringr Functions
Function Purpose Example
str_detect() Check if pattern exists str_detect(text, "R")
str_count() Count occurrences str_count(text, "a")
str_sub() Extract substring str_sub(text, 1, 5)
str_replace() Replace first match str_replace(text, "R", "Python")
str_replace_all() Replace all matches str_replace_all(text, "a", "@")
str_split() Split string by delimiter str_split(text, " ")
str_c() Concatenate strings str_c("Hello", "World", sep="-")
str_trim() Remove extra spaces str_trim(text2)
str_to_upper()/str_to_lower()/str_to_title() Change case str_to_upper(text)
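Several of these functions are often applied in sequence to clean a messy text column; a small sketch (the sample vector is made up for illustration):

```r
library(stringr)

# Illustrative messy names
raw <- c("  john SMITH ", "MARY  jones  ")

clean <- str_trim(raw)                      # drop leading/trailing spaces
clean <- str_replace_all(clean, " +", " ")  # collapse repeated spaces
clean <- str_to_title(clean)                # "John Smith" "Mary Jones"
```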
DATA VISUALIZATION IN R
Meaning
Data visualization in R is the process of representing data graphically using charts, plots, and graphs to
identify patterns, trends, and insights.
It helps convert raw data into a visual format that is easier to understand and interpret.
R provides various packages such as ggplot2, lattice, and plotly to create high-quality visualizations.
Features of Data Visualization in R
1. Wide Range of Plots: Supports bar charts, line charts, scatter plots, histograms, boxplots, heatmaps,
and more.
2. High Customization: You can change colors, labels, titles, axes, themes, and styles.
3. Interactive Visualization: Packages like plotly allow zooming, hovering, and interactive
dashboards.
4. Integration with Data Analysis: Works seamlessly with dplyr, tidyr, and other data cleaning tools.
5. Supports Large Datasets: Can handle large and complex datasets efficiently.
Advantages of Data Visualization in R
1. Better Understanding: Makes complex data easy to interpret and analyze.
2. Pattern Identification: Helps detect trends, correlations, and outliers in data.
3. Effective Communication: Graphical representations are more understandable than raw tables.
4. Decision Making: Aids business analysts and decision-makers in taking informed actions.
5. Supports Reporting: Produces high-quality visualizations for reports and presentations.
1. ggplot2
Meaning
• ggplot2 is an R package for data visualization based on the Grammar of Graphics.
• It allows users to create complex and high-quality plots by combining layers such as data,
aesthetics, and geometric objects.
• Widely used for scatter plots, bar charts, line charts, boxplots, and histograms.
Installation
# Install ggplot2 package
install.packages("ggplot2")
# Load ggplot2
library(ggplot2)
Basic Usage
# Example dataset
data <- data.frame(
Category = c("A", "B", "C"),
Value = c(10, 20, 15)
)
# Bar plot using ggplot2
ggplot(data, aes(x = Category, y = Value)) +
geom_bar(stat = "identity", fill = "skyblue") +
ggtitle("Bar Chart Example") +
xlab("Category") + ylab("Value")
Exporting ggplot2 Plots
# Save plot as PNG
ggsave("barplot.png", width = 6, height = 4)
# Save plot as PDF
ggsave("barplot.pdf", width = 6, height = 4)
2. Lattice
Meaning
• lattice is an R package for multivariate data visualization.
• It is especially good for conditioning plots, i.e., plotting subsets of data in panels.
• Useful for scatterplots, histograms, and density plots with multiple grouping variables.
Installation
# Install lattice package
install.packages("lattice")
# Load lattice
library(lattice)
Basic Usage
# Example dataset
data <- data.frame(
x = rnorm(50),
y = rnorm(50),
group = rep(c("A", "B"), each = 25)
)
# Scatter plot with lattice
xyplot(y ~ x | group, data = data,
layout = c(2,1),
main = "Scatter Plot by Group",
xlab = "X Values", ylab = "Y Values")
Exporting Lattice Plots
# Save plot as PNG
png("lattice_plot.png", width=600, height=400)
xyplot(y ~ x | group, data = data)
dev.off()
3. highcharter
Meaning
• highcharter is an R wrapper for the Highcharts JavaScript library.
• It allows creating interactive charts and dashboards with features like tooltips, zooming, and
clickable legends.
• Useful for web-based dashboards and dynamic visualization.
Installation
# Install highcharter package
install.packages("highcharter")
# Load highcharter
library(highcharter)
Basic Usage
# Example dataset
data <- data.frame(
Category = c("A", "B", "C"),
Value = c(10, 20, 15)
)
# Create interactive column chart
highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Highcharter Example") %>%
hc_xAxis(categories = data$Category) %>%
hc_add_series(name = "Value", data = data$Value)
Exporting highcharter Plots
# Save highcharter as HTML
library(htmlwidgets)
hc <- highchart() %>%
hc_chart(type = "column") %>%
hc_xAxis(categories = data$Category) %>%
hc_add_series(name = "Value", data = data$Value)
saveWidget(hc, "highchart_plot.html", selfcontained = TRUE)
Summary Table
Package Purpose Key Features Export Options
ggplot2 High-quality static plots Layered plots: bar, line, scatter PNG, PDF, JPEG via ggsave()
lattice Multivariate and conditional plots Panels, scatter, histogram, density PNG, PDF via png() / pdf()
highcharter Interactive charts Dynamic plots, tooltips, zoom HTML via saveWidget()
1. RColorBrewer
Meaning
• RColorBrewer is an R package used for color palettes in data visualization.
• It provides predefined color schemes that are visually appealing and suitable for maps, charts, and
plots.
• Useful to enhance readability and aesthetics in plots.
Installation
# Install RColorBrewer package
install.packages("RColorBrewer")
# Load package
library(RColorBrewer)
Usage Example
# Display all color palettes
display.brewer.all()
# Create a bar plot with RColorBrewer palette
library(ggplot2)
data <- data.frame(
Category = c("A", "B", "C"),
Value = c(10, 20, 15)
)
ggplot(data, aes(x = Category, y = Value, fill = Category)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set2") +
ggtitle("Bar Plot with RColorBrewer Colors")
Explanation:
• scale_fill_brewer(palette = "Set2") applies the Set2 color palette to the bars.
2. Plotly
Meaning
• Plotly is an R package for interactive plots.
• It allows users to zoom, hover, and interact with charts.
• Works well with ggplot2 to make plots interactive.
Installation
# Install plotly package
install.packages("plotly")
# Load package
library(plotly)
Basic Usage
a) Interactive Scatter Plot
data <- data.frame(
x = c(1, 2, 3, 4, 5),
y = c(10, 15, 13, 17, 20)
)
plot_ly(data, x = ~x, y = ~y, type = 'scatter', mode = 'markers') %>%
layout(title = "Interactive Scatter Plot",
xaxis = list(title = "X Values"),
yaxis = list(title = "Y Values"))
b) Interactive Bar Plot
plot_ly(data, x = ~x, y = ~y, type = 'bar', name = 'Values') %>%
layout(title = "Interactive Bar Plot")
c) Converting ggplot2 to Interactive Plotly
library(ggplot2)
library(plotly)
p <- ggplot(data, aes(x = x, y = y)) + geom_line()
ggplotly(p) # Converts ggplot2 to interactive plot
Exporting Plotly Plots
# Save plotly plot as HTML
library(htmlwidgets)
p <- plot_ly(data, x = ~x, y = ~y, type = 'scatter', mode = 'lines')
saveWidget(p, "interactive_plot.html", selfcontained = TRUE)
DATA VISUALIZATION: CHARTS, GRAPHS, AND MAPS IN R STUDIO
Meaning
• Data visualization in R Studio refers to representing data graphically to identify patterns, trends,
and insights.
• It helps communicate information clearly and supports decision-making.
Charts and graphs display numerical and categorical data, while maps visualize geographical
data.
1. Charts in R Studio
Meaning
Charts are graphical representations of data, often used for categorical comparisons.
Types and Examples
a) Bar Chart
library(ggplot2)
data <- data.frame(
Category = c("A", "B", "C"),
Value = c(10, 20, 15)
)
ggplot(data, aes(x = Category, y = Value, fill = Category)) +
geom_bar(stat = "identity") +
ggtitle("Bar Chart Example")
b) Pie Chart
# Pie chart using base R
values <- c(10, 20, 15)
labels <- c("A", "B", "C")
pie(values, labels = labels, col = rainbow(length(values)), main = "Pie Chart Example")
c) Line Chart
ggplot(data, aes(x = Category, y = Value, group = 1)) +
geom_line(color = "blue") +
geom_point(color = "red") +
ggtitle("Line Chart Example")
2. Graphs in R Studio
Meaning
Graphs are used to show relationships between variables, often continuous or numeric data.
Types and Examples
a) Scatter Plot
data <- data.frame(x = c(1,2,3,4,5), y = c(5,7,6,8,9))
ggplot(data, aes(x = x, y = y)) +
geom_point(color = "darkgreen", size = 3) +
ggtitle("Scatter Plot Example")
b) Histogram
data <- data.frame(values = c(5,7,6,8,9,5,7,6,8,9))
ggplot(data, aes(x = values)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
ggtitle("Histogram Example")
c) Boxplot
data <- data.frame(
Category = rep(c("A","B","C"), each = 5),
Value = c(5,6,7,5,6,7,8,6,7,8,9,8,7,8,9)
)
ggplot(data, aes(x = Category, y = Value, fill = Category)) +
geom_boxplot() +
ggtitle("Boxplot Example")
3. Maps in R Studio
Meaning
Maps are used to visualize geographical data, such as locations, regions, or spatial patterns.
Packages
• ggplot2 + maps or mapdata for static maps
• leaflet for interactive maps
a) Static Map Example (ggplot2 + maps)
library(ggplot2)
library(maps)
world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "lightblue", color = "black") +
ggtitle("World Map Example")
b) Interactive Map Example (leaflet)
library(leaflet)
leaflet() %>%
addTiles() %>%
addMarkers(lng = 77.2090, lat = 28.6139, popup = "New Delhi")
Explanation:
• addTiles() adds the base map.
• addMarkers() places interactive markers with popups.
Summary Table
Type Purpose Example Packages / Functions
Charts Compare categories ggplot2: geom_bar(), geom_line(), pie()
Graphs Show relationships ggplot2: geom_point(), geom_histogram(), geom_boxplot()
Maps Visualize spatial data maps + ggplot2, leaflet for interactive maps
Scenario: Retail Sales Analysis for a Store
Business Context
A retail store wants to analyze its monthly sales performance to understand:
1. Which products are performing well.
2. Which months have high or low sales.
3. The distribution of sales across different regions.
The store has a dataset containing:
Month Product Units_Sold Revenue Region
Jan A 120 2400 North
Jan B 80 1600 South
Feb A 150 3000 North
Feb B 90 1800 South
… … … … …
Objectives
1. Identify top-selling products.
2. Track monthly revenue trends.
3. Compare regional sales performance.
Suggested Visualizations in R
1. Bar Chart – Product Sales
library(ggplot2)
ggplot(data, aes(x = Product, y = Units_Sold, fill = Product)) +
geom_bar(stat = "identity") +
ggtitle("Units Sold by Product") +
xlab("Product") + ylab("Units Sold")
Purpose: Quickly identify which product sells the most.
2. Line Chart – Monthly Revenue Trend
ggplot(data, aes(x = Month, y = Revenue, group = 1)) +
geom_line(color = "blue") +
geom_point(color = "red") +
ggtitle("Monthly Revenue Trend") +
xlab("Month") + ylab("Revenue")
Purpose: Track revenue growth or decline over months.
3. Pie Chart – Regional Revenue Share
library(dplyr)
region_data <- data %>%
group_by(Region) %>%
summarise(Total_Revenue = sum(Revenue))
pie(region_data$Total_Revenue,
labels = region_data$Region,
col = rainbow(length(region_data$Region)),
main = "Revenue Share by Region")
Purpose: Show proportion of revenue contributed by each region.
4. Boxplot – Revenue Distribution by Product
ggplot(data, aes(x = Product, y = Revenue, fill = Product)) +
geom_boxplot() +
ggtitle("Revenue Distribution by Product")
Purpose: Identify variability in revenue and potential outliers.
5. Interactive Scatter Plot – Units vs Revenue (Plotly)
library(plotly)
plot_ly(data, x = ~Units_Sold, y = ~Revenue, color = ~Product, type = 'scatter', mode = 'markers') %>%
layout(title = "Units Sold vs Revenue")
Purpose: Explore relationship between units sold and revenue interactively.
Scenario Summary
• Bar Chart: Best-selling products
• Line Chart: Revenue trends over time
• Pie Chart: Regional revenue distribution
• Boxplot: Revenue variability by product
• Interactive Scatter Plot: Units sold vs revenue
UNIT – 3
1. Descriptive Statistics
Meaning
Descriptive statistics is the branch of statistics that summarizes, organizes, and describes data in a
meaningful way.
It provides numerical and graphical summaries of datasets to understand patterns, trends, and
distributions.
Key Points:
• Focuses on existing data without making predictions.
• Helps in understanding the central tendency, variability, and distribution.
• Often used as the first step in data analysis before applying advanced statistical or predictive
methods.
Common Descriptive Statistics Measures
Measure Type Purpose Example
Central Tendency Shows the “average” or typical value Mean, Median, Mode
Dispersion / Variability Shows spread or variation in data Range, Variance, Standard Deviation
Distribution Shape Shows data distribution Skewness, Kurtosis
Frequency Shows count of occurrences Frequency tables, Percentages
Example in R:
data <- c(10, 15, 20, 25, 30)
mean(data)    # Mean
median(data)  # Median
var(data)     # Variance
sd(data)      # Standard Deviation
summary(data) # Min, 1st Qu., Median, Mean, 3rd Qu., Max
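The distribution-shape and frequency measures from the table above can be computed as well. Base R has table() built in; skewness() and kurtosis() are assumed to come from the e1071 package, which must be installed separately:

```r
# Frequency table for a categorical vector (base R)
grades <- c("A", "B", "A", "C", "B", "A")
table(grades)              # counts per grade
prop.table(table(grades))  # proportions

# Skewness and kurtosis via the e1071 package
# install.packages("e1071")
library(e1071)
values <- c(10, 15, 20, 25, 30)
skewness(values)
kurtosis(values)
```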
2. Data Summarization
Meaning
Data summarization is the process of condensing large datasets into a simpler form to make analysis
easier and faster.
It helps to extract key insights and identify patterns without examining every data point.
Key Points:
• Often done using tables, charts, or aggregated statistics.
• Includes grouping, aggregation, and generating summaries.
• Makes data readable and interpretable for reporting or decision-making.
Common Data Summarization Techniques in R
Technique Purpose Example
Aggregation Summarize numeric data by groups aggregate(Revenue ~ Region, data, sum)
Grouping Create subsets for comparison group_by(data, Product) (dplyr)
Summary Functions Compute statistics mean(), median(), sum(), min(), max()
Frequency Tables Count occurrences of categories table(data$Category)
Example in R using dplyr:
library(dplyr)
data <- data.frame(
Product = c("A", "B", "A", "C", "B"),
Revenue = c(100, 200, 150, 300, 250)
)
# Summarize total revenue by product
data_summary <- data %>%
group_by(Product) %>%
summarise(Total_Revenue = sum(Revenue),
Average_Revenue = mean(Revenue))
TESTING ASSUMPTIONS
Meaning
Testing assumptions refers to the process of checking whether the data meets the requirements
necessary for applying a particular statistical method or model.
Most statistical tests and models, such as t-tests, ANOVA, regression, and parametric tests, require certain
assumptions to be satisfied.
If the assumptions are violated, the results may be invalid or misleading.
Common Assumptions in Statistical Analysis
Assumption Explanation Importance
Normality Data should follow a normal (bell-shaped) distribution Required for t-tests, ANOVA, regression
Homogeneity of Variance Variances across groups should be equal Required for ANOVA, regression
Independence Observations should be independent of each other Required for most tests to avoid bias
Linearity Relationship between independent and dependent variable should be linear Required for correlation and regression
No Multicollinearity Independent variables should not be highly correlated Required for multiple regression
Random Sampling Data should be randomly selected Ensures generalizability of results
How to Test Assumptions in R
1. Normality Test
• Shapiro-Wilk Test
shapiro.test(data$Variable)
• Interpretation:
o p-value > 0.05 → data is normally distributed
o p-value < 0.05 → data is not normally distributed
• Visual Check:
hist(data$Variable)
qqnorm(data$Variable)
qqline(data$Variable)
2. Homogeneity of Variance
• Levene’s Test (from car package)
library(car)
leveneTest(Variable ~ Group, data = data)
• p-value > 0.05 → equal variances
• p-value < 0.05 → variances are unequal
3. Linearity
• Scatter Plot
plot(data$X, data$Y)
abline(lm(Y ~ X, data = data), col="red")
• The plot should show a roughly straight-line relationship.
4. Multicollinearity
• Variance Inflation Factor (VIF)
library(car)
model <- lm(Y ~ X1 + X2 + X3, data = data)
vif(model)
• VIF > 10 indicates high multicollinearity.
1. Parametric Tests
Meaning
• Parametric tests are statistical tests that make assumptions about the population parameters
(e.g., mean, variance) and the underlying distribution of the data.
• Typically, they assume that the data is normally distributed and have homogeneity of variance.
• They are generally more powerful than non-parametric tests if assumptions are satisfied.
Common Parametric Tests in R Studio
Test Purpose R Function Example
t-test Compare means of two groups t.test(x ~ group, data = data)
ANOVA (Analysis of Variance) Compare means of more than two groups aov(Y ~ Group, data = data)
Pearson Correlation Measure linear relationship between two variables cor.test(x, y, method = "pearson")
Linear Regression Model relationship between dependent and independent variables lm(Y ~ X, data = data)
2. Correlation in R
Meaning
• Correlation measures the strength and direction of a linear relationship between two numeric
variables.
• Values range from -1 to +1:
o +1 → perfect positive correlation
o -1 → perfect negative correlation
o 0 → no correlation
R Example
# Example dataset
data <- data.frame(
Sales = c(200, 250, 300, 350, 400),
Advertising = c(50, 60, 65, 70, 80)
)
# Pearson correlation
cor.test(data$Sales, data$Advertising, method = "pearson")
3. Regression in R
Meaning
• Regression is used to model the relationship between a dependent variable (Y) and one or more
independent variables (X).
• Simple Linear Regression: One independent variable
• Multiple Linear Regression: Two or more independent variables
Simple Linear Regression Example
# Linear regression model
model <- lm(Sales ~ Advertising, data = data)
# View summary of the model
summary(model)
Interpretation:
• Coefficients → effect of independent variable on dependent variable
• R-squared → proportion of variance explained by the model
• p-value → significance of the predictor
Multiple Linear Regression Example
data$Price <- c(10, 12, 11, 13, 14)
# Multiple regression
model2 <- lm(Sales ~ Advertising + Price, data = data)
summary(model2)
Summary Table
Test / Method Purpose R Function
t-test Compare means of two groups t.test()
ANOVA Compare means of more than two groups aov()
Pearson Correlation Measure linear relationship cor.test(method="pearson")
Simple Linear Regression Model dependent ~ independent lm(Y ~ X)
Multiple Linear Regression Model dependent ~ multiple independents lm(Y ~ X1 + X2 + ...)
1. Independent Sample t-test
Purpose:
Compare the means of two independent groups to see if they are significantly different.
Example Dataset
# Sample data
data <- data.frame(
Group = rep(c("A", "B"), each = 5),
Score = c(85, 88, 90, 87, 86, 78, 80, 82, 79, 81)
)
# View data
data
t-test in R
t.test(Score ~ Group, data = data)
Interpretation:
• p-value < 0.05 → Significant difference between Group A and B
• p-value > 0.05 → No significant difference
2. One-Way ANOVA
Purpose:
Compare means of more than two groups to test if at least one group mean is different.
Example Dataset
# Sample data
data_anova <- data.frame(
Group = rep(c("A", "B", "C"), each = 5),
Score = c(85, 88, 90, 87, 86, 78, 80, 82, 79, 81, 92, 94, 91, 93, 95)
)
# View data
data_anova
ANOVA in R
anova_result <- aov(Score ~ Group, data = data_anova)
summary(anova_result)
Interpretation:
• p-value < 0.05 → At least one group mean is significantly different
• p-value > 0.05 → No significant difference among groups
3. Pearson Correlation
Purpose:
Measure the linear relationship between two numeric variables.
Example Dataset
# Sample data
data_corr <- data.frame(
Sales = c(200, 250, 300, 350, 400),
Advertising = c(50, 60, 65, 70, 80)
)
# View data
data_corr
Correlation in R
cor.test(data_corr$Sales, data_corr$Advertising, method = "pearson")
Interpretation:
• Correlation coefficient (r) indicates strength and direction:
o Positive → both increase together
o Negative → one increases, other decreases
• p-value < 0.05 → Significant correlation
Summary Table
Test Purpose R Function
t-test Compare means of 2 groups t.test()
ANOVA Compare means of 3+ groups aov()
Pearson Correlation Measure linear relationship cor.test(method="pearson")
1. Linear Regression
Meaning
• Linear regression is used to model the relationship between a dependent variable (Y) and one or
more independent variables (X), assuming a linear relationship.
• It predicts a continuous outcome based on predictor variables.
Assumptions
1. Linearity: Y and X have a linear relationship.
2. Independence of errors.
3. Homoscedasticity: Constant variance of residuals.
4. Normality: Residuals are normally distributed.
Example in R (Simple Linear Regression)
# Sample dataset
data <- data.frame(
Advertising = c(50, 60, 65, 70, 80),
Sales = c(200, 250, 300, 350, 400)
)
# Linear regression model
model <- lm(Sales ~ Advertising, data = data)
# View summary
summary(model)
Interpretation:
• Coefficients → Effect of Advertising on Sales
• R-squared → Proportion of variance explained
• p-value → Significance of predictor
Multiple Linear Regression
data$Price <- c(10, 12, 11, 13, 14)
# Multiple regression
model2 <- lm(Sales ~ Advertising + Price, data = data)
summary(model2)
• Predicts Sales based on multiple predictors (Advertising & Price).
2. Logistic Regression
Meaning
• Logistic regression is used to model the relationship between a binary categorical dependent
variable (Y) and independent variables (X).
• Outcome is 0 or 1 (e.g., Yes/No, Success/Failure).
• Predicts probability of occurrence using the logistic function.
Assumptions
1. Dependent variable is binary.
2. Independent variables can be continuous or categorical.
3. No multicollinearity among predictors.
4. Observations are independent.
Example in R (Binary Logistic Regression)
# Sample dataset
data <- data.frame(
Hours_Studied = c(2, 3, 5, 7, 1, 4),
Pass = c(0, 0, 1, 1, 0, 1) # 0=Fail, 1=Pass
)
# Logistic regression model
model_log <- glm(Pass ~ Hours_Studied, data = data, family = binomial)
# View summary
summary(model_log)
Prediction
# Predict probability of passing
predict(model_log, newdata = data.frame(Hours_Studied = c(6, 2)), type = "response")
• Output: Probability of passing for given hours studied.
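The predicted probabilities can be turned into class labels with a cutoff; 0.5 is a common but arbitrary choice. A self-contained sketch reusing the dataset above (note: the sample is tiny and perfectly separable, so glm() may print a convergence warning, which is harmless here):

```r
# Rebuild the illustrative model, then classify with a 0.5 cutoff
data <- data.frame(
  Hours_Studied = c(2, 3, 5, 7, 1, 4),
  Pass          = c(0, 0, 1, 1, 0, 1)
)
model_log <- glm(Pass ~ Hours_Studied, data = data, family = binomial)

probs <- predict(model_log,
                 newdata = data.frame(Hours_Studied = c(6, 2)),
                 type = "response")
ifelse(probs > 0.5, "Pass", "Fail")  # one label per new student
```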
1. Meaning of Data Cleaning and Preprocessing
Data Cleaning
• Data cleaning is the process of identifying and correcting errors, inconsistencies, or missing
values in the dataset.
• Ensures accuracy, completeness, and reliability of data before analysis.
Data Preprocessing
• Data preprocessing involves transforming raw data into a structured and analyzable format.
• Steps include handling missing values, removing duplicates, standardizing data, and converting
data types.
• Essential for improving the quality of analysis and machine learning models.
2. Common Steps in Data Cleaning & Preprocessing in R
Step 1: Inspecting the Data
# Load dataset
data <- read.csv("data.csv")
# View first few rows
head(data)
# Structure of dataset
str(data)
# Summary statistics
summary(data)
Step 2: Handling Missing Values
# Identify missing values
is.na(data)
# Remove rows with NA
data_clean <- na.omit(data)
# Replace missing values with mean (numeric column)
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
# Replace missing values with median
data$Salary[is.na(data$Salary)] <- median(data$Salary, na.rm = TRUE)
Step 3: Removing Duplicates
# Find duplicates
duplicated(data)
# Remove duplicate rows
data <- data[!duplicated(data), ]
Step 4: Correcting Data Types
# Convert to factor
data$Gender <- as.factor(data$Gender)
# Convert to numeric
data$Income <- as.numeric(data$Income)
# Convert to date
data$Date <- as.Date(data$Date, format="%Y-%m-%d")
Step 5: Renaming Columns
# Rename columns
colnames(data) <- c("ID", "Name", "Age", "Salary", "Gender")
Step 6: Handling Outliers
# Identify outliers using boxplot
boxplot(data$Salary)
# Replace outliers with median
data$Salary[data$Salary > 100000] <- median(data$Salary)
Step 7: Standardizing / Scaling Data
# Min-Max Normalization
data$Age <- (data$Age - min(data$Age)) / (max(data$Age) - min(data$Age))
# Z-score Standardization
data$Salary <- scale(data$Salary)
Step 8: String Cleaning (Removing Whitespaces, Special Characters)
library(stringr)
# Trim whitespaces
data$Name <- str_trim(data$Name)
# Remove special characters
data$Name <- str_replace_all(data$Name, "[^[:alnum:]]", "")
Step 9: Aggregation / Grouping (Optional)
library(dplyr)
# Group by Gender and summarize average Salary
data_summary <- data %>%
group_by(Gender) %>%
summarise(Average_Salary = mean(Salary, na.rm = TRUE))
Common Packages Used
• dplyr → Data manipulation, grouping, summarization
• tidyr → Reshaping and cleaning data (gather, spread)
• stringr → String manipulation and cleaning
• lubridate → Handling date and time
• janitor → Cleaning column names and data frames
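The last two packages in the list can be sketched briefly; both must be installed first, and the column names below are made up for illustration:

```r
# install.packages(c("lubridate", "janitor"))
library(lubridate)
library(janitor)

# lubridate: parse dates and extract components
d <- ymd("2024-03-15")
month(d)                 # month number
wday(d, label = TRUE)    # day of week

# janitor: clean awkward column names
df <- data.frame(`First Name` = "A", `Total Marks` = 90, check.names = FALSE)
clean_names(df)          # columns become first_name, total_marks
```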
Summary Table of Common Syntax
Task R Syntax / Function Purpose
Missing values is.na(), na.omit(), replace() Detect and handle NA
Duplicates duplicated(), !duplicated() Remove duplicate rows
Data types as.factor(), as.numeric(), as.Date() Correct column types
Outliers boxplot(), conditional replacement Identify and correct extreme values
Scaling scale(), normalization formula Standardize numeric data
String cleaning str_trim(), str_replace_all() Clean textual data
Aggregation group_by() %>% summarise() Summarize data by groups
1. Handling Missing Data
Meaning
• Missing data occurs when some values are not recorded or unavailable in the dataset.
• Handling missing data is crucial because it can bias results or affect model accuracy.
Common Methods to Handle Missing Data
a) Identify Missing Values
# Check for missing values in dataset
is.na(data)
# Count missing values per column
colSums(is.na(data))
b) Remove Missing Values
# Remove rows with any NA
data_clean <- na.omit(data)
c) Replace Missing Values
• Replace with Mean (numeric data)
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
• Replace with Median
data$Salary[is.na(data$Salary)] <- median(data$Salary, na.rm = TRUE)
• Replace with Mode (categorical data)
mode_value <- names(sort(table(data$Gender), decreasing=TRUE))[1]
data$Gender[is.na(data$Gender)] <- mode_value
d) Advanced Imputation (Optional)
library(mice)
# Multiple imputation for missing values
imputed_data <- mice(data, m=5, method='pmm', seed=123)
data_complete <- complete(imputed_data)
2. Outlier Detection
Meaning
• Outliers are data points significantly different from other observations.
• Can skew results and affect model performance.
Common Methods to Detect Outliers
a) Boxplot Method
# Visual detection
boxplot(data$Salary, main="Boxplot for Salary")
# Identify outlier values
outliers <- boxplot.stats(data$Salary)$out
outliers
b) Z-Score Method
# Calculate Z-scores
z_scores <- scale(data$Salary)
# Identify outliers (absolute Z-score > 3)
outliers <- data$Salary[abs(z_scores) > 3]
outliers
c) IQR (Interquartile Range) Method
Q1 <- quantile(data$Salary, 0.25)
Q3 <- quantile(data$Salary, 0.75)
IQR <- Q3 - Q1
# Detect outliers
outliers <- data$Salary[data$Salary < (Q1 - 1.5*IQR) | data$Salary > (Q3 + 1.5*IQR)]
outliers
Handling Outliers
1. Remove Outliers
data_no_outliers <- data[!(data$Salary %in% outliers), ]
2. Replace Outliers with Median
median_value <- median(data$Salary)
data$Salary[data$Salary %in% outliers] <- median_value
3. Transform Data
• Apply logarithmic or square root transformation to reduce impact of extreme values:
data$Salary_log <- log(data$Salary)
Summary Table
Task Method R Syntax / Function
Detect missing values Identify NA is.na(), colSums(is.na())
Handle missing values Remove na.omit()
Handle missing values Replace with mean/median/mode data$col[is.na()] <- mean/median/mode
Detect outliers Boxplot boxplot(), boxplot.stats()
Detect outliers Z-score scale(), abs(z_score) > 3
Detect outliers IQR method quantile(), conditional filtering
Handle outliers Remove or Replace data[!(data$col %in% outliers), ]
1. Data Transformation
Meaning
• Data transformation is the process of converting data into a suitable format for analysis.
• Helps in improving interpretability, reducing skewness, and preparing data for modeling.
• Common transformations: log, square root, reciprocal, power, or scaling.
R Examples
# Sample dataset
data <- data.frame(Value = c(10, 50, 200, 500, 1000))
# Log transformation
data$Log_Value <- log(data$Value)
# Square root transformation
data$Sqrt_Value <- sqrt(data$Value)
# Reciprocal transformation
data$Reciprocal <- 1 / data$Value
2. Normalization (Min-Max Scaling)
Meaning
• Normalization scales data to a fixed range, usually 0 to 1.
• Formula: X_norm = (X - X_min) / (X_max - X_min)
• Useful when variables have different units or scales, especially in machine learning.
R Example
# Min-Max normalization
data$Normalized <- (data$Value - min(data$Value)) / (max(data$Value) - min(data$Value))
Result: All values are scaled between 0 and 1.
3. Standardization (Z-Score Scaling)
Meaning
• Standardization transforms data to have mean = 0 and standard deviation = 1.
• Formula: X_std = (X - mean(X)) / SD(X)
• Useful when comparing variables with different scales or for machine learning algorithms like
SVM, KNN, or PCA.
R Example
# Z-score standardization
data$Standardized <- scale(data$Value)
# Check mean and sd
mean(data$Standardized) # Should be close to 0
sd(data$Standardized) # Should be 1
4. Summary Table
Method Purpose Formula R Syntax
Transformation Reduce skewness, improve interpretability log, sqrt, reciprocal log(x), sqrt(x), 1/x
Normalization Scale data to 0–1 (X - min) / (max - min) (x - min(x)) / (max(x) - min(x))
Standardization Scale data to mean 0, SD 1 (X - mean) / SD scale(x)
UNIT – 4
1. Meaning of Predictive Analytics
• Predictive analytics is the process of using historical data to make predictions about future
events.
• It uses statistical, machine learning, and data mining techniques to forecast trends, behaviors, or
outcomes.
• Helps businesses in decision-making, risk management, and strategy planning.
2. Purpose of Predictive Analytics
1. Forecast sales, demand, or revenue.
2. Predict customer behavior or churn.
3. Detect fraud or risk.
4. Optimize operations or resources.
5. Support marketing strategies and personalized offers.
3. Common Predictive Analytics Techniques in R
Technique Purpose R Functions / Packages
Linear Regression Predict continuous outcomes lm(), caret
Logistic Regression Predict binary outcomes glm(family="binomial"), caret
Decision Trees Predict categorical or continuous outcomes rpart(), tree()
Random Forest Improve prediction using an ensemble of trees randomForest()
k-Nearest Neighbors (kNN) Classification and regression class::knn()
Support Vector Machines Classification & regression e1071::svm()
Time Series Forecasting Predict future values forecast::auto.arima(), prophet
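A minimal end-to-end predictive workflow using the first technique in the table: fit on historical rows, predict "future" ones. The data, split, and seed below are illustrative assumptions, using base R only:

```r
# Illustrative data: fit on early rows, predict the rest
set.seed(1)
ads   <- seq(10, 100, by = 10)
sales <- 5 * ads + rnorm(10, sd = 20)
df    <- data.frame(ads, sales)

train <- df[1:7, ]    # historical data
test  <- df[8:10, ]   # "future" rows to predict

model <- lm(sales ~ ads, data = train)
pred  <- predict(model, newdata = test)          # predicted sales
rmse  <- sqrt(mean((pred - test$sales)^2))       # prediction error
```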
1. Supervised Learning
Meaning
• Supervised learning is a type of machine learning where the model is trained on labeled data,
meaning each input comes with a known output.
• The algorithm learns to predict the output from input features.
Key Features
• Requires labeled dataset.
• Goal: Predict or classify outcomes.
• Feedback is provided during training.
• Commonly used for regression and classification problems.
Examples
Problem Type | Example
Regression | Predict house prices based on area, rooms, location
Classification | Predict if an email is spam or not
R Functions / Packages
• lm() → Linear regression
• glm() → Logistic regression
• rpart() → Decision tree
• randomForest() → Random forest
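A minimal supervised workflow can be sketched with lm() on the built-in mtcars dataset (both are base R; the weight value 3 is an arbitrary example input):

```r
# Supervised learning: labeled data (mpg is the known output for every row)
model <- lm(mpg ~ wt, data = mtcars)   # learn mpg from car weight
# Predict the output for a new, unseen input
predict(model, newdata = data.frame(wt = 3))
```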
2. Unsupervised Learning
Meaning
• Unsupervised learning is a type of machine learning where the data is unlabeled, meaning no known
output exists.
• The algorithm tries to find hidden patterns, structures, or groups in the data.
Key Features
• Works with unlabeled data.
• Goal: Discover patterns or clusters.
• No explicit feedback is provided.
• Commonly used for clustering and dimensionality reduction.
Examples
Problem Type | Example
Clustering | Segment customers based on buying behavior
Dimensionality Reduction | Reduce features for visualization or model efficiency
R Functions / Packages
• kmeans() → K-means clustering
• hclust() → Hierarchical clustering
• prcomp() → Principal Component Analysis (PCA)
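A minimal unsupervised example runs prcomp() on the numeric columns of the built-in iris data (a sketch; the species labels are deliberately not used):

```r
# PCA: find structure in unlabeled data
pca <- prcomp(iris[, 1:4], scale. = TRUE)
# Proportion of variance captured by each principal component
summary(pca)$importance["Proportion of Variance", ]
```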
3. Difference Table
Aspect | Supervised Learning | Unsupervised Learning
Data | Labeled | Unlabeled
Goal | Predict or classify | Find patterns or structure
Feedback | Provided | Not provided
Output | Continuous (regression) or categorical (classification) | Groups, clusters, patterns
Examples | Linear regression, logistic regression, decision trees | K-means, hierarchical clustering, PCA
Evaluation | Accuracy, RMSE, Precision, Recall | Silhouette score, Davies–Bouldin index
1. Regression Analytics
Meaning
• Regression analytics is a statistical technique used to study the relationship between a dependent
variable (outcome) and one or more independent variables (predictors).
• It helps in predicting outcomes, understanding relationships, and making data-driven decisions.
• Widely used in business, finance, economics, and healthcare for forecasting and trend analysis.
Purpose
1. Predict future values of the dependent variable.
2. Understand the effect of one or more independent variables.
3. Identify significant predictors influencing the outcome.
2. Simple Linear Regression
Meaning
• Simple linear regression is a type of regression where one independent variable (X) predicts one
dependent variable (Y).
• The relationship is assumed to be linear.
R Example
# Sample dataset
data <- data.frame(
Advertising = c(50, 60, 65, 70, 80),
Sales = c(200, 250, 300, 350, 400)
)
# Simple linear regression model
model <- lm(Sales ~ Advertising, data = data)
# Summary of the model
summary(model)
# Predict sales for new advertising budget
predict(model, newdata = data.frame(Advertising = c(55, 75)))
Interpretation
• Coefficient (β₁) → Change in Sales per unit increase in Advertising.
• R-squared → Proportion of variance in Sales explained by Advertising.
• p-value → Significance of predictor.
3. Multiple Linear Regression
Meaning
• Multiple linear regression is a type of regression where two or more independent variables are
used to predict a single dependent variable.
• Helps understand combined effect of multiple predictors.
R Example
# Sample dataset
data$Price <- c(10, 12, 11, 13, 14)
# Multiple linear regression model
model2 <- lm(Sales ~ Advertising + Price, data = data)
# Summary of the model
summary(model2)
Interpretation
• Each coefficient (βᵢ) → Effect of that predictor while holding the others constant.
• Adjusted R-squared → Proportion of variance explained by all predictors together.
Summary Table
Type | Number of Predictors | Purpose | R Function
Simple Linear Regression | 1 | Predict outcome from a single variable | lm(Y ~ X)
Multiple Linear Regression | 2 or more | Predict outcome from multiple variables | lm(Y ~ X1 + X2 + ...)
1. Logistic Regression
Meaning
• Logistic regression is a statistical method used for binary classification problems, where the
dependent variable has two possible outcomes (e.g., Yes/No, 0/1).
• It models the probability of an event occurring using the logistic (sigmoid) function.
Key Features
• Predicts probabilities between 0 and 1.
• Can be extended to multinomial logistic regression for multiple classes.
• Assumes a linear relationship between independent variables and the log-odds of the outcome.
R Example
# Sample dataset
data <- data.frame(
Hours_Studied = c(2, 3, 5, 7, 1, 4),
Pass = c(0, 0, 1, 1, 0, 1)
)
# Logistic regression model
model_log <- glm(Pass ~ Hours_Studied, data = data, family = binomial)
# Summary
summary(model_log)
# Predict probability
predict(model_log, newdata = data.frame(Hours_Studied = c(6, 2)), type = "response")
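The probabilities returned by type = "response" come from applying the sigmoid function to the linear predictor; this base-R sketch verifies that by hand for the dataset above:

```r
# Recompute one predicted probability manually via the sigmoid
data <- data.frame(
  Hours_Studied = c(2, 3, 5, 7, 1, 4),
  Pass = c(0, 0, 1, 1, 0, 1)
)
model_log <- glm(Pass ~ Hours_Studied, data = data, family = binomial)
b <- coef(model_log)
# Sigmoid: p = 1 / (1 + exp(-(b0 + b1 * x)))
p_manual  <- 1 / (1 + exp(-(b[1] + b[2] * 6)))
p_predict <- predict(model_log, newdata = data.frame(Hours_Studied = 6), type = "response")
all.equal(unname(p_manual), unname(p_predict))  # TRUE
```

Note that this tiny dataset is perfectly separable (everyone studying 4+ hours passes), so glm() may warn that fitted probabilities are numerically 0 or 1; the sigmoid identity still holds.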
2. Decision Tree
Meaning
• Decision Tree is a tree-like model used for classification and regression.
• Splits data into branches based on feature values to reach a decision at the leaves.
• Easy to interpret and visualize.
Key Features
• Can handle categorical and numerical data.
• Works well with non-linear relationships.
• Susceptible to overfitting; often combined with ensemble methods like Random Forest.
R Example
library(rpart)
# Sample dataset
data <- data.frame(
Age = c(25, 30, 45, 35, 50),
Income = c(50000, 60000, 80000, 70000, 90000),
Purchased = c("No", "No", "Yes", "Yes", "Yes")
)
# Build decision tree
tree_model <- rpart(Purchased ~ Age + Income, data = data, method = "class")
# Plot the tree
plot(tree_model)
text(tree_model, pretty = 0)
3. K-Nearest Neighbors (KNN)
Meaning
• KNN is a distance-based classification algorithm that assigns a class to a data point based on the
majority class of its k-nearest neighbors.
• Non-parametric and simple to understand.
Key Features
• No training phase; lazy learner.
• Works best with small to medium datasets.
• Sensitive to feature scaling (normalization recommended).
R Example
library(class)
# Sample dataset
train_data <- data.frame(
X1 = c(1, 2, 3, 6, 7, 8),
X2 = c(2, 3, 4, 7, 8, 9)
)
train_labels <- c("A", "A", "A", "B", "B", "B")
test_data <- data.frame(
X1 = c(4, 5),
X2 = c(5, 6)
)
# KNN classification (k = 3)
pred <- knn(train = train_data, test = test_data, cl = train_labels, k = 3)
pred
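Because KNN is distance-based, the feature-scaling note above matters in practice. A sketch that standardizes the training features and applies the same centre and scale to the test set (the data are the same toy values as above):

```r
library(class)
train_data <- data.frame(X1 = c(1, 2, 3, 6, 7, 8), X2 = c(2, 3, 4, 7, 8, 9))
train_labels <- c("A", "A", "A", "B", "B", "B")
test_data <- data.frame(X1 = c(4, 5), X2 = c(5, 6))
# Standardize training features, then reuse the same centre/scale on the test set
train_scaled <- scale(train_data)
test_scaled  <- scale(test_data,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))
pred <- knn(train = train_scaled, test = test_scaled, cl = train_labels, k = 3)
pred
```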
Comparison Table of Classification Techniques
Technique | Type | Dependent Variable | Advantages | Limitations
Logistic Regression | Parametric | Binary / Multinomial | Gives probabilities, interpretable | Assumes a linear log-odds relationship
Decision Tree | Non-parametric | Categorical / Continuous | Easy to visualize, handles non-linearity | Prone to overfitting, sensitive to small data changes
K-Nearest Neighbors (KNN) | Non-parametric | Categorical | Simple, no training phase required | Sensitive to scaling, slow on large datasets
1. Clustering Technique
Meaning
• Clustering is an unsupervised machine learning technique used to group similar data points
together based on their characteristics.
• The main goal is to identify hidden patterns or structures in the data without prior labels.
• Widely used in customer segmentation, market analysis, and anomaly detection.
Key Features
• Works with unlabeled data.
• Groups data points into clusters such that similar points are in the same cluster and dissimilar
points are in different clusters.
• No predefined output; patterns emerge from the data itself.
2. K-Means Clustering
Meaning
• K-Means clustering is a method where data is divided into K clusters.
• Each data point is assigned to the nearest cluster center (centroid).
• Iteratively updates centroids to minimize within-cluster variance.
R Example
# Sample dataset
data <- data.frame(
X = c(1, 2, 3, 8, 9, 10),
Y = c(2, 3, 4, 7, 8, 9)
)
# K-Means clustering with 2 clusters
set.seed(123)
kmeans_model <- kmeans(data, centers = 2)
# Cluster assignment
kmeans_model$cluster
# Cluster centers
kmeans_model$centers
Visualization
library(ggplot2)
data$Cluster <- as.factor(kmeans_model$cluster)
ggplot(data, aes(X, Y, color = Cluster)) + geom_point(size = 3)
3. Hierarchical Clustering
Meaning
• Hierarchical clustering builds a tree-like structure (dendrogram) showing nested groupings of data
points.
• Does not require specifying the number of clusters beforehand.
• Can be agglomerative (bottom-up) or divisive (top-down).
R Example
# Sample dataset
data <- data.frame(
X = c(1, 2, 3, 8, 9, 10),
Y = c(2, 3, 4, 7, 8, 9)
)
# Compute distance matrix
dist_matrix <- dist(data)
# Hierarchical clustering
hc_model <- hclust(dist_matrix, method = "complete")
# Plot dendrogram
plot(hc_model, main = "Hierarchical Clustering Dendrogram")
rect.hclust(hc_model, k = 2, border = "red") # Draw 2 clusters
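Cluster memberships can also be extracted directly from the fitted tree with cutree() (base stats), without reading them off the dendrogram:

```r
# Assign each point to one of 2 clusters from the hierarchical tree
data <- data.frame(X = c(1, 2, 3, 8, 9, 10), Y = c(2, 3, 4, 7, 8, 9))
hc_model <- hclust(dist(data), method = "complete")
clusters <- cutree(hc_model, k = 2)  # cluster label for each row
clusters
```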
1. Predictive Models for Sales Forecasting
Meaning
• Sales forecasting models predict future sales or demand based on historical sales data.
• Helps businesses plan inventory, optimize marketing, and manage resources effectively.
Common Predictive Models
Model | Purpose | R Functions / Packages
Linear Regression | Predict continuous sales from predictors (advertising, price, season) | lm()
Time Series Models | Predict future sales trends over time | forecast::auto.arima(), forecast::ets(), prophet
Exponential Smoothing | Smooth past data and forecast | stats::HoltWinters(), forecast::ets()
Random Forest Regression | Handle complex nonlinear relationships | randomForest()
ARIMA (Auto-Regressive Integrated Moving Average) | Forecast based on time series patterns | forecast::auto.arima()
R Example: Linear Regression for Sales
data <- data.frame(
Advertising = c(50, 60, 65, 70, 80),
Price = c(10, 12, 11, 13, 14),
Sales = c(200, 250, 300, 350, 400)
)
# Build linear regression model
model <- lm(Sales ~ Advertising + Price, data = data)
summary(model)
# Predict sales
predict(model, newdata = data.frame(Advertising = 75, Price = 12))
R Example: Time Series Forecasting
library(forecast)
sales_ts <- ts(c(200, 250, 300, 350, 400), frequency = 1)
model_arima <- auto.arima(sales_ts)
forecast(model_arima, h = 3) # Forecast next 3 periods
2. Predictive Models for Customer Segmentation
Meaning
• Customer segmentation divides customers into homogeneous groups based on behavior,
demographics, or purchase patterns.
• Helps in targeted marketing, personalized offers, and loyalty programs.
Common Predictive Models
Model | Purpose | R Functions / Packages
K-Means Clustering | Group customers into K similar segments | kmeans()
Hierarchical Clustering | Build nested customer groupings (dendrogram) | hclust()
R Example: K-Means for Customer Segmentation
data <- data.frame(
Age = c(25, 30, 45, 35, 50),
Income = c(50000, 60000, 80000, 70000, 90000)
)
# Apply K-Means with 2 clusters
set.seed(123)
kmeans_model <- kmeans(data, centers = 2)
# Cluster assignments
kmeans_model$cluster
Visualization
library(ggplot2)
data$Cluster <- as.factor(kmeans_model$cluster)
ggplot(data, aes(Age, Income, color = Cluster)) + geom_point(size = 3)
UNIT – 5
1. Meaning of Text Mining
• Text mining (also called text data mining or text analytics) is the process of extracting useful
information, patterns, or insights from unstructured text data.
• It involves analyzing textual content from sources like documents, emails, social media posts, web
pages, and reviews.
• The goal is to convert unstructured text into structured data that can be analyzed statistically or
used in predictive models.
2. Key Features
1. Works with unstructured text data.
2. Uses techniques from Natural Language Processing (NLP), machine learning, and statistics.
3. Helps in identifying trends, sentiment, keywords, or topics in text.
4. Can be applied for classification, clustering, sentiment analysis, or recommendation systems.
Example in R
library(tm)
# Sample text data
text <- c("I love data analytics", "Text mining is useful", "R language is great for text analysis")
# Create a text corpus
corpus <- Corpus(VectorSource(text))
# Preprocessing: convert to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("en"))
# View cleaned text
inspect(corpus)
1. Text Mining Algorithms
Text mining uses various algorithms to extract meaningful information from unstructured text. The
choice of algorithm depends on the goal, such as classification, clustering, or topic extraction.
a) Bag-of-Words (BoW)
• Converts text into a matrix of word frequencies.
• Each document is represented as a vector of word counts.
• Useful for classification and clustering.
R Example:
library(tm)
text <- c("I love data analytics", "Text mining is useful")
corpus <- Corpus(VectorSource(text))
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
b) TF-IDF (Term Frequency – Inverse Document Frequency)
• Measures the importance of a word in a document relative to a corpus.
• Reduces the weight of common words like “the” and highlights important words.
R Example:
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
inspect(dtm_tfidf)
c) N-Grams
• Considers sequences of N words (e.g., bigrams = 2 words, trigrams = 3 words).
• Captures context and phrases in text.
R Example:
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_bigram <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
inspect(dtm_bigram)
d) Topic Modeling (LDA - Latent Dirichlet Allocation)
• Uncovers hidden topics in a collection of documents.
• Each document is a mixture of topics, and each topic is a mixture of words.
R Example:
library(topicmodels)
lda_model <- LDA(dtm, k = 2) # 2 topics
terms(lda_model, 5) # top 5 terms per topic
e) Text Classification / Machine Learning
• Uses ML algorithms like Naive Bayes, SVM, Random Forest, or Deep Learning.
• Goal: classify text into categories (e.g., spam detection, sentiment).
R Example (Naive Bayes):
library(e1071)
# Assuming dtm_train is training DTM and labels_train are labels
model_nb <- naiveBayes(as.matrix(dtm_train), labels_train)
predict(model_nb, newdata = as.matrix(dtm_test))
2. Sentiment Analysis
Meaning
• Sentiment Analysis (or Opinion Mining) is the process of determining the emotional tone of a text.
• Classifies text into positive, negative, or neutral sentiments.
• Useful for customer feedback, social media monitoring, and brand analysis.
Common Approaches
1. Lexicon-Based Approach: Uses a predefined dictionary of positive and negative words.
2. Machine Learning Approach: Trains models (Naive Bayes, SVM) to classify sentiment based on
labeled data.
R Example: Lexicon-Based Sentiment Analysis
library(syuzhet)
# Sample text
text <- c("I love this product", "The service is terrible", "It is okay")
# Get sentiment scores
sentiment <- get_sentiment(text, method = "bing")
sentiment
Interpretation:
• Positive values → Positive sentiment
• Negative values → Negative sentiment
• Zero → Neutral sentiment
R Example: Visualizing Sentiment
library(ggplot2)
sent_df <- data.frame(Text=text, Sentiment=sentiment)
ggplot(sent_df, aes(x=Text, y=Sentiment, fill=Sentiment>0)) +
geom_bar(stat="identity") +
labs(title="Sentiment Analysis", x="Text", y="Sentiment Score")
1. caret (Classification And Regression Training)
Purpose
• caret is a comprehensive R package used for training and evaluating machine learning models.
• Provides a unified interface to multiple ML algorithms.
• Useful for data preprocessing, model tuning, and performance evaluation.
Key Features
1. Supports over 200 machine learning algorithms.
2. Provides data splitting, cross-validation, and hyperparameter tuning.
3. Offers preprocessing functions like normalization, scaling, and missing value imputation.
R Example
library(caret)
# Sample dataset
data(iris)
set.seed(123)
# Split data into training and testing
trainIndex <- createDataPartition(iris$Species, p=0.7, list=FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
# Train a decision tree model using caret
model <- train(Species ~ ., data=trainData, method="rpart")
pred <- predict(model, testData)
confusionMatrix(pred, testData$Species)
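The preprocessing features listed above can also be used on their own through caret's preProcess(); a sketch that centres and scales the iris predictors:

```r
library(caret)
# Learn centring/scaling parameters from the data, then apply them
pp <- preProcess(iris[, 1:4], method = c("center", "scale"))
iris_scaled <- predict(pp, iris[, 1:4])
colMeans(iris_scaled)  # all approximately 0 after centring
```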
2. e1071
Purpose
• e1071 provides tools for Support Vector Machines (SVM), Naive Bayes, and other statistical
learning methods.
• Widely used for classification and regression tasks.
Key Features
1. Implements SVM with different kernels.
2. Provides Naive Bayes classifier.
3. Includes functions for clustering and statistical computations.
R Example
library(e1071)
# SVM classification
model_svm <- svm(Species ~ ., data=trainData)
pred_svm <- predict(model_svm, testData)
table(pred_svm, testData$Species)
# Naive Bayes classification
model_nb <- naiveBayes(Species ~ ., data=trainData)
pred_nb <- predict(model_nb, testData)
table(pred_nb, testData$Species)
3. xgboost
Purpose
• xgboost is a high-performance package for gradient boosting, widely used for structured/tabular
data in competitions like Kaggle.
• Boosting combines multiple weak learners to create a strong predictive model.
Key Features
1. Fast and scalable gradient boosting implementation.
2. Handles missing values internally.
3. Supports regression, classification, and ranking.
4. Offers feature importance and regularization to avoid overfitting.
R Example
library(xgboost)
# Prepare data
train_matrix <- xgb.DMatrix(data = as.matrix(trainData[, -5]), label = as.numeric(trainData$Species) - 1)
test_matrix <- xgb.DMatrix(data = as.matrix(testData[, -5]), label = as.numeric(testData$Species) - 1)
# Train XGBoost model
model_xgb <- xgboost(data = train_matrix, max.depth = 3, eta = 0.1, nrounds = 50,
                     objective = "multi:softmax", num_class = 3)
# Predict
pred_xgb <- predict(model_xgb, test_matrix)
table(pred_xgb, as.numeric(testData$Species)-1)
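Feature importance (key feature 4 above) can be inspected with xgb.importance(); a self-contained sketch on iris, assuming the classic xgboost() interface shown above:

```r
library(xgboost)
X <- as.matrix(iris[, 1:4])
y <- as.numeric(iris$Species) - 1           # classes must be 0-based for xgboost
bst <- xgboost(data = X, label = y, max.depth = 3, eta = 0.1,
               nrounds = 20, objective = "multi:softmax",
               num_class = 3, verbose = 0)
imp <- xgb.importance(model = bst)          # gain/cover/frequency per feature
imp
```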
4. randomForest
Purpose
• randomForest is used for ensemble learning, combining multiple decision trees to improve
prediction accuracy.
• Suitable for classification and regression problems.
Key Features
1. Handles large datasets with high dimensionality.
2. Provides feature importance measures.
3. Reduces overfitting compared to a single decision tree.
R Example
library(randomForest)
# Train Random Forest model
model_rf <- randomForest(Species ~ ., data=trainData, ntree=100)
pred_rf <- predict(model_rf, testData)
table(pred_rf, testData$Species)
# Feature importance
importance(model_rf)
varImpPlot(model_rf)