R Language Notes
Meaning of R Language
R is a programming language and software environment used mainly for statistical computing, data
analysis, and graphical representation. It was developed by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand. R is widely used by statisticians, data scientists, and researchers for
analyzing data, creating visualizations, and performing machine learning and predictive modeling tasks.
It is an open-source language, meaning anyone can freely use and modify it. R provides a wide range of
built-in functions and packages that simplify complex data operations.
Features of R Language
1. Open Source and Free
R is completely free to download and use. It is open-source software, so users can modify and enhance its
code according to their needs without any licensing fees.
2. Data Handling and Storage
R offers excellent data handling capabilities. It can easily manage vectors, lists, matrices, and data frames,
making it ideal for structured and unstructured data analysis.
3. Statistical and Mathematical Functions
R provides a vast library of statistical tools such as mean, median, standard deviation, regression models,
hypothesis testing, and time-series analysis. This makes it a powerful tool for research and analytics.
4. Graphical Capabilities
R has strong visualization features. It can create a wide variety of graphs like histograms, bar charts, scatter
plots, and advanced 3D visualizations using libraries like ggplot2.
5. Extensible with Packages
Thousands of packages are available in CRAN (Comprehensive R Archive Network) that extend R’s
capabilities for specialized tasks such as data mining, machine learning, bioinformatics, and finance.
6. Cross-Platform Compatibility
R can run on different operating systems like Windows, macOS, and Linux, ensuring flexibility and
portability for users.
7. Integration with Other Languages
R can integrate with other programming languages like C, C++, Java, and Python, allowing developers to
use R alongside other tools.
8. Active Community Support
R has a large and active community of developers, researchers, and data analysts who contribute packages,
provide tutorials, and assist users through online forums and documentation.
Advantages of R Language (5 Points)
1. Open Source and Free – R is freely available to everyone, making it cost-effective for students,
researchers, and professionals.
2. Strong Statistical and Analytical Support – It offers powerful tools for statistical analysis, data
manipulation, and advanced modeling.
3. Excellent Data Visualization – R provides high-quality graphical tools to create charts, plots, and
interactive visualizations.
4. Extensive Package Collection – Thousands of packages are available in CRAN for various fields
like finance, machine learning, and bioinformatics.
5. Cross-Platform and Community Support – R works on all major operating systems and has a
large, active global community for support and updates.
Purpose of R Language
The main purpose of R Language is to perform data analysis, statistical computing, and graphical
visualization. It is designed to help researchers, data analysts, and statisticians:
• Collect, clean, and organize data efficiently.
• Analyze data using statistical and mathematical methods.
• Visualize results through graphs and charts.
• Build predictive models using machine learning algorithms.
• Support research and academic work in data-driven fields.
Step-by-Step Installation Process of R and RStudio
Step 1: Download R
1. Go to the official R website: https://cran.r-project.org
2. Select your operating system (Windows / macOS / Linux).
3. For Windows, click "Download R for Windows" → "base" → then click "Download R-x.x.x for
Windows" (the latest version).
Step 2: Install R
1. Once the file is downloaded, open it.
2. Click Next to proceed through the setup wizard.
3. Choose your installation path (default is fine).
4. Select components (keep all selected).
5. Click Next until installation completes.
6. Click Finish to close the setup window.
Step 3: Download RStudio
1. Visit https://posit.co/download/rstudio/ (Posit is the company formerly known as RStudio).
2. Click “Download RStudio Desktop” (Free version).
3. Choose your operating system and download the installer file.
Step 4: Install RStudio
1. Open the downloaded RStudio installer file.
2. Follow the on-screen instructions (Next → Install → Finish).
3. Once installed, open RStudio. It will automatically detect your R installation.
Step 5: Verify Installation
1. Open RStudio.
2. In the Console window, type the following command and press Enter:
version
This will display your installed R version details, confirming a successful installation.
Different Data Types in R Language
R supports several basic data types that define the kind of data a variable can store.
These are: Numeric, Integer, Character, Logical, Complex, and Raw.
1. Numeric Data Type
Numeric type represents decimal or real numbers. It is used for most mathematical operations.
Example Code:
# Numeric data type example
x <- 10.5
y <- 20
class(x) # Output: "numeric"
class(y) # Output: "numeric"
2. Integer Data Type
Integer type represents whole numbers without decimals.
You can specify integers by adding “L” after the number.
Example Code:
# Integer data type example
a <- 25L
b <- 100L
class(a) # Output: "integer"
class(b) # Output: "integer"
3. Character Data Type
Character type stores text or string values.
Strings are enclosed within single (‘ ’) or double (“ ”) quotes.
Example Code:
# Character data type example
name <- "Dharun"
city <- 'Chennai'
class(name) # Output: "character"
class(city) # Output: "character"
4. Logical Data Type
Logical type stores Boolean values — either TRUE or FALSE.
It is used in conditions and comparisons.
Example Code:
# Logical data type example
x <- 5 > 3
y <- 10 < 8
class(x) # Output: "logical"
class(y) # Output: "logical"
5. Complex Data Type
Complex type is used to store numbers with both real and imaginary parts.
Example Code:
# Complex data type example
z1 <- 3 + 2i
z2 <- 5 - 4i
class(z1) # Output: "complex"
class(z2) # Output: "complex"
6. Raw Data Type
Raw type is used to store data in its raw byte form.
It is useful for low-level operations like file or binary data handling.
Example Code:
# Raw data type example
r <- charToRaw("ABC")
class(r) # Output: "raw"
DATA STRUCTURES IN R LANGUAGE
A data structure in R is a way to store and organize data efficiently for analysis and computation.
R provides different types of data structures depending on how data elements are arranged and accessed.
1. VECTOR
A vector is the simplest data structure in R.
It is a sequence of elements that are of the same data type (numeric, character, or logical).
Example Code:
# Vector examples
numeric_vector <- c(10, 20, 30, 40)
character_vector <- c("R", "Language", "Learning")
logical_vector <- c(TRUE, FALSE, TRUE)
# Display class
class(numeric_vector) # Output: "numeric"
print(character_vector)
Note: All elements in a vector must be of the same type.
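If you mix types in c(), R does not raise an error — it silently coerces the whole vector to the most flexible type (character > numeric > logical). A small sketch of this standard behavior:

```r
# Mixed types are coerced to the most flexible type
mixed <- c(1, "two", TRUE)
class(mixed)     # "character": every element becomes a string
num_log <- c(1, TRUE, FALSE)
class(num_log)   # "numeric": TRUE/FALSE are converted to 1/0
```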
2. LIST
A list can hold elements of different data types such as numbers, strings, vectors, or even other lists.
Lists are useful for storing complex or mixed data.
Example Code:
# List example
my_list <- list(Name = "Dharun", Age = 22, Marks = c(80, 85, 90))
print(my_list)
# Access elements
my_list$Name
Note: Lists are flexible because they can store varied data under one object.
3. MATRIX
A matrix is a two-dimensional data structure that holds elements of the same data type arranged in rows
and columns.
Example Code:
# Matrix example
matrix_data <- matrix(1:9, nrow = 3, ncol = 3)
print(matrix_data)
# Access elements
matrix_data[2, 3] # Element in 2nd row, 3rd column
4. ARRAY
An array is similar to a matrix but can have more than two dimensions.
It can store data in multiple layers.
Example Code:
# Array example
array_data <- array(1:12, dim = c(3, 2, 2))
print(array_data)
5. DATA FRAME
A data frame is a table-like data structure where each column can contain different data types (numeric,
character, logical).
It is one of the most commonly used structures for datasets.
Example Code:
# Data frame example
student_data <- data.frame(
Name = c("Dharun", "Ravi", "Kumar"),
Age = c(22, 21, 23),
Marks = c(85, 90, 88)
)
print(student_data)
# Access specific column
student_data$Name
Note: Data frames are widely used for importing and analyzing datasets in R.
6. FACTOR
A factor is used to store categorical data such as gender, grade, or region.
It assigns integer values to represent text labels internally.
Example Code:
# Factor example
gender <- factor(c("Male", "Female", "Female", "Male"))
print(gender)
levels(gender)
Note: Factors are important for statistical modeling and categorical data analysis.
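The internal integer coding mentioned above can be inspected directly with as.integer():

```r
gender <- factor(c("Male", "Female", "Female", "Male"))
levels(gender)       # "Female" "Male" (levels are sorted alphabetically by default)
as.integer(gender)   # 2 1 1 2 - the internal codes behind each label
```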
SUMMARY TABLE
Data Structure Dimension Data Type Description Example
Vector 1D Same Sequence of elements of the same type c(1, 2, 3, 4)
List 1D Different Collection of elements of different types list("R", 10, TRUE)
Matrix 2D Same Elements arranged in rows and columns matrix(1:9, 3, 3)
Array nD Same Multi-dimensional extension of a matrix array(1:12, dim = c(3, 2, 2))
Data Frame 2D Different Table-like structure with mixed columns data.frame(Name, Age)
Factor 1D Categorical Stores categorical variables factor(c("Male", "Female"))
STEPS IN LOADING PACKAGES IN R
Meaning
• In R, a package is a collection of functions, data, and documentation that extends the capabilities of R. For example, packages like ggplot2, dplyr, and readxl are used for data visualization, data manipulation, and Excel file handling.
• Before using a package, it must be installed and then loaded into the R environment.
Steps to Load Packages in R
Step 1: Install the Package
• Before loading a package for the first time, you must install it from CRAN (Comprehensive R Archive Network) using the install.packages() function.
Example Code:
# Step 1: Installing a package
install.packages("ggplot2")
Explanation:
This command downloads and installs the package ggplot2 from CRAN into your system library.
Step 2: Load the Package into R
After installation, load the package into the current R session using the library() function.
Example Code:
# Step 2: Loading the package
library(ggplot2)
Explanation:
This command loads ggplot2 so that its functions can be used in your R program.
Step 3: Check if the Package is Loaded
You can verify whether a package is loaded using the search() or sessionInfo() functions.
Example Code:
# Step 3: Checking loaded packages
search()       # Shows currently loaded packages
sessionInfo()  # Displays R version and attached packages
Step 4: Use Functions from the Package
Once the package is loaded, you can start using its functions directly.
Example Code:
# Step 4: Using a function from ggplot2 package
data(mpg)
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
Step 5: Update or Remove a Package (Optional)
If needed, you can update or uninstall a package using these commands:
Example Code:
# Updating installed packages (checks CRAN for newer versions of all packages)
update.packages()
# To update a single package, reinstall it
install.packages("ggplot2")
# Removing a package
remove.packages("ggplot2")
Summary Table
Step Command Purpose
Step 1 install.packages("packagename") Installs the package from CRAN
Step 2 library(packagename) Loads the package into the R session
Step 3 search() or sessionInfo() Checks loaded packages
Step 4 Use package functions Perform tasks using the package
Step 5 update.packages() / remove.packages() Update or uninstall packages
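Steps 1 and 2 are often combined into an install-if-missing pattern. A minimal sketch — the helper name load_pkg is illustrative, not a standard function:

```r
# Install a package only if it is not already available, then load it
load_pkg <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

load_pkg("stats")  # "stats" ships with R, so this just loads it
</imports>
```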
DIFFERENT TYPES OF OPERATORS IN R STUDIO
Meaning:
Operators in R are symbols that tell the R interpreter to perform specific mathematical or logical
computations.
They are used to manipulate variables and values in expressions.
R supports several types of operators such as arithmetic, relational, logical, assignment, and
miscellaneous.
1. Arithmetic Operators
These operators are used to perform basic mathematical calculations like addition, subtraction,
multiplication, and division.
Operator Description Example Result
+ Addition 10 + 5 15
- Subtraction 10 - 5 5
* Multiplication 10 * 5 50
/ Division 10 / 5 2
^ or ** Exponentiation 2^3 8
%% Modulus (remainder) 10 %% 3 1
%/% Integer division 10 %/% 3 3
Example Code:
# Arithmetic Operators
x <- 10
y <- 3
add <- x + y       # 13
sub <- x - y       # 7
mul <- x * y       # 30
div <- x / y       # 3.333333
exp <- x ^ y       # 1000
mod <- x %% y      # 1
int_div <- x %/% y # 3
print(add)
print(mod)
2. Relational Operators
These operators are used to compare two values.
The result of a relational operation is always TRUE or FALSE.
Operator Description Example Result
> Greater than 5>3 TRUE
< Less than 5<3 FALSE
== Equal to 5 == 5 TRUE
!= Not equal to 5 != 3 TRUE
>= Greater than or equal to 5 >= 3 TRUE
<= Less than or equal to 5 <= 3 FALSE
Example Code:
# Relational Operators
a <- 5
b <- 3
print(a > b)   # TRUE
print(a == b)  # FALSE
print(a != b)  # TRUE
3. Logical Operators
Logical operators are used to combine or test logical (TRUE/FALSE) values.
Operator Description Example Result
& Element-wise AND TRUE & FALSE FALSE
| Element-wise OR TRUE | FALSE TRUE
! NOT operator !TRUE FALSE
&& Logical AND (for single values; short-circuits) (5 > 3) && (2 > 1) TRUE
|| Logical OR (for single values; short-circuits) (5 > 3) || (1 > 2) TRUE
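A short example of the logical operators: the element-wise forms work on whole vectors, while && and || expect single logical values.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
a & b                 # TRUE FALSE FALSE (element-wise AND)
a | b                 # TRUE TRUE TRUE  (element-wise OR)
!a                    # FALSE TRUE FALSE
(5 > 3) && (2 > 1)    # TRUE (scalar AND)
(5 > 3) || (1 > 2)    # TRUE (scalar OR)
```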
4. Assignment Operators
Assignment operators are used to assign values to variables.
Operator Description Example
<- Leftward assignment x <- 10
-> Rightward assignment 10 -> x
= Assignment (also used for function arguments) x = 20
<<- Assignment in an enclosing (parent) scope x <<- 30
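A minimal sketch of the assignment forms:

```r
x <- 10            # leftward assignment (most common style)
20 -> y            # rightward assignment
z = 30             # equals sign also assigns at the top level
print(c(x, y, z))  # 10 20 30
```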
SUMMARY TABLE
Type of Operator Purpose Example
Arithmetic Operators Perform mathematical operations 10 + 5, 10 %% 3
Relational Operators Compare values 5 > 3, 5 == 5
Logical Operators Combine or test conditions TRUE & FALSE, !TRUE
Assignment Operators Assign values to variables x <- 10, x = 20
Miscellaneous Operators Special tasks like sequence or matrix operations 1:5, %in%, %*%
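The miscellaneous operators from the last row of the table can be sketched as follows:

```r
seq_vals <- 1:5            # colon operator: sequence 1 2 3 4 5
3 %in% seq_vals            # TRUE - membership test
m <- matrix(1:4, nrow = 2)
m %*% m                    # matrix multiplication (not element-wise)
```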
MAIN FUNCTIONS IN R STUDIO
Meaning
Functions in R are predefined sets of instructions that perform specific tasks such as mathematical
operations, data manipulation, or analysis.
A function in R usually has the format:
function_name(arguments)
For example:
sum(10, 20)
R also allows users to create their own custom functions using the function() keyword.
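A minimal sketch of a custom function (the name square is illustrative):

```r
# User-defined function: returns the square of its argument
square <- function(x) {
  x^2
}
square(4)  # 16
```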
1. Mathematical Functions
These functions perform mathematical and arithmetic operations on numeric data.
Common Mathematical Functions:
Function Description Example Code Result
sum() Adds all values sum(10, 20, 30) 60
mean() Calculates average mean(c(10, 20, 30)) 20
max() Returns maximum value max(c(5, 10, 15)) 15
min() Returns minimum value min(c(5, 10, 15)) 5
sqrt() Square root sqrt(16) 4
abs() Absolute value abs(-10) 10
2. Statistical Functions
These functions are used to perform basic statistical analysis.
Function Description Example Code Result
median() Finds the middle value median(c(2, 4, 6, 8, 10)) 6
sd() Standard deviation (sample, n − 1) sd(c(2, 4, 6, 8, 10)) 3.162
var() Variance (sample, n − 1) var(c(2, 4, 6, 8, 10)) 10
range() Minimum and maximum values range(c(1, 3, 5, 7)) 1 7
quantile() Returns quantiles quantile(c(1, 2, 3, 4, 5)) 1 2 3 4 5 (at 0%, 25%, 50%, 75%, 100%)
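The table values can be reproduced in the console. Note that R's sd() and var() use the sample formulas, dividing by n − 1 rather than n:

```r
v <- c(2, 4, 6, 8, 10)
median(v)    # 6
var(v)       # 10 (sample variance: n - 1 in the denominator)
sd(v)        # 3.162278 (square root of the variance)
range(v)     # 2 10
quantile(v)  # 2 4 6 8 10 at 0%, 25%, 50%, 75%, 100%
```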
INTERACTIVE PLOTS WITH PLOTLY
The plotly package creates interactive charts that support hovering, zooming, and panning.
# Load package
library(plotly)
Basic Usage
a) Interactive Scatter Plot
data <- data.frame(
x = c(1, 2, 3, 4, 5),
y = c(10, 15, 13, 17, 20)
)
plot_ly(data, x = ~x, y = ~y, type = 'scatter', mode = 'markers') %>%
layout(title = "Interactive Scatter Plot",
xaxis = list(title = "X Values"),
yaxis = list(title = "Y Values"))
b) Interactive Bar Plot
plot_ly(data, x = ~x, y = ~y, type = 'bar', name = 'Values') %>%
layout(title = "Interactive Bar Plot")
c) Converting ggplot2 to Interactive Plotly
library(ggplot2)
library(plotly)
p <- ggplot(data, aes(x = x, y = y)) + geom_line()
ggplotly(p) # Converts ggplot2 to interactive plot
Exporting Plotly Plots
# Save plotly plot as HTML
library(htmlwidgets)
p <- plot_ly(data, x = ~x, y = ~y, type = 'scatter', mode = 'lines')
saveWidget(p, "interactive_plot.html", selfcontained = TRUE)
DATA VISUALIZATION: CHARTS, GRAPHS, AND MAPS IN R STUDIO
Meaning
• Data visualization in R Studio refers to representing data graphically to identify patterns, trends, and insights.
• It helps communicate information clearly and supports decision-making.
• Charts and graphs display numerical and categorical data, while maps visualize geographical data.
1. Charts in R Studio
Meaning
Charts are graphical representations of data, often used for categorical comparisons.
Types and Examples
a) Bar Chart
library(ggplot2)
data <- data.frame(
Category = c("A", "B", "C"),
Value = c(10, 20, 15)
)
ggplot(data, aes(x = Category, y = Value, fill = Category)) +
geom_bar(stat = "identity") +
ggtitle("Bar Chart Example")
b) Pie Chart
# Pie chart using base R
values <- c(10, 20, 15)
labels <- c("A", "B", "C")
pie(values, labels = labels, col = rainbow(length(values)), main = "Pie Chart Example")
c) Line Chart
ggplot(data, aes(x = Category, y = Value, group = 1)) +
geom_line(color = "blue") +
geom_point(color = "red") +
ggtitle("Line Chart Example")
2. Graphs in R Studio
Meaning
Graphs are used to show relationships between variables, often continuous or numeric data.
Types and Examples
a) Scatter Plot
data <- data.frame(x = c(1,2,3,4,5), y = c(5,7,6,8,9))
ggplot(data, aes(x = x, y = y)) +
geom_point(color = "darkgreen", size = 3) +
ggtitle("Scatter Plot Example")
b) Histogram
data <- data.frame(values = c(5,7,6,8,9,5,7,6,8,9))
ggplot(data, aes(x = values)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
ggtitle("Histogram Example")
c) Boxplot
data <- data.frame(
Category = rep(c("A","B","C"), each = 5),
Value = c(5,6,7,5,6,7,8,6,7,8,9,8,7,8,9)
)
ggplot(data, aes(x = Category, y = Value, fill = Category)) +
geom_boxplot() +
ggtitle("Boxplot Example")
3. Maps in R Studio
Meaning
Maps are used to visualize geographical data, such as locations, regions, or spatial patterns.
Packages
• ggplot2 + maps or mapdata for static maps
• leaflet for interactive maps
a) Static Map Example (ggplot2 + maps)
library(ggplot2)
library(maps)
world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "lightblue", color = "black") +
ggtitle("World Map Example")
b) Interactive Map Example (leaflet)
library(leaflet)
leaflet() %>%
addTiles() %>%
addMarkers(lng = 77.2090, lat = 28.6139, popup = "New Delhi")
Explanation:
• addTiles() adds the base map.
• addMarkers() places interactive markers with popups.
Summary Table
Type Purpose Example Packages / Functions
Charts Compare categories ggplot2: geom_bar(), geom_line(), pie()
Graphs Show relationships ggplot2: geom_point(), geom_histogram(), geom_boxplot()
Maps Visualize spatial data maps + ggplot2, leaflet for interactive maps
Scenario: Retail Sales Analysis for a Store
Business Context
A retail store wants to analyze its monthly sales performance to understand:
1. Which products are performing well.
2. Which months have high or low sales.
3. The distribution of sales across different regions.
The store has a dataset containing:
Month Product Units_Sold Revenue Region
ASSUMPTIONS OF STATISTICAL TESTS
Assumption Description When Required
Normality Data should follow a normal (bell-shaped) distribution Required for t-tests, ANOVA, regression
Homogeneity of Variance Variances across groups should be equal Required for ANOVA, regression
Independence Observations should be independent of each other Required for most tests to avoid bias
Linearity Relationship between independent and dependent variable should be linear Required for correlation and regression
No Multicollinearity Independent variables should not be highly correlated Required for multiple regression
Random Sampling Data should be randomly selected Ensures generalizability of results
How to Test Assumptions in R
1. Normality Test
• Shapiro-Wilk Test
shapiro.test(data$Variable)
• Interpretation:
o p-value > 0.05 → data is normally distributed
o p-value < 0.05 → data is not normally distributed
• Visual Check:
hist(data$Variable)
qqnorm(data$Variable)
qqline(data$Variable)
2. Homogeneity of Variance
• Levene’s Test (from car package)
library(car)
leveneTest(Variable ~ Group, data = data)
• p-value > 0.05 → equal variances
• p-value < 0.05 → variances are unequal
3. Linearity
• Scatter Plot
plot(data$X, data$Y)
abline(lm(Y ~ X, data = data), col="red")
• The plot should show a roughly straight-line relationship.
4. Multicollinearity
• Variance Inflation Factor (VIF)
library(car)
model <- lm(Y ~ X1 + X2 + X3, data = data)
vif(model)
• VIF > 10 indicates high multicollinearity.
1. Parametric Tests
Meaning
• Parametric tests are statistical tests that make assumptions about the population parameters
(e.g., mean, variance) and the underlying distribution of the data.
• Typically, they assume that the data is normally distributed and have homogeneity of variance.
• They are generally more powerful than non-parametric tests if assumptions are satisfied.
Common Parametric Tests in R Studio
Test Purpose R Function Example
t-test Compare means of two groups t.test(x ~ group, data = data)
ANOVA (Analysis of Variance) Compare means of more than two groups aov(Y ~ Group, data = data)
Pearson Correlation Measure linear relationship between two variables cor.test(x, y, method = "pearson")
Linear Regression Model relationship between dependent and independent variables lm(Y ~ X, data = data)
2. Correlation in R
Meaning
• Correlation measures the strength and direction of a linear relationship between two numeric
variables.
• Values range from -1 to +1:
o +1 → perfect positive correlation
o -1 → perfect negative correlation
o 0 → no correlation
R Example
# Example dataset
data <- data.frame(
Sales = c(200, 250, 300, 350, 400),
Advertising = c(50, 60, 65, 70, 80)
)
# Pearson correlation
cor.test(data$Sales, data$Advertising, method = "pearson")
3. Regression in R
Meaning
• Regression is used to model the relationship between a dependent variable (Y) and one or more
independent variables (X).
• Simple Linear Regression: One independent variable
• Multiple Linear Regression: Two or more independent variables
Simple Linear Regression Example
# Linear regression model
model <- lm(Sales ~ Advertising, data = data)
# View summary of the model
summary(model)
Interpretation:
• Coefficients → effect of independent variable on dependent variable
• R-squared → proportion of variance explained by the model
• p-value → significance of the predictor
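Once fitted, the model can generate predictions for new observations with predict(). A sketch reusing the Sales/Advertising data frame defined above (the new budget values 55 and 75 are illustrative):

```r
data <- data.frame(
  Sales = c(200, 250, 300, 350, 400),
  Advertising = c(50, 60, 65, 70, 80)
)
model <- lm(Sales ~ Advertising, data = data)
# Predicted Sales for new Advertising budgets of 55 and 75
predict(model, newdata = data.frame(Advertising = c(55, 75)))
```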
Multiple Linear Regression Example
data$Price <- c(10, 12, 11, 13, 14)
# Multiple regression
model2 <- lm(Sales ~ Advertising + Price, data = data)
summary(model2)
Summary Table
Test / Method Purpose R Function
t-test Compare means of two groups t.test()
ANOVA Compare means of more than two groups aov()
Pearson Correlation Measure linear relationship cor.test(method="pearson")
Simple Linear Regression Model dependent ~ independent lm(Y ~ X)
Multiple Linear Regression Model dependent ~ multiple independents lm(Y ~ X1 + X2 + ...)
1. Independent Sample t-test
Purpose:
Compare the means of two independent groups to see if they are significantly different.
Example Dataset
# Sample data
data <- data.frame(
Group = rep(c("A", "B"), each = 5),
Score = c(85, 88, 90, 87, 86, 78, 80, 82, 79, 81)
)
# View data
data
t-test in R
t.test(Score ~ Group, data = data)
Interpretation:
• p-value < 0.05 → Significant difference between Group A and B
• p-value > 0.05 → No significant difference
2. One-Way ANOVA
Purpose:
Compare means of more than two groups to test if at least one group mean is different.
Example Dataset
# Sample data
data_anova <- data.frame(
Group = rep(c("A", "B", "C"), each = 5),
Score = c(85, 88, 90, 87, 86, 78, 80, 82, 79, 81, 92, 94, 91, 93, 95)
)
# View data
data_anova
ANOVA in R
anova_result <- aov(Score ~ Group, data = data_anova)
summary(anova_result)
Interpretation:
• p-value < 0.05 → At least one group mean is significantly different
• p-value > 0.05 → No significant difference among groups
3. Pearson Correlation
Purpose:
Measure the linear relationship between two numeric variables.
Example Dataset
# Sample data
data_corr <- data.frame(
Sales = c(200, 250, 300, 350, 400),
Advertising = c(50, 60, 65, 70, 80)
)
# View data
data_corr
Correlation in R
cor.test(data_corr$Sales, data_corr$Advertising, method = "pearson")
Interpretation:
• Correlation coefficient (r) indicates strength and direction:
o Positive → both increase together
o Negative → one increases, other decreases
• p-value < 0.05 → Significant correlation
Summary Table
Test Purpose R Function
t-test Compare means of 2 groups t.test()
ANOVA Compare means of 3+ groups aov()
Pearson Correlation Measure linear relationship cor.test(method="pearson")
1. Linear Regression
Meaning
• Linear regression is used to model the relationship between a dependent variable (Y) and one or
more independent variables (X), assuming a linear relationship.
• It predicts a continuous outcome based on predictor variables.
Assumptions
1. Linearity: Y and X have a linear relationship.
2. Independence of errors.
3. Homoscedasticity: Constant variance of residuals.
4. Normality: Residuals are normally distributed.
Example in R (Simple Linear Regression)
# Sample dataset
data <- data.frame(
Advertising = c(50, 60, 65, 70, 80),
Sales = c(200, 250, 300, 350, 400)
)
# Linear regression model
model <- lm(Sales ~ Advertising, data = data)
# View summary
summary(model)
Interpretation:
• Coefficients → Effect of Advertising on Sales
• R-squared → Proportion of variance explained
• p-value → Significance of predictor
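The assumptions listed above can be checked on the fitted model's residuals — a sketch using the same data:

```r
data <- data.frame(
  Advertising = c(50, 60, 65, 70, 80),
  Sales = c(200, 250, 300, 350, 400)
)
model <- lm(Sales ~ Advertising, data = data)
res <- residuals(model)
shapiro.test(res)         # normality of residuals
plot(fitted(model), res)  # homoscedasticity: look for constant spread around zero
abline(h = 0, col = "red")
```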
Multiple Linear Regression
data$Price <- c(10, 12, 11, 13, 14)
# Multiple regression
model2 <- lm(Sales ~ Advertising + Price, data = data)
summary(model2)
• Predicts Sales based on multiple predictors (Advertising & Price).
2. Logistic Regression
Meaning
• Logistic regression is used to model the relationship between a binary categorical dependent
variable (Y) and independent variables (X).
• Outcome is 0 or 1 (e.g., Yes/No, Success/Failure).
• Predicts probability of occurrence using the logistic function.
Assumptions
1. Dependent variable is binary.
2. Independent variables can be continuous or categorical.
3. No multicollinearity among predictors.
4. Observations are independent.
Example in R (Binary Logistic Regression)
# Sample dataset
data <- data.frame(
Hours_Studied = c(2, 3, 5, 7, 1, 4),
Pass = c(0, 0, 1, 1, 0, 1) # 0=Fail, 1=Pass
)
# Logistic regression model
model_log <- glm(Pass ~ Hours_Studied, data = data, family = binomial)
# View summary
summary(model_log)
Prediction
# Predict probability of passing
predict(model_log, newdata = data.frame(Hours_Studied = c(6, 2)), type = "response")
• Output: Probability of passing for given hours studied.
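Predicted probabilities are commonly converted to class labels with a cutoff (0.5 here, though the choice is context-dependent). Note that this tiny dataset is perfectly separated, so glm() may warn that fitted probabilities are numerically 0 or 1:

```r
data <- data.frame(
  Hours_Studied = c(2, 3, 5, 7, 1, 4),
  Pass = c(0, 0, 1, 1, 0, 1)
)
model_log <- glm(Pass ~ Hours_Studied, data = data, family = binomial)
probs <- predict(model_log,
                 newdata = data.frame(Hours_Studied = c(6, 2)),
                 type = "response")
# Convert probabilities to class labels at a 0.5 cutoff
ifelse(probs > 0.5, "Pass", "Fail")
```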
1. Meaning of Data Cleaning and Preprocessing
Data Cleaning
• Data cleaning is the process of identifying and correcting errors, inconsistencies, or missing
values in the dataset.
• Ensures accuracy, completeness, and reliability of data before analysis.
Data Preprocessing
• Data preprocessing involves transforming raw data into a structured and analyzable format.
• Steps include handling missing values, removing duplicates, standardizing data, and converting
data types.
• Essential for improving the quality of analysis and machine learning models.
2. Common Steps in Data Cleaning & Preprocessing in R
Step 1: Inspecting the Data
# Load dataset
data <- read.csv("data.csv")
# View first few rows
head(data)
# Structure of dataset
str(data)
# Summary statistics
summary(data)
Step 2: Handling Missing Values
# Identify missing values
is.na(data)
# Remove rows with NA
data_clean <- na.omit(data)
# Replace missing values with mean (numeric column)
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
# Replace missing values with median
data$Salary[is.na(data$Salary)] <- median(data$Salary, na.rm = TRUE)
Step 3: Removing Duplicates
# Find duplicates
duplicated(data)
# Remove duplicate rows
data <- data[!duplicated(data), ]
Step 4: Correcting Data Types
# Convert to factor
data$Gender <- as.factor(data$Gender)
# Convert to numeric
data$Income <- as.numeric(data$Income)
# Convert to date
data$Date <- as.Date(data$Date, format="%Y-%m-%d")
Step 5: Renaming Columns
# Rename columns
colnames(data) <- c("ID", "Name", "Age", "Salary", "Gender")
Step 6: Handling Outliers
# Identify outliers using boxplot
boxplot(data$Salary)
# Replace outliers with median
data$Salary[data$Salary > 100000] <- median(data$Salary)
Step 7: Standardizing / Scaling Data
# Min-Max Normalization
data$Age <- (data$Age - min(data$Age)) / (max(data$Age) - min(data$Age))
# Z-score Standardization
data$Salary <- scale(data$Salary)
Step 8: String Cleaning (Removing Whitespaces, Special Characters)
library(stringr)
# Trim whitespaces
data$Name <- str_trim(data$Name)
# Remove special characters
data$Name <- str_replace_all(data$Name, "[^[:alnum:]]", "")
Step 9: Aggregation / Grouping (Optional)
library(dplyr)
# Group by Gender and summarize average Salary
data_summary <- data %>%
group_by(Gender) %>%
summarise(Average_Salary = mean(Salary, na.rm = TRUE))
Common Packages Used
• dplyr → Data manipulation, grouping, summarization
• tidyr → Reshaping and cleaning data (gather, spread)
• stringr → String manipulation and cleaning
• lubridate → Handling date and time
• janitor → Cleaning column names and data frames
Summary Table of Common Syntax
Task R Syntax / Function Purpose
Missing values is.na(), na.omit(), replace() Detect and handle NA
Duplicates duplicated(), !duplicated() Remove duplicate rows
Data types as.factor(), as.numeric(), as.Date() Correct column types
Outliers boxplot(), conditional replacement Identify and correct extreme values
Scaling scale(), normalization formula Standardize numeric data
String cleaning str_trim(), str_replace_all() Clean textual data
Aggregation group_by() %>% summarise() Summarize data by groups
1. Handling Missing Data
Meaning
• Missing data occurs when some values are not recorded or unavailable in the dataset.
• Handling missing data is crucial because it can bias results or affect model accuracy.
Common Methods to Handle Missing Data
a) Identify Missing Values
# Check for missing values in dataset
is.na(data)
# Count missing values per column
colSums(is.na(data))
b) Remove Missing Values
# Remove rows with any NA
data_clean <- na.omit(data)
c) Replace Missing Values
• Replace with Mean (numeric data)
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
• Replace with Median
data$Salary[is.na(data$Salary)] <- median(data$Salary, na.rm = TRUE)
• Replace with Mode (categorical data)
mode_value <- names(sort(table(data$Gender), decreasing=TRUE))[1]
data$Gender[is.na(data$Gender)] <- mode_value
d) Advanced Imputation (Optional)
library(mice)
# Multiple imputation for missing values
imputed_data <- mice(data, m=5, method='pmm', seed=123)
data_complete <- complete(imputed_data)
2. Outlier Detection
Meaning
• Outliers are data points significantly different from other observations.
• Can skew results and affect model performance.
Common Methods to Detect Outliers
a) Boxplot Method
# Visual detection
boxplot(data$Salary, main="Boxplot for Salary")
# Identify outlier values
outliers <- boxplot.stats(data$Salary)$out
outliers
b) Z-Score Method
# Calculate Z-scores
z_scores <- scale(data$Salary)
# Identify outliers (absolute Z-score > 3)
outliers <- data$Salary[abs(z_scores) > 3]
outliers
c) IQR (Interquartile Range) Method
Q1 <- quantile(data$Salary, 0.25)
Q3 <- quantile(data$Salary, 0.75)
IQR <- Q3 - Q1
# Detect outliers
outliers <- data$Salary[data$Salary < (Q1 - 1.5*IQR) | data$Salary > (Q3 + 1.5*IQR)]
outliers
Handling Outliers
1. Remove Outliers
data_no_outliers <- data[!(data$Salary %in% outliers), ]
2. Replace Outliers with Median
median_value <- median(data$Salary)
data$Salary[data$Salary %in% outliers] <- median_value
3. Transform Data
• Apply logarithmic or square root transformation to reduce impact of extreme values:
data$Salary_log <- log(data$Salary)
Summary Table
Task | Method | R Syntax / Function
Detect missing values | Identify NA | is.na(), colSums(is.na())
Handle missing values | Remove | na.omit()
Handle missing values | Replace with mean/median/mode | data$col[is.na(data$col)] <- mean/median/mode
Detect outliers | Boxplot | boxplot(), boxplot.stats()
Detect outliers | Z-score | scale(), abs(z_score) > 3
Detect outliers | IQR method | quantile(), conditional filtering
Handle outliers | Remove or replace | data[!(data$col %in% outliers), ]
1. Data Transformation
Meaning
• Data transformation is the process of converting data into a suitable format for analysis.
• Helps in improving interpretability, reducing skewness, and preparing data for modeling.
• Common transformations: log, square root, reciprocal, power, or scaling.
R Examples
# Sample dataset
data <- data.frame(Value = c(10, 50, 200, 500, 1000))
# Log transformation
data$Log_Value <- log(data$Value)
# Square root transformation
data$Sqrt_Value <- sqrt(data$Value)
# Reciprocal transformation
data$Reciprocal <- 1 / data$Value
2. Normalization (Min-Max Scaling)
Meaning
• Normalization scales data to a fixed range, usually 0 to 1.
• Formula: X_norm = (X - X_min) / (X_max - X_min)
• Useful when variables have different units or scales, especially in machine learning.
R Example
# Min-Max normalization
data$Normalized <- (data$Value - min(data$Value)) / (max(data$Value) - min(data$Value))
Result: All values are scaled between 0 and 1.
3. Standardization (Z-Score Scaling)
Meaning
• Standardization transforms data to have mean = 0 and standard deviation = 1.
• Formula: X_std = (X - mean(X)) / SD
• Useful when comparing variables with different scales or for machine learning algorithms like
SVM, KNN, or PCA.
R Example
# Z-score standardization
data$Standardized <- scale(data$Value)
# Check mean and sd
mean(data$Standardized) # Should be close to 0
sd(data$Standardized) # Should be 1
4. Summary Table
Method | Purpose | Formula | R Syntax
Transformation | Reduce skewness, improve interpretability | log, sqrt, reciprocal | log(x), sqrt(x), 1/x
Normalization | Scale data to 0–1 | (X - min) / (max - min) | (x - min(x)) / (max(x) - min(x))
Standardization | Scale data to mean 0, SD 1 | (X - mean) / SD | scale(x)
UNIT – 4
1. Meaning of Predictive Analytics
• Predictive analytics is the process of using historical data to make predictions about future
events.
• It uses statistical, machine learning, and data mining techniques to forecast trends, behaviors, or
outcomes.
• Helps businesses in decision-making, risk management, and strategy planning.
2. Purpose of Predictive Analytics
1. Forecast sales, demand, or revenue.
2. Predict customer behavior or churn.
3. Detect fraud or risk.
4. Optimize operations or resources.
5. Support marketing strategies and personalized offers.
3. Common Predictive Analytics Techniques in R
Technique | Purpose | R Functions / Packages
Linear Regression | Predict continuous outcomes | lm(), caret
Logistic Regression | Predict binary outcomes | glm(family = "binomial"), caret
Decision Trees | Predict categorical or continuous outcomes | rpart(), tree()
Random Forest | Improve prediction using ensembles | randomForest()
k-Nearest Neighbors (kNN) | Classification and regression | class::knn()
Support Vector Machines | Classification and regression | e1071::svm()
Time Series Forecasting | Predict future values | forecast::auto.arima(), prophet
1. Supervised Learning
Meaning
• Supervised learning is a type of machine learning where the model is trained on labeled data,
meaning each input comes with a known output.
• The algorithm learns to predict the output from input features.
Key Features
• Requires labeled dataset.
• Goal: Predict or classify outcomes.
• Feedback is provided during training.
• Commonly used for regression and classification problems.
Examples: R Functions / Packages
• lm() → Linear regression
• glm() → Logistic regression
• rpart() → Decision tree
• randomForest() → Random forest
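These functions all follow the same supervised workflow: fit on labeled data, then predict for new inputs. A minimal sketch with lm() (toy data; the Hours/Score columns are made up for illustration):

```r
# Labeled training data: every input (Hours) comes with a known output (Score)
train <- data.frame(
  Hours = c(1, 2, 3, 4, 5),
  Score = c(52, 58, 65, 71, 78)
)

# The model learns the input-output mapping from the labeled examples
fit <- lm(Score ~ Hours, data = train)

# Predict the output for a new, unseen input
predict(fit, newdata = data.frame(Hours = 6))
```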
2. Unsupervised Learning
Meaning
• Unsupervised learning is a type of machine learning where the data is unlabeled, meaning no known
output exists.
• The algorithm tries to find hidden patterns, structures, or groups in the data.
Key Features
• Works with unlabeled data.
• Goal: Discover patterns or clusters.
• No explicit feedback is provided.
• Commonly used for clustering and dimensionality reduction.
Examples: R Functions / Packages
• kmeans() → K-means clustering
• hclust() → Hierarchical clustering
• prcomp() → Principal Component Analysis (PCA)
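PCA is listed above but not demonstrated elsewhere in these notes, so here is a minimal sketch (toy data; the X1/X2 column names are made up for illustration):

```r
# Unlabeled data: two measurements, no outcome column
points <- data.frame(
  X1 = c(2, 4, 6, 8, 10),
  X2 = c(1, 3, 5, 7, 9)
)

# PCA finds the directions of maximum variance -- no labels are needed
pca <- prcomp(points, scale. = TRUE)

summary(pca)      # proportion of variance explained by each component
pca$rotation      # loadings: contribution of each variable to each PC
```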
3. Difference Table
Aspect | Supervised Learning | Unsupervised Learning
Data | Labeled | Unlabeled
Goal | Predict or classify | Find patterns or structure
Feedback | Provided | Not provided
Output | Continuous (regression) or categorical (classification) | Groups, clusters, patterns
Examples | Linear regression, logistic regression, decision trees | K-means, hierarchical clustering, PCA
Evaluation | Accuracy, RMSE, Precision, Recall | Silhouette score, Davies–Bouldin index
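The evaluation row can be made concrete with RMSE, one of the supervised metrics named above. A minimal sketch (toy numbers chosen so the arithmetic is easy to check by hand):

```r
# RMSE: root mean squared error between actual and predicted values
actual    <- c(200, 250, 300)
predicted <- c(210, 240, 310)

rmse <- sqrt(mean((actual - predicted)^2))
rmse  # 10
```

Lower RMSE means the predictions are closer to the actual values, in the same units as the outcome variable.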
1. Regression Analytics
Meaning
• Regression analytics is a statistical technique used to study the relationship between a dependent
variable (outcome) and one or more independent variables (predictors).
• It helps in predicting outcomes, understanding relationships, and making data-driven decisions.
• Widely used in business, finance, economics, and healthcare for forecasting and trend analysis.
Purpose
1. Predict future values of the dependent variable.
2. Understand the effect of one or more independent variables.
3. Identify significant predictors influencing the outcome.
2. Simple Linear Regression
Meaning
• Simple linear regression is a type of regression where one independent variable (X) predicts one
dependent variable (Y).
• The relationship is assumed to be linear.
R Example
# Sample dataset
data <- data.frame(
Advertising = c(50, 60, 65, 70, 80),
Sales = c(200, 250, 300, 350, 400)
)
# Simple linear regression model
model <- lm(Sales ~ Advertising, data = data)
# Summary of the model
summary(model)
# Predict sales for new advertising budget
predict(model, newdata = data.frame(Advertising = c(55, 75)))
Interpretation
• Coefficient (β₁) → Change in Sales per unit increase in Advertising.
• R-squared → Proportion of variance in Sales explained by Advertising.
• p-value → Significance of predictor.
3. Multiple Linear Regression
Meaning
• Multiple linear regression is a type of regression where two or more independent variables are
used to predict a single dependent variable.
• Helps understand combined effect of multiple predictors.
R Example
# Sample dataset
data$Price <- c(10, 12, 11, 13, 14)
# Multiple linear regression model
model2 <- lm(Sales ~ Advertising + Price, data = data)
# Summary of the model
summary(model2)
Interpretation
• Each coefficient (βᵢ) → Effect of that predictor while holding others constant.
• Adjusted R-squared → Proportion of variance explained by all predictors together.
Summary Table
Type | Number of Predictors | Purpose | R Function
Simple Linear Regression | 1 | Predict outcome from a single variable | lm(Y ~ X)
Multiple Linear Regression | 2 or more | Predict outcome from multiple variables | lm(Y ~ X1 + X2 + ...)
1. Logistic Regression
Meaning
• Logistic regression is a statistical method used for binary classification problems, where the
dependent variable has two possible outcomes (e.g., Yes/No, 0/1).
• It models the probability of an event occurring using the logistic (sigmoid) function.
Key Features
• Predicts probabilities between 0 and 1.
• Can be extended to multinomial logistic regression for multiple classes.
• Assumes a linear relationship between independent variables and the log-odds of the outcome.
R Example
# Sample dataset
data <- data.frame(
Hours_Studied = c(2, 3, 5, 7, 1, 4),
Pass = c(0, 0, 1, 1, 0, 1)
)
# Logistic regression model
model_log <- glm(Pass ~ Hours_Studied, data = data, family = binomial)
# Summary
summary(model_log)
# Predict probability
predict(model_log, newdata = data.frame(Hours_Studied = c(6, 2)), type = "response")
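The logistic (sigmoid) function behind this model can be sketched directly; it is what converts the linear predictor into a probability:

```r
# Logistic (sigmoid) function: maps any real number to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)    # 0.5 -- the decision boundary
sigmoid(4)    # close to 1
sigmoid(-4)   # close to 0
```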
2. Decision Tree
Meaning
• Decision Tree is a tree-like model used for classification and regression.
• Splits data into branches based on feature values to reach a decision at the leaves.
• Easy to interpret and visualize.
Key Features
• Can handle categorical and numerical data.
• Works well with non-linear relationships.
• Susceptible to overfitting; often combined with ensemble methods like Random Forest.
R Example
library(rpart)
# Sample dataset
data <- data.frame(
Age = c(25, 30, 45, 35, 50),
Income = c(50000, 60000, 80000, 70000, 90000),
Purchased = c("No", "No", "Yes", "Yes", "Yes")
)
# Build decision tree
tree_model <- rpart(Purchased ~ Age + Income, data = data, method = "class")
# Plot the tree
plot(tree_model)
text(tree_model, pretty = 0)
3. K-Nearest Neighbors (KNN)
Meaning
• KNN is a distance-based classification algorithm that assigns a class to a data point based on the
majority class of its k-nearest neighbors.
• Non-parametric and simple to understand.
Key Features
• No training phase; lazy learner.
• Works best with small to medium datasets.
• Sensitive to feature scaling (normalization recommended).
R Example
library(class)
# Sample dataset
train_data <- data.frame(
X1 = c(1, 2, 3, 6, 7, 8),
X2 = c(2, 3, 4, 7, 8, 9)
)
train_labels <- c("A", "A", "A", "B", "B", "B")
test_data <- data.frame(
X1 = c(4, 5),
X2 = c(5, 6)
)
# KNN classification (k = 3)
pred <- knn(train = train_data, test = test_data, cl = train_labels, k = 3)
pred
Comparison Table of Classification Techniques
Technique | Type | Dependent Variable | Advantages | Limitations
Logistic Regression | Parametric | Binary / Multinomial | Outputs probabilities, interpretable | Assumes linear log-odds relationship
Decision Tree | Non-parametric | Categorical / Continuous | Easy to visualize, handles non-linearity | Prone to overfitting, sensitive to small data changes
K-Nearest Neighbors (KNN) | Non-parametric | Categorical | Simple, no training required | Sensitive to scaling, slow on large datasets
1. Clustering Technique
Meaning
• Clustering is an unsupervised machine learning technique used to group similar data points
together based on their characteristics.
• The main goal is to identify hidden patterns or structures in the data without prior labels.
• Widely used in customer segmentation, market analysis, and anomaly detection.
Key Features
• Works with unlabeled data.
• Groups data points into clusters such that similar points are in the same cluster and dissimilar
points are in different clusters.
• No predefined output; patterns emerge from the data itself.
2. K-Means Clustering
Meaning
• K-Means clustering is a method where data is divided into K clusters.
• Each data point is assigned to the nearest cluster center (centroid).
• Iteratively updates centroids to minimize within-cluster variance.
R Example
# Sample dataset
data <- data.frame(
X = c(1, 2, 3, 8, 9, 10),
Y = c(2, 3, 4, 7, 8, 9)
)
# K-Means clustering with 2 clusters
set.seed(123)
kmeans_model <- kmeans(data, centers = 2)
# Cluster assignment
kmeans_model$cluster
# Cluster centers
kmeans_model$centers
Visualization
library(ggplot2)
data$Cluster <- as.factor(kmeans_model$cluster)
ggplot(data, aes(X, Y, color = Cluster)) + geom_point(size = 3)
3. Hierarchical Clustering
Meaning
• Hierarchical clustering builds a tree-like structure (dendrogram) showing nested groupings of data
points.
• Does not require specifying the number of clusters beforehand.
• Can be agglomerative (bottom-up) or divisive (top-down).
R Example
# Sample dataset
data <- data.frame(
X = c(1, 2, 3, 8, 9, 10),
Y = c(2, 3, 4, 7, 8, 9)
)
# Compute distance matrix
dist_matrix <- dist(data)
# Hierarchical clustering
hc_model <- hclust(dist_matrix, method = "complete")
# Plot dendrogram
plot(hc_model, main = "Hierarchical Clustering Dendrogram")
rect.hclust(hc_model, k = 2, border = "red") # Draw 2 clusters
1. Predictive Models for Sales Forecasting
Meaning
• Sales forecasting models predict future sales or demand based on historical sales data.
• Helps businesses plan inventory, optimize marketing, and manage resources effectively.
Common Predictive Models
Model | Purpose | R Functions / Packages
Linear Regression | Predict continuous sales from predictors (advertising, price, season) | lm()
Time Series Models | Predict future sales trends over time | forecast::auto.arima(), forecast::ets(), prophet
Exponential Smoothing | Smooth past data and forecast | stats::HoltWinters(), forecast::ets()
Random Forest Regression | Handle complex nonlinear relationships | randomForest()
ARIMA (Auto-Regressive Integrated Moving Average) | Forecast based on time-series patterns | forecast::auto.arima()
R Example: Linear Regression for Sales
data <- data.frame(
Advertising = c(50, 60, 65, 70, 80),
Price = c(10, 12, 11, 13, 14),
Sales = c(200, 250, 300, 350, 400)
)
# Build linear regression model
model <- lm(Sales ~ Advertising + Price, data = data)
summary(model)
# Predict sales
predict(model, newdata = data.frame(Advertising = 75, Price = 12))
R Example: Time Series Forecasting
library(forecast)
sales_ts <- ts(c(200, 250, 300, 350, 400), frequency = 1)
model_arima <- auto.arima(sales_ts)
forecast(model_arima, h = 3) # Forecast next 3 periods
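The exponential-smoothing row of the table above can be sketched the same way (assumes the forecast package is installed; the sales figures are the same toy series):

```r
library(forecast)

# Same toy sales series as above
sales_ts <- ts(c(200, 250, 300, 350, 400), frequency = 1)

# Exponential smoothing state-space model; ets() picks the form automatically
model_ets <- ets(sales_ts)

# Forecast the next 3 periods
forecast(model_ets, h = 3)
```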
2. Predictive Models for Customer Segmentation
Meaning
• Customer segmentation divides customers into homogeneous groups based on behavior,
demographics, or purchase patterns.
• Helps in targeted marketing, personalized offers, and loyalty programs.
Common Predictive Models
R Example: K-Means for Customer Segmentation
data <- data.frame(
Age = c(25, 30, 45, 35, 50),
Income = c(50000, 60000, 80000, 70000, 90000)
)
# Apply K-Means with 2 clusters
set.seed(123)
kmeans_model <- kmeans(data, centers = 2)
# Cluster assignments
kmeans_model$cluster
Visualization
library(ggplot2)
data$Cluster <- as.factor(kmeans_model$cluster)
ggplot(data, aes(Age, Income, color = Cluster)) + geom_point(size = 3)
UNIT – 5
1. Meaning of Text Mining
• Text mining (also called text data mining or text analytics) is the process of extracting useful
information, patterns, or insights from unstructured text data.
• It involves analyzing textual content from sources like documents, emails, social media posts, web
pages, and reviews.
• The goal is to convert unstructured text into structured data that can be analyzed statistically or
used in predictive models.
2. Key Features
1. Works with unstructured text data.
2. Uses techniques from Natural Language Processing (NLP), machine learning, and statistics.
3. Helps in identifying trends, sentiment, keywords, or topics in text.
4. Can be applied for classification, clustering, sentiment analysis, or recommendation systems.
Example in R
library(tm)
# Sample text data
text <- c("I love data analytics", "Text mining is useful", "R language is great for text analysis")
# Create a text corpus
corpus <- Corpus(VectorSource(text))
# Preprocessing: convert to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("en"))
# View cleaned text
inspect(corpus)
1. Text Mining Algorithms
Text mining uses various algorithms to extract meaningful information from unstructured text. The
choice of algorithm depends on the goal, such as classification, clustering, or topic extraction.
a) Bag-of-Words (BoW)
• Converts text into a matrix of word frequencies.
• Each document is represented as a vector of word counts.
• Useful for classification and clustering.
R Example:
library(tm)
text <- c("I love data analytics", "Text mining is useful")
corpus <- Corpus(VectorSource(text))
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
b) TF-IDF (Term Frequency – Inverse Document Frequency)
• Measures the importance of a word in a document relative to a corpus.
• Reduces the weight of common words like “the” and highlights important words.
R Example:
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
inspect(dtm_tfidf)
c) N-Grams
• Considers sequences of N words (e.g., bigrams = 2 words, trigrams = 3 words).
• Captures context and phrases in text.
R Example:
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_bigram <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
inspect(dtm_bigram)
d) Topic Modeling (LDA - Latent Dirichlet Allocation)
• Uncovers hidden topics in a collection of documents.
• Each document is a mixture of topics, and each topic is a mixture of words.
R Example:
library(topicmodels)
lda_model <- LDA(dtm, k = 2) # 2 topics
terms(lda_model, 5) # top 5 terms per topic
e) Text Classification / Machine Learning
• Uses ML algorithms like Naive Bayes, SVM, Random Forest, or Deep Learning.
• Goal: classify text into categories (e.g., spam detection, sentiment).
R Example (Naive Bayes):
library(e1071)
# Assuming dtm_train is training DTM and labels_train are labels
model_nb <- naiveBayes(as.matrix(dtm_train), labels_train)
predict(model_nb, newdata = as.matrix(dtm_test))
2. Sentiment Analysis
Meaning
• Sentiment Analysis (or Opinion Mining) is the process of determining the emotional tone of a text.
• Classifies text into positive, negative, or neutral sentiments.
• Useful for customer feedback, social media monitoring, and brand analysis.
Common Approaches
1. Lexicon-Based Approach: Uses a predefined dictionary of positive and negative words.
2. Machine Learning Approach: Trains models (Naive Bayes, SVM) to classify sentiment based on
labeled data.
R Example: Lexicon-Based Sentiment Analysis
library(syuzhet)
# Sample text
text <- c("I love this product", "The service is terrible", "It is okay")
# Get sentiment scores
sentiment <- get_sentiment(text, method = "bing")
sentiment
Interpretation:
• Positive values → Positive sentiment
• Negative values → Negative sentiment
• Zero → Neutral sentiment
R Example: Visualizing Sentiment
library(ggplot2)
sent_df <- data.frame(Text=text, Sentiment=sentiment)
ggplot(sent_df, aes(x=Text, y=Sentiment, fill=Sentiment>0)) +
geom_bar(stat="identity") +
labs(title="Sentiment Analysis", x="Text", y="Sentiment Score")
1. caret (Classification And Regression Training)
Purpose
• caret is a comprehensive R package used for training and evaluating machine learning models.
• Provides a unified interface to multiple ML algorithms.
• Useful for data preprocessing, model tuning, and performance evaluation.
Key Features
1. Supports over 200 machine learning algorithms.
2. Provides data splitting, cross-validation, and hyperparameter tuning.
3. Offers preprocessing functions like normalization, scaling, and missing value imputation.
R Example
library(caret)
# Sample dataset
data(iris)
set.seed(123)
# Split data into training and testing
trainIndex <- createDataPartition(iris$Species, p=0.7, list=FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]
# Train a decision tree model using caret
model <- train(Species ~ ., data=trainData, method="rpart")
pred <- predict(model, testData)
confusionMatrix(pred, testData$Species)
2. e1071
Purpose
• e1071 provides tools for Support Vector Machines (SVM), Naive Bayes, and other statistical
learning methods.
• Widely used for classification and regression tasks.
Key Features
1. Implements SVM with different kernels.
2. Provides Naive Bayes classifier.
3. Includes functions for clustering and statistical computations.
R Example
library(e1071)
# SVM classification
model_svm <- svm(Species ~ ., data=trainData)
pred_svm <- predict(model_svm, testData)
table(pred_svm, testData$Species)
# Naive Bayes classification
model_nb <- naiveBayes(Species ~ ., data=trainData)
pred_nb <- predict(model_nb, testData)
table(pred_nb, testData$Species)
3. xgboost
Purpose
• xgboost is a high-performance package for gradient boosting, widely used for structured/tabular
data in competitions like Kaggle.
• Boosting combines multiple weak learners to create a strong predictive model.
Key Features
1. Fast and scalable gradient boosting implementation.
2. Handles missing values internally.
3. Supports regression, classification, and ranking.
4. Offers feature importance and regularization to avoid overfitting.
R Example
library(xgboost)
# Prepare data
train_matrix <- xgb.DMatrix(data = as.matrix(trainData[, -5]), label = as.numeric(trainData$Species) - 1)
test_matrix <- xgb.DMatrix(data = as.matrix(testData[, -5]), label = as.numeric(testData$Species) - 1)
# Train XGBoost model
model_xgb <- xgboost(data = train_matrix, max_depth = 3, eta = 0.1, nrounds = 50, objective = "multi:softmax", num_class = 3)
# Predict
pred_xgb <- predict(model_xgb, test_matrix)
table(pred_xgb, as.numeric(testData$Species)-1)
4. randomForest
Purpose
• randomForest is used for ensemble learning, combining multiple decision trees to improve
prediction accuracy.
• Suitable for classification and regression problems.
Key Features
1. Handles large datasets with high dimensionality.
2. Provides feature importance measures.
3. Reduces overfitting compared to a single decision tree.
R Example
library(randomForest)
# Train Random Forest model
model_rf <- randomForest(Species ~ ., data=trainData, ntree=100)
pred_rf <- predict(model_rf, testData)
table(pred_rf, testData$Species)
# Feature importance
importance(model_rf)
varImpPlot(model_rf)