Module 5: R Programming for Data Analysis and
Visualization
5.1 Introduction to R Programming
R is an open-source programming language designed for statistical computing, data analysis, and
visualization. It provides a wide array of statistical and graphical techniques, making it a popular tool
among data analysts, statisticians, and researchers.
Key Features:
• Extensive statistical functions (mean, median, regression, etc.)
• Data manipulation and cleaning capabilities
• Powerful visualization libraries (ggplot2, lattice, base graphics)
• Support for various data formats (CSV, Excel, SQL, JSON)
• Open-source with a large community and package ecosystem
Usefulness in Analytics:
R allows analysts to import, clean, explore, and visualize data efficiently. Its statistical computing
capabilities make it ideal for hypothesis testing, predictive modeling, and advanced analytics.
5.2 Importing and Exporting Data in R
R provides functions to read and write data from multiple formats:
• Importing CSV: read.csv("file.csv"), where "file.csv" is a placeholder for the path to your file
Example:
• data <- read.csv("file.csv")
• head(data)
• Exporting CSV: write.csv(data, "file.csv")
Example:
• [Link](data, "cleaned_data.csv")
Step-by-Step Process:
1. Locate the file path or set working directory using setwd().
2. Use read.csv() to load the dataset into R.
3. Check the dataset using head() or str().
4. Perform analysis or cleaning.
5. Export processed data using write.csv().
Example Application: Importing sales data, cleaning missing values, and exporting the cleaned dataset
for visualization.
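The steps above can be sketched as a short round trip. This is a minimal sketch rather than a prescribed workflow: the sales data frame, its column names, and the temporary file paths are all made up for illustration so that the example runs without an existing file.

```r
# Hypothetical sales data, written to a temporary CSV so the example is self-contained
sales <- data.frame(
  region = c("North", "South", "East", "West"),
  amount = c(1200, NA, 950, 1430)
)
input_path <- tempfile(fileext = ".csv")
write.csv(sales, input_path, row.names = FALSE)

# Steps 1-2: load the file (a full path is used here instead of setwd())
data <- read.csv(input_path)

# Step 3: check the dataset
head(data)
str(data)

# Step 4: a simple cleaning step, dropping rows that contain NA
cleaned <- na.omit(data)

# Step 5: export; row.names = FALSE avoids writing an extra index column
output_path <- tempfile(fileext = ".csv")
write.csv(cleaned, output_path, row.names = FALSE)
```

With your own data, replace the temporary paths with the real location of your file.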
5.3 Data Types and Attributes in R
R supports multiple data types:
• Numeric: Stores numbers (e.g., 12, 45.6)
• Character: Stores text (e.g., "R Programming")
• Factor: Categorical data (e.g., "Male", "Female")
• Logical: TRUE or FALSE
Example:
age <- c(23, 25, 30) # Numeric
gender <- factor(c("M","F","M")) # Factor
name <- c("Amit","Riya","Karan") # Character
is_student <- c(TRUE,FALSE,TRUE) # Logical
Attributes: Include names, class, dimensions, and levels (for factors).
Example: class(age) returns "numeric"; levels(gender) returns "F" "M", because factor levels are sorted alphabetically by default.
Importance: Proper data typing is crucial for statistical analysis, visualizations, and function
compatibility.
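To make the attributes mentioned above concrete, the following sketch inspects a few of them. The variable names and values are illustrative only.

```r
# Illustrative vectors (values are made up)
age <- c(Amit = 23, Riya = 25, Karan = 30)   # named numeric vector
gender <- factor(c("M", "F", "M"))

class(age)       # "numeric"
names(age)       # "Amit" "Riya" "Karan"
class(gender)    # "factor"
levels(gender)   # "F" "M" (sorted alphabetically by default)

# Dimensions apply to matrices and data frames
m <- matrix(1:6, nrow = 2)
dim(m)           # 2 3

# Converting a numeric code to a factor changes how functions treat it
group <- c(1, 2, 1)
group_f <- as.factor(group)
is.numeric(group_f)   # FALSE
levels(group_f)       # "1" "2"
```

Because a factor is no longer numeric, functions such as mean() will not average it, which is usually the desired behavior for category codes.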
5.4 Basic Arithmetic Operations in R
R can perform arithmetic operations on variables and vectors:
x <- 10
y <- 5
total <- x + y # 15 (named total to avoid masking the built-in sum() function)
difference <- x - y # 5
product <- x * y # 50
quotient <- x / y # 2
power <- x^2 # 100
Explanation: Each operator (+, -, *, /, ^) performs standard mathematical calculations. R can also operate
element-wise on vectors.
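A short sketch of the element-wise behavior, using two made-up vectors of equal length:

```r
# Two illustrative vectors of equal length
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b    # 11 22 33
a * b    # 10 40 90
b / a    # 10 10 10
a^2      # 1 4 9

# A single number is recycled across every element
a * 2    # 2 4 6
```

Each operator pairs the first elements, then the second, and so on, so no explicit loop is needed.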
5.5 Descriptive Statistics in R
Descriptive statistics summarize dataset characteristics:
• Mean: mean(data$column)
• Median: median(data$column)
• Standard Deviation: sd(data$column)
• Summary: summary(data) returns the minimum, first quartile, median, mean, third quartile, and maximum of each numeric column.
Example:
scores <- c(80, 75, 90, 85, 95)
mean(scores) # 85
median(scores) # 85
sd(scores) # 7.9057
Use in Analytics: Helps understand central tendency, spread, and variability before visualization.
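As a small extension of the example, the sketch below applies summary() and a few related base R functions to the same scores vector. The choice of extra functions (range, var, quantile) is ours, for illustration.

```r
scores <- c(80, 75, 90, 85, 95)

summary(scores)          # Min 75, 1st Qu. 80, Median 85, Mean 85, 3rd Qu. 90, Max 95
range(scores)            # 75 95
var(scores)              # 62.5 (sd() is the square root of the variance)
quantile(scores, 0.25)   # 80, the first quartile
```

Comparing the mean with the median is a quick first check for skew: here both are 85, so the values are spread symmetrically around the center.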
5.6 Handling Missing Values
Dirty data (incomplete, inconsistent, or missing) affects analysis and visualization.
Techniques to Handle Missing Values in R:
• Identify missing values: is.na(data$column)
• Remove missing values: na.omit(data)
• Replace missing values: data$column[is.na(data$column)] <- mean(data$column, na.rm=TRUE)
Example:
data <- c(10, NA, 15, 20, NA)
data[is.na(data)] <- mean(data, na.rm=TRUE)
Importance: Cleaning ensures accurate statistical results and reliable visualizations.
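The three techniques can be compared side by side on a data frame. This is a sketch under assumptions: the data frame df and its columns are hypothetical, and mean imputation is only one of several possible replacement strategies.

```r
# Hypothetical data frame with gaps in one column
df <- data.frame(
  id = 1:5,
  score = c(10, NA, 15, 20, NA)
)

is.na(df$score)        # FALSE TRUE FALSE FALSE TRUE
sum(is.na(df$score))   # 2 missing values

# Option 1: drop incomplete rows
complete_rows <- na.omit(df)
nrow(complete_rows)    # 3

# Option 2: replace NA with the mean of the observed values (15 here)
imputed <- df
imputed$score[is.na(imputed$score)] <- mean(imputed$score, na.rm = TRUE)
imputed$score          # 10 15 15 20 15
```

Dropping rows keeps only observed values but shrinks the dataset, while mean replacement keeps every row at the cost of understating the variability of the column.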
5.7 Exploratory Data Analysis (EDA)
EDA involves exploring data to understand patterns, distributions, and relationships before formal
modeling.
Key Steps in EDA:
1. Inspect structure: str(data)
2. Summarize variables: summary(data)
3. Visualize distributions: histograms, boxplots, density plots
4. Explore relationships: scatter plots, correlation matrices
Example: Plotting the distribution of sales data to detect skewness or outliers.
Importance: EDA helps identify anomalies, trends, and relationships that guide further analysis and
reporting.
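The four steps above might look like this in base R. The sales data frame is invented for illustration, with one deliberately extreme value so that the skew and the outlier are visible.

```r
# Hypothetical sales data; the last units value is deliberately extreme
sales <- data.frame(
  units = c(12, 15, 14, 10, 18, 16, 13, 60),
  price = c(20, 19, 20, 22, 18, 19, 21, 15)
)

str(sales)        # 1. inspect structure
summary(sales)    # 2. summarize variables

# 3. visualize distributions
hist(sales$units, main = "Units Sold", xlab = "Units")
boxplot(sales$units, main = "Units Sold")

# 4. explore relationships
plot(sales$units, sales$price, xlab = "Units", ylab = "Price")
cor(sales)        # correlation matrix of the numeric columns

# boxplot.stats() lists the points a boxplot would draw as outliers
boxplot.stats(sales$units)$out   # 60
```

In this sample the mean of units (19.75) sits well above the median (14.5), which is the numeric counterpart of the right skew seen in the histogram.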
5.8 Visualization Techniques in R
5.8.1 Single Variable Visualization
• Histogram: Shows frequency distribution.
• hist(data$scores, main="Score Distribution", xlab="Scores", col="blue")
• Boxplot: Detects outliers and spread.
• boxplot(data$scores, main="Score Spread")
• Density Plot: Smooth estimate of data distribution.
• plot(density(data$scores), main="Density Plot")
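The calls above assume a data frame named data with a scores column. A runnable sketch with simulated scores follows; the sample size, mean, and spread are arbitrary choices for illustration.

```r
# Hypothetical scores; set.seed() makes the random sample reproducible
set.seed(42)
data <- data.frame(scores = round(rnorm(100, mean = 75, sd = 10)))

# Draw the three plots side by side on one device
old_par <- par(mfrow = c(1, 3))
hist(data$scores, main = "Score Distribution", xlab = "Scores", col = "blue")
boxplot(data$scores, main = "Score Spread")
plot(density(data$scores), main = "Density Plot")
par(old_par)   # restore the previous layout
```

Viewing the same variable three ways helps separate real features of the distribution from artifacts of one plot type, such as the bin width of a histogram.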
5.8.2 Multi-variable Visualization
• Scatter Plot: Shows relationship between two numeric variables.
• plot(data$age, data$score, main="Age vs Score", xlab="Age", ylab="Score")
• Correlation Analysis: Quantifies relationship.
• cor(data$age, data$score) # e.g., 0.85 indicates strong positive correlation
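A self-contained sketch of the scatter plot and correlation together. The data are simulated so that score rises with age; the slope, noise level, and the added trend line are illustrative assumptions, not part of the original example.

```r
# Hypothetical data: score rises with age plus some random noise
set.seed(1)
age <- 18:37
score <- 40 + 1.5 * age + rnorm(length(age), sd = 4)
data <- data.frame(age = age, score = score)

plot(data$age, data$score, main = "Age vs Score", xlab = "Age", ylab = "Score")
abline(lm(score ~ age, data = data), col = "red")   # fitted straight line

r <- cor(data$age, data$score)   # Pearson correlation, between -1 and 1
r
cor.test(data$age, data$score)$p.value   # tests whether r differs from 0
```

cor() measures only linear association, so it is worth looking at the scatter plot before interpreting the coefficient.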
Difference Between Exploration and Presentation:
• Exploration: Understand patterns, anomalies, and distributions.
• Presentation: Clean, publication-ready charts for decision-making.
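The difference can be illustrated with the same data plotted twice. This is one possible sketch: the scores, labels, colors, and the choice of a PDF file as output are all assumptions made for the example.

```r
# Made-up exam scores
scores <- c(80, 75, 90, 85, 95, 70, 88, 92)

# Exploration: a quick default plot, enough to see the shape
hist(scores)

# Presentation: labeled, styled, and saved to a file for sharing
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 6, height = 4)
hist(scores,
     main = "Distribution of Exam Scores",
     xlab = "Score",
     ylab = "Number of Students",
     col = "steelblue",
     border = "white")
dev.off()   # close the file device so the chart is written to disk
```

The exploratory plot takes one short call, while the presentation version spends its extra arguments on titles, axis labels, and styling for the reader.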
5.9 Benefits and Limitations of R for Visualization
Benefits:
• Extensive plotting libraries and customization.
• Handles large datasets efficiently, provided they fit in memory.
• Integrates seamlessly with statistical functions for analysis.
Limitations:
• Steep learning curve for beginners.
• Requires coding skills for advanced visualization.
• Rendering complex graphics can be slower with very large datasets.
5.10 Summary of Key Functions
Function     Purpose             Example
read.csv()   Import CSV file     data <- read.csv("file.csv")
write.csv()  Export CSV file     write.csv(data, "file.csv")
str()        Display structure   str(data)
summary()    Summary stats       summary(data$score)
mean()       Mean                mean(data$score)
median()     Median              median(data$score)
sd()         Standard deviation  sd(data$score)
hist()       Histogram           hist(data$score)
boxplot()    Box plot            boxplot(data$score)
plot()       Scatter plot        plot(data$age, data$score)
cor()        Correlation         cor(data$age, data$score)
Note: "file.csv" is a placeholder file name.
5.11 Quick Notes
Concept                    Explanation                                 Example
Numeric Data Type          Stores numbers                              10, 25.5
Factor Data Type           Categorical data                            "Male", "Female"
Convert Numeric to Factor  as.factor()                                 data$group <- as.factor(data$group)
Histogram                  Distribution of a single numeric variable   hist(data$score)
Boxplot                    Detect outliers                             boxplot(data$score)
Density Plot               Smoothed distribution                       plot(density(data$score))
Missing Values             NA values affecting analysis                is.na(data$score)
Handle Missing Values      Remove or replace                           data$score[is.na(data$score)] <- mean(data$score, na.rm=TRUE)
Scatter Plot               Relationship between two numeric variables  plot(data$age, data$score)
Correlation                Strength of linear relationship             cor(data$age, data$score)