Module 5

Uploaded by

goaltracker38

Module 5: R Programming for Data Analysis and Visualization

5.1 Introduction to R Programming

R is an open-source programming language designed for statistical computing, data analysis, and
visualization. It provides a wide array of statistical and graphical techniques, making it a popular tool
among data analysts, statisticians, and researchers.

Key Features:

• Extensive statistical functions (mean, median, regression, etc.)

• Data manipulation and cleaning capabilities

• Powerful visualization libraries (ggplot2, lattice, base graphics)

• Support for various data sources (CSV, Excel, SQL databases, JSON), some through add-on packages

• Open-source with a large community and package ecosystem

Usefulness in Analytics:
R allows analysts to import, clean, explore, and visualize data efficiently. Its statistical computing
capabilities make it ideal for hypothesis testing, predictive modeling, and advanced analytics.

5.2 Importing and Exporting Data in R

R provides functions to read and write data from multiple formats:

• Importing CSV: read.csv("[Link]")

Example:

data <- read.csv("[Link]")

head(data)

• Exporting CSV: write.csv(data, "[Link]")

Example:

write.csv(data, "cleaned_data.csv")

Step-by-Step Process:

1. Locate the file path or set the working directory using setwd().

2. Use read.csv() to load the dataset into R.

3. Check the dataset using head() or str().

4. Perform analysis or cleaning.

5. Export processed data using write.csv().

Example Application: Importing sales data, cleaning missing values, and exporting the cleaned dataset
for visualization.
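
The steps above can be sketched end to end as follows. This is a minimal sketch, not a definitive workflow: the sales figures, the column names (region, units), and the temporary file paths are invented for illustration, and the input file is created inside the script only so that the example runs on its own.

```r
# Hypothetical sales data, written to a temporary CSV so the sketch is self-contained
raw_path <- tempfile(fileext = ".csv")
writeLines(c("region,units",
             "North,120",
             "South,NA",
             "East,95",
             "West,NA"), raw_path)

# Steps 1-2: load the dataset (a full path is used here instead of setwd())
sales <- read.csv(raw_path)

# Step 3: check the first rows and the structure
head(sales)
str(sales)

# Step 4: clean by dropping rows that contain missing values
cleaned <- na.omit(sales)

# Step 5: export; row.names = FALSE avoids writing an extra index column
clean_path <- tempfile(fileext = ".csv")
write.csv(cleaned, clean_path, row.names = FALSE)

# Read the exported file back to confirm what was written
check <- read.csv(clean_path)
print(check)
```

Dropping incomplete rows is only one possible cleaning choice; replacing missing values is another.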

5.3 Data Types and Attributes in R

R supports multiple data types:

• Numeric: Stores numbers (e.g., 12, 45.6)

• Character: Stores text (e.g., "R Programming")

• Factor: Categorical data (e.g., "Male", "Female")

• Logical: TRUE or FALSE

Example:

age <- c(23, 25, 30) # Numeric

gender <- factor(c("M","F","M")) # Factor

name <- c("Amit","Riya","Karan") # Character

is_student <- c(TRUE,FALSE,TRUE) # Logical

Attributes: Include names, class, dimensions, and levels (for factors).

Example: class(age) returns "numeric"; levels(gender) returns "F", "M", because factor levels are sorted alphabetically by default.

Importance: Proper data typing is crucial for statistical analysis, visualizations, and function
compatibility.
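
To illustrate why typing matters, the sketch below inspects the four example vectors and shows a numeric calculation that only works after conversion. The vector raw and the data frame students are hypothetical names introduced here for illustration.

```r
age        <- c(23, 25, 30)
gender     <- factor(c("M", "F", "M"))
name       <- c("Amit", "Riya", "Karan")
is_student <- c(TRUE, FALSE, TRUE)

# class() reports the type of each vector
class(age)         # "numeric"
class(gender)      # "factor"
class(name)        # "character"
class(is_student)  # "logical"

# Factor attributes: levels and class
levels(gender)     # "F" "M"
attributes(gender)

# Combining the vectors gives a data frame with a dim attribute
students <- data.frame(name, age, gender, is_student)
dim(students)      # 3 4
str(students)

# Numbers stored as text must be converted before arithmetic (hypothetical input)
raw <- c("12", "45.6")
class(raw)             # "character"
mean(as.numeric(raw))  # 28.8
```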

5.4 Basic Arithmetic Operations in R

R can perform arithmetic operations on variables and vectors:

x <- 10

y <- 5

total <- x + y # 15 (named total to avoid reusing the name of the built-in sum() function)

difference <- x - y # 5

product <- x * y # 50

quotient <- x / y # 2

power <- x^2 # 100

Explanation: Each operator (+, -, *, /, ^) performs standard mathematical calculations. R can also operate
element-wise on vectors.
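
A short sketch of element-wise behaviour, using two made-up vectors a and b (names chosen here for illustration):

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# Each operator is applied position by position
a + b  # 11 22 33
a * b  # 10 40 90
b / a  # 10 10 10
a^2    # 1 4 9

# A single number is recycled across the whole vector
a * 2  # 2 4 6
```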

5.5 Descriptive Statistics in R

Descriptive statistics summarize dataset characteristics:

• Mean: mean(data$column)

• Median: median(data$column)

• Standard Deviation: sd(data$column)

• Summary: summary(data) returns the minimum, first quartile, median, mean, third quartile, and maximum of each numeric column.

Example:

scores <- c(80, 75, 90, 85, 95)

mean(scores) # 85

median(scores) # 85

sd(scores) # 7.9057

Use in Analytics: Helps understand central tendency, spread, and variability before visualization.
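
As a sketch of how these measures fit together, the same scores vector can be passed to summary() and to a few related base R functions (var(), IQR(), and range() are not listed above but belong to the same family). The hours column is an invented second variable, added only to show a per-column calculation.

```r
scores <- c(80, 75, 90, 85, 95)

# Six-number summary of a single vector
summary(scores)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#      75      80      85      85      90      95

range(scores)  # 75 95
var(scores)    # 62.5 (the square of the standard deviation)
IQR(scores)    # 10   (third quartile minus first quartile)

# Per-column means of a small data frame (hours is hypothetical)
df <- data.frame(score = scores, hours = c(6, 5, 8, 7, 9))
sapply(df, mean)  # score 85, hours 7
```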

5.6 Handling Missing Values

Dirty data (incomplete, inconsistent, or missing) affects analysis and visualization.

Techniques to Handle Missing Values in R:

• Identify missing values: is.na(data$column)

• Remove missing values: na.omit(data)

• Replace missing values: data$column[is.na(data$column)] <- mean(data$column, na.rm=TRUE)

Example:

data <- c(10, NA, 15, 20, NA)

data[is.na(data)] <- mean(data, na.rm=TRUE)

Importance: Cleaning ensures accurate statistical results and reliable visualizations.
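
The three techniques can be compared side by side on a small data frame. This is a sketch under stated assumptions: the data frame df and its columns id and score are invented for illustration.

```r
# Hypothetical data frame with two missing scores
df <- data.frame(
  id    = 1:5,
  score = c(10, NA, 15, 20, NA)
)

# Identify: which values are missing, and how many per column
is.na(df$score)       # FALSE TRUE FALSE FALSE TRUE
sum(is.na(df$score))  # 2
colSums(is.na(df))    # id 0, score 2

# Option 1: remove every row that contains an NA
complete <- na.omit(df)
nrow(complete)        # 3

# Option 2: replace NAs with the mean of the observed values
imputed <- df
imputed$score[is.na(imputed$score)] <- mean(imputed$score, na.rm = TRUE)
imputed$score         # 10 15 15 20 15

# Mean imputation keeps the mean but shrinks the spread
sd(complete$score)    # 5
sd(imputed$score)     # about 3.54
```

Removal loses rows, while mean replacement keeps them at the cost of understating variability, so the choice depends on how much data is missing and what the analysis needs.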


5.7 Exploratory Data Analysis (EDA)

EDA involves exploring data to understand patterns, distributions, and relationships before formal
modeling.

Key Steps in EDA:

1. Inspect structure: str(data)

2. Summarize variables: summary(data)

3. Visualize distributions: histograms, boxplots, density plots

4. Explore relationships: scatter plots, correlation matrices

Example: Plotting the distribution of sales data to detect skewness or outliers.

Importance: EDA helps identify anomalies, trends, and relationships that guide further analysis and
reporting.
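
The four EDA steps might look like this on a small dataset. It is a minimal sketch: the sales data frame, its column names, and all values are invented, with one deliberately extreme month so that skewness and an outlier show up. Plots are sent to a temporary PDF so the code also runs outside an interactive session.

```r
# Hypothetical monthly sales with one unusually high month
sales <- data.frame(
  month    = 1:8,
  units    = c(12, 15, 14, 13, 16, 15, 14, 60),
  ad_spend = c(1.0, 1.4, 1.2, 1.1, 1.6, 1.5, 1.3, 5.0)
)

# 1. Inspect structure
str(sales)

# 2. Summarize variables
summary(sales)

# A mean well above the median hints at right skew
mean(sales$units)    # 19.875
median(sales$units)  # 14.5

# 3. Visualize distributions
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
hist(sales$units, main = "Units Sold", xlab = "Units")
boxplot(sales$units, main = "Units Sold")

# Values the boxplot rule flags as outliers
boxplot.stats(sales$units)$out  # 60

# 4. Explore relationships
plot(sales$ad_spend, sales$units, xlab = "Ad spend", ylab = "Units")
cor(sales[, c("units", "ad_spend")])  # correlation matrix
invisible(dev.off())
```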

5.8 Visualization Techniques in R

5.8.1 Single Variable Visualization

• Histogram: Shows frequency distribution.

hist(data$scores, main="Score Distribution", xlab="Scores", col="blue")

• Boxplot: Detects outliers and spread.

boxplot(data$scores, main="Score Spread")

• Density Plot: Smooth estimate of data distribution.

plot(density(data$scores), main="Density Plot")

5.8.2 Multi-variable Visualization

• Scatter Plot: Shows the relationship between two numeric variables.

plot(data$age, data$score, main="Age vs Score", xlab="Age", ylab="Score")

• Correlation Analysis: Quantifies the strength and direction of a linear relationship, on a scale from -1 to 1.

cor(data$age, data$score) # e.g., 0.85 indicates strong positive correlation
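
A runnable version of the scatter plot and correlation, assuming a small invented data frame called study (the name and the values are hypothetical). The plot is written to a temporary PDF so the sketch also works non-interactively.

```r
# Hypothetical ages and scores
study <- data.frame(
  age   = c(18, 20, 22, 24, 26),
  score = c(60, 65, 72, 74, 80)
)

# Scatter plot of the two variables
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(study$age, study$score,
     main = "Age vs Score", xlab = "Age", ylab = "Score")
invisible(dev.off())

# Pearson correlation between the two columns
r <- cor(study$age, study$score)
r           # about 0.99, a strong positive linear relationship

# Correlation matrix for all numeric columns at once
cor(study)
```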

Difference Between Exploration and Presentation:

• Exploration: Understand patterns, anomalies, and distributions.

• Presentation: Clean, publication-ready charts for decision-making.
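
The difference can be sketched with the same data plotted twice: once with defaults for a quick look, and once with a title, axis labels, fixed bins, and a reference line for sharing. The scores and all styling choices are assumptions made for illustration, and both plots go to a temporary PDF.

```r
# Hypothetical exam scores
scores <- c(80, 75, 90, 85, 95, 70, 88, 92, 78, 85)

out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 6, height = 4)

# Exploration: default settings are enough to see the shape
hist(scores)

# Presentation: explicit bins, labels, and colours for readers
hist(scores,
     breaks = seq(70, 95, by = 5),
     main   = "Distribution of Exam Scores",
     xlab   = "Score",
     ylab   = "Number of students",
     col    = "steelblue",
     border = "white")
abline(v = mean(scores), lty = 2)  # dashed line at the mean

invisible(dev.off())
```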


5.9 Benefits and Limitations of R for Visualization

Benefits:

• Extensive plotting libraries and customization.

• Handles moderately large datasets efficiently, as long as they fit in memory.

• Integrates seamlessly with statistical functions for analysis.

Limitations:

• Steep learning curve for beginners.

• Requires coding skills for advanced visualization.

• Rendering complex graphics can be slower with very large datasets.

5.10 Summary of Key Functions

Function      Purpose              Example
read.csv()    Import CSV file      data <- read.csv("[Link]")
write.csv()   Export CSV file      write.csv(data, "[Link]")
str()         Display structure    str(data)
summary()     Summary stats        summary(data$score)
mean()        Mean                 mean(data$score)
median()      Median               median(data$score)
sd()          Standard deviation   sd(data$score)
hist()        Histogram            hist(data$score)
boxplot()     Box plot             boxplot(data$score)
plot()        Scatter plot         plot(data$age, data$score)
cor()         Correlation          cor(data$age, data$score)


5.11 Quick Notes

Concept                     Explanation                                   Example
Numeric Data Type           Stores numbers                                10, 25.5
Factor Data Type            Categorical data                              "Male", "Female"
Convert Numeric to Factor   as.factor()                                   data$group <- as.factor(data$group)
Histogram                   Distribution of a single numeric variable     hist(data$score)
Boxplot                     Detect outliers                               boxplot(data$score)
Density Plot                Smoothed distribution                         plot(density(data$score))
Missing Values              NA values affecting analysis                  is.na(data$score)
Handle Missing Values       Remove or replace                             data$score[is.na(data$score)] <- mean(data$score, na.rm=TRUE)
Scatter Plot                Relationship between two numeric variables    plot(data$age, data$score)
Correlation                 Strength of linear relationship               cor(data$age, data$score)
