Question 1:
Exploratory Data Analysis (EDA) on the dataset: checking for missing values, computing basic
statistics, and visualizing the distributions. We start by loading the dataset and summarizing it.
The dataset consists of 100 entries with 5 numerical columns:
● Customer ID (Unique identifier)
● Age
● Annual Income
● Spending Score
● Churn (1 = Churned, 0 = Not Churned)
There are no missing values.
All columns are of integer type.
Statistics for numerical data:
● Age: Ranges from 22 to 50; mean = 34.1.
● Annual Income: Ranges from $20,000 to $100,000; mean = $53,000.
● Spending Score: Ranges from 30 to 90; mean = 64.5.
● Churn Rate: 40% of customers churned.
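The checks and statistics listed above can be sketched as follows. This is a minimal illustration on a hypothetical five-row data frame, not the actual dataset; the column names CustomerID and Churn are assumptions, while AnnualIncome and SpendingScore follow the plotting code in Step 1.

```r
# Hypothetical toy data standing in for the CSV; values are illustrative only.
toy <- data.frame(
  CustomerID    = 1:5,
  Age           = c(22L, 30L, 35L, 41L, 50L),
  AnnualIncome  = c(20000L, 45000L, 53000L, 60000L, 100000L),
  SpendingScore = c(30L, 55L, 64L, 80L, 90L),
  Churn         = c(1L, 0L, 0L, 1L, 0L)
)

# Column types and per-column missing-value counts
col_types <- sapply(toy, class)
na_counts <- colSums(is.na(toy))

# Minimum, mean, and maximum of each numeric feature
num_cols <- c("Age", "AnnualIncome", "SpendingScore")
stats <- sapply(toy[num_cols], function(x) c(min = min(x), mean = mean(x), max = max(x)))

# Churn is coded 0/1, so its mean is the share of churned customers
churn_rate <- mean(toy$Churn)

print(col_types)
print(na_counts)
print(stats)
cat(sprintf("Churn rate: %.0f%%\n", 100 * churn_rate))
```

Running the same calls on df instead of toy should reproduce the figures reported above, provided the column names match the CSV header.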
Histograms were used to visualize the distributions of Age, Annual Income, and Spending Score.
Main conclusions:
● Age: Most customers are between 25 and 45, with a slightly higher concentration in the 30-35 range.
● Annual Income: Most values are spread widely between $30,000 and $70,000.
● Spending Score: Most customers have a spending score between 50 and 80.
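Readings such as "mostly between 25 and 45" can be checked numerically rather than by eye. The sketch below uses a hypothetical helper, share_between, and made-up ages; the bins produced by cut() may not align exactly with those drawn by geom_histogram.

```r
# Hypothetical helper: fraction of values falling in the closed interval [lo, hi]
share_between <- function(x, lo, hi) mean(x >= lo & x <= hi)

# Illustrative ages only; replace with df$Age for the real data
ages <- c(22, 27, 31, 33, 34, 38, 44, 50)

# Share of customers aged 25 to 45
age_share <- share_between(ages, 25, 45)

# Counts per 5-year bin, comparable to a histogram with binwidth = 5
age_bins <- table(cut(ages, breaks = seq(20, 50, by = 5), include.lowest = TRUE))

print(age_share)
print(age_bins)
```

The same helper applies to the income and spending-score ranges, for example share_between(df$SpendingScore, 50, 80).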
Step 1: Load the Dataset and Perform EDA
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(cluster)
library(caret)
# Read the dataset
df <- read.csv("Customer_Segmentation_and_Churn.csv")
# View basic information
str(df)
summary(df)
# Check for missing values
colSums(is.na(df))
# Visualize distributions
# Column names below are assumed to match the CSV header
ggplot(df, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  ggtitle("Age Distribution")
ggplot(df, aes(x = AnnualIncome)) +
  geom_histogram(binwidth = 5000, fill = "green", color = "black") +
  ggtitle("Annual Income Distribution")
ggplot(df, aes(x = SpendingScore)) +
  geom_histogram(binwidth = 5, fill = "red", color = "black") +
  ggtitle("Spending Score Distribution")