EXPLORATORY DATA ANALYSIS
Exploratory data analysis (EDA) is an approach to
investigating a data set and summarizing its main
characteristics, often in visual form.
Exploratory data analysis is a set of
techniques that were principally developed by the
American mathematician John Tukey in the 1970s.
It analyzes data sets using various
numerical methods and graphical tools.
VISUALIZATION TECHNIQUES
Visualization techniques are essential in
Exploratory Data Analysis (EDA); they help to
understand the data and communicate insights. Here are
some common techniques:
1. Scatter Plots
2. Pie Charts
3. Bar Charts
4. Histograms
5. Line Plots
6. Heatmaps
7. Violin Plots
8. Distribution Plots
CONFRONTING A NEW DATA SET
This refers to the initial encounter with, and examination of,
a previously unseen data collection.
Answer the basic questions to learn about the data's
origin, purpose, and any relevant background information:
Who constructed this data set, when, and
why?
E.g.: National Health and Nutrition Examination Survey
2009-2010
How big is it?
E.g.: The data set has 4978 records, each with seven data
fields.
What do the fields mean?
E.g.: The lengths and weights were measured using the
metric system, i.e., in centimetres and kilograms
respectively.
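The "how big is it?" and "what do the fields mean?" questions can be checked mechanically before any deeper analysis. The following is a minimal sketch using only the Python standard library. The CSV text and its column names are assumptions made for illustration (they mirror the small customer table used in the example at the end of these notes), and describe is a hypothetical helper, not part of any library.

```python
import csv
import io

# Hypothetical CSV text for illustration; in practice this would be read
# from a file. The column names are assumptions, not from a real data set.
RAW = """customer_id,age,gender,income,purchase_amount
1,25,Male,50000,100
2,31,Female,60000,200
3,42,Male,70000,50
"""


def describe(csv_text):
    """Return (number of records, list of field names) for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)  # consumes the reader; the header is parsed first
    return len(rows), list(reader.fieldnames)


n_records, fields = describe(RAW)
print(f"{n_records} records, each with {len(fields)} fields")
print("fields:", ", ".join(fields))
```

Knowing the record count and field names up front makes it easier to spot truncated files or unexpected columns before interpreting any values.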
LOOK FOR FAMILIAR OR INTERPRETABLE RECORDS:
This refers to the process of identifying data points,
patterns, or features that are:
1. Easily identifiable due to prior
knowledge or experience.
2. Comprehensible without extensive
additional explanation.
3. Connected to existing knowledge or
real-world concepts.
4. Clear in their interpretation
or significance.
SUMMARY STATISTICS:
Summary statistics are numerical
measures that summarize and describe the
basic features of a data set.
Common summary statistics used in EDA include:
1. Measures of central tendency:
• Mean, Median, Mode
2. Measures of variability:
• Range, Variance, Standard deviation
3. Measures of distribution:
• Skewness, Kurtosis, Quantiles
4. Counts and frequencies:
• Number of observations (n), Missing values,
Frequency distributions
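Most of the measures listed above are available in Python's standard statistics module. The sketch below computes several of them for the Age and Gender columns of the three sample records from the example table later in these notes. Three values are far too few for a meaningful summary, so the numbers only demonstrate the mechanics. The variable names are chosen for illustration.

```python
import statistics as st

# Age and Gender columns of the three sample records in the example table.
ages = [25, 31, 42]
genders = ["Male", "Female", "Male"]

summary = {
    "n": len(ages),                        # number of observations
    "mean": st.mean(ages),                 # central tendency
    "median": st.median(ages),
    "range": max(ages) - min(ages),        # variability
    "variance": st.variance(ages),         # sample variance (n - 1 denominator)
    "stdev": st.stdev(ages),               # sample standard deviation
    "quartiles": st.quantiles(ages, n=4),  # three cut points: Q1, Q2, Q3
}
most_common_gender = st.mode(genders)      # mode also works on categories

for name, value in summary.items():
    print(f"{name:>10}: {value}")
print(f"{'mode':>10}: {most_common_gender}")
```

Note that st.variance and st.stdev are the sample versions; st.pvariance and st.pstdev give the population versions.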
PAIRWISE CORRELATION
This refers to the process of calculating and
analyzing the correlation coefficients between all
possible pairs of variables in a dataset. It helps with:
1. Relationships: Discover how
variables are related to each other.
2. Interactions: Reveal how
changes in one variable are associated with
changes in others.
3. Patterns: Uncover hidden patterns
and correlations.
4. Feature selection: Inform the
selection of relevant variables for
further analysis.
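As a rough illustration of "all possible pairs", the sketch below computes the Pearson correlation coefficient for every pair of numeric columns. It assumes Pearson's r is the coefficient of interest (the notes do not name one), and pearson is a hand-written helper rather than a library function. The data are the three sample records from the example table later in these notes, which are too few to support any conclusion, so the output should not be compared with the figures quoted in that example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Numeric columns of the three sample records (illustrative only).
columns = {
    "Age": [25, 31, 42],
    "Income": [50000, 60000, 70000],
    "Purchase Amount": [100, 200, 50],
}

# One coefficient per unordered pair of columns.
pairwise = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}

for (a, b), r in pairwise.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

A coefficient near +1 or -1 indicates a strong linear association, but it does not by itself show that one variable causes the other.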
CLASS BREAKDOWNS:
Class breakdowns in Exploratory Data Analysis (EDA)
refer to the process of analyzing and summarizing the
distribution of a categorical variable, often the class
variable or target variable.
They involve:
1. Identifying unique classes: Determining the
distinct categories or groups within a categorical
variable.
2. Counting observations: Calculating the number of
observations (records) in each class.
3. Calculating frequencies: Determining the
proportion or percentage of observations in each class.
4. Visualizing distributions: Using plots like bar
charts, pie charts, or histograms to illustrate the class
breakdowns.
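The four steps above can be sketched with collections.Counter from the standard library. The data are the Gender values of the three sample records in the example table later in these notes, and the text bar at the end is only a stand-in for a real bar chart.

```python
from collections import Counter

# Gender column of the three sample records (illustrative only).
genders = ["Male", "Female", "Male"]

counts = Counter(genders)            # unique classes and their counts
total = sum(counts.values())
frequencies = {cls: n / total for cls, n in counts.items()}

# Crude text "bar chart": one '#' per observation in the class.
for cls, n in counts.most_common():
    print(f"{cls:<8}{n:>3}  {frequencies[cls]:6.1%}  {'#' * n}")
```

Checking the relative frequencies early shows whether the classes are balanced, which matters for any later modelling on that variable.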
PLOT OF DISTRIBUTIONS
Plotting distributions is a crucial step in EDA to
understand the shape, central tendency, and
variability of your data.
Common distribution plots in EDA include:
1. Histograms: Visualize frequency distributions.
2. Box Plots: Show median, quartiles, and outliers.
3. Density Plots: Smooth, continuous
representations.
4. Q-Q Plots (Quantile-Quantile): Compare
distributions.
5. Violin Plots: Combine box plots and density
plots.
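To make the idea behind a histogram concrete, the sketch below bins values into equal-width intervals and prints a text bar per bin. It uses only the standard library. The histogram helper and the list of ages are hypothetical, invented for illustration, and a plotting library would normally be used instead.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width intervals between min and max.

    Returns (edges, counts); the last interval includes the maximum.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical ages, invented for illustration (not from the example table).
ages = [22, 25, 27, 29, 31, 31, 33, 35, 38, 42, 47, 58]

edges, counts = histogram(ages, bins=4)
for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * n}")
```

In this invented sample the bars thin out toward higher ages, which is the shape described as "skewed to the right".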
Example (sample records):
| Customer ID | Age | Gender | Income | Purchase Amount |
| 1 | 25 | Male | 50000 | 100 |
| 2 | 31 | Female | 60000 | 200 |
| 3 | 42 | Male | 70000 | 50 |
EDA Steps:
1. Summary Statistics:
- Mean Age: 35
- Median Income: 55000
- Average Purchase Amount: 150
2. Data Visualization:
- Histogram of Age: skewed to the right
- Scatter plot of Income vs. Purchase Amount: positive correlation
- Bar chart of Gender vs. Purchase Amount: males spend more
3. Familiar or Interpretable Records:
- Customers with high income (>75000) tend to make
larger purchases
- Males aged 25-40 have higher purchase amounts
4. Pairwise Correlation:
- Strong correlation between Income and Purchase Amount
(0.8)
- Moderate correlation between Age and Income
(0.5)
5. Class Breakdown:
- 60% of customers are males
- 40% of customers have income above 60000