Unit 2: Know Data Concepts
1.1. Dispersion:
1) Range: Range measures the spread of a dataset by
calculating the difference between the largest and smallest
values.
Formula:
Range = Maximum Value - Minimum Value
Example:
Dataset: 5, 10, 15, 20, 25
Range = 25 – 5 = 20
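The range calculation above can be sketched in a few lines of Python. The function name `value_range` is a hypothetical helper chosen for illustration, not part of any library.

```python
def value_range(values):
    """Return the range (maximum minus minimum) of a non-empty sequence."""
    if not values:
        raise ValueError("value_range() needs at least one value")
    return max(values) - min(values)


data = [5, 10, 15, 20, 25]
print(value_range(data))  # 20
```

Because the range depends only on the two extreme values, a single unusually large or small value changes it directly.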
2) Quartiles: Quartiles divide a dataset into four equal parts
after sorting it in ascending order:
a) Q1 (1st Quartile): The median of the lower half of the data.
b) Q2 (2nd Quartile): The median of the entire dataset.
c) Q3 (3rd Quartile): The median of the upper half of the data.
When the number of values is odd, the overall median is left out
of both halves.
Interquartile Range (IQR): Measures the spread of the
middle 50% of data, calculated as:
IQR = Q3 - Q1
Example:
Dataset: 4, 8, 15, 16, 23, 42, 50
Q1 = 8, Q2 (median) = 16, Q3 = 42
IQR = 42 – 8 = 34
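A minimal Python sketch of the median-of-halves method used in this example follows. The helper name `quartiles` is hypothetical. Other tools may use different quartile conventions (for example, interpolation between values) and can return slightly different results for the same data.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    When the number of values is odd, the overall median is
    excluded from both the lower and the upper half.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("quartiles() needs at least two values")
    half = n // 2
    lower = data[:half]
    # Skip the middle element when n is odd.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


q1, q2, q3 = quartiles([4, 8, 15, 16, 23, 42, 50])
print(q1, q2, q3)  # 8 16 42
print(q3 - q1)     # 34 (the IQR)
```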
3) Variance: Variance is the average of the squared deviations
of the data points from the mean.
Formula:
Variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
Where:
$x_i$ = individual data points,
$\bar{x}$ = mean,
$n$ = number of data points.
Example:
Dataset: 2, 4, 6
$\bar{x} = 4$,
Variance $= \frac{(2 - 4)^2 + (4 - 4)^2 + (6 - 4)^2}{3} = \frac{8}{3} \approx 2.67$.
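The variance example can be checked with a short Python sketch. `population_variance` is a hypothetical helper that divides by n, as the formula in these notes does. The standard library's `statistics.pvariance` is used as a cross-check.

```python
from statistics import pvariance


def population_variance(values):
    """Average squared deviation from the mean (divides by n)."""
    n = len(values)
    if n == 0:
        raise ValueError("population_variance() needs at least one value")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


data = [2, 4, 6]
print(round(population_variance(data), 2))  # 2.67
print(round(pvariance(data), 2))            # 2.67
```

Some tools report the sample variance instead, which divides by n - 1 and would give 8 / 2 = 4 for this dataset, so it is worth checking which definition a tool uses.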
4) Standard Deviation (SD): SD is the square root of the variance.
It indicates the typical distance of data points from the mean,
expressed in the original data units.
Formula:
SD(σ) = √Variance
Example:
Dataset: 2, 4, 6
Variance ≈ 2.67,
Standard Deviation = √2.67 ≈ 1.63.
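As a small sketch, the standard deviation of the same dataset can be computed in Python by taking the square root of the population variance. `population_sd` is a hypothetical helper name. `statistics.pstdev` from the standard library gives the same value.

```python
import math
from statistics import pstdev


def population_sd(values):
    """Square root of the population variance (divides by n)."""
    n = len(values)
    if n == 0:
        raise ValueError("population_sd() needs at least one value")
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return math.sqrt(variance)


data = [2, 4, 6]
print(round(population_sd(data), 2))  # 1.63
print(round(pstdev(data), 2))         # 1.63
```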
5) Analysing Data Using Interquartile Range (IQR): The IQR describes
the central 50% of the data and is largely unaffected by extreme
values (outliers), which makes it a more robust measure of spread
than the range.
Detecting Outliers:
A data point is an outlier if it lies below the lower bound or
above the upper bound:
Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
Example:
Dataset: 4, 8, 15, 16, 23, 42, 50
Q1 = 8, Q3 = 42,
IQR = 42 – 8 = 34.
Lower Bound = 8 - 1.5(34) = −43,
Upper Bound = 42 + 1.5(34) = 93.
All data points, including the extremes 4 and 50, lie within the
bounds, so there are no outliers.
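The 1.5 × IQR rule above can be sketched in Python as follows. The function names `iqr_bounds` and `find_outliers` and the parameter `k` are hypothetical. The quartiles are passed in directly, using the values from the example.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) outlier bounds for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Return the values lying outside the IQR-based bounds."""
    lower, upper = iqr_bounds(q1, q3, k)
    return [x for x in values if x < lower or x > upper]


data = [4, 8, 15, 16, 23, 42, 50]
print(iqr_bounds(8, 42))           # (-43.0, 93.0)
print(find_outliers(data, 8, 42))  # []
```

The multiplier 1.5 is the conventional choice used in these notes. A larger `k` flags only more extreme values.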
6) WEKA: WEKA (Waikato Environment for Knowledge
Analysis) is a machine learning tool for data preprocessing,
visualization, and applying machine learning algorithms.
Steps to Analyse Dispersion in WEKA:
a) Load Dataset: Open WEKA and load your dataset in
ARFF, CSV, or other supported formats.
b) Explore Summary Statistics:
i. Go to the Preprocess tab.
ii. Click an attribute in the Attributes panel to see its
summary statistics (Minimum, Maximum, Mean, StdDev, and
more) in the Selected attribute panel.
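c) Filter for Quartiles or IQR:
i. Use the Discretize filter (unsupervised, attribute) with
4 bins and equal-frequency binning to bin data into
quartiles. NumericToNominal only changes the attribute
type and does not bin values.
ii. Use the InterquartileRange filter to flag outliers, which
can then be removed or otherwise handled.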
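d) Visualize Data: Use the Visualize tab for scatter plots and the
histogram on the Preprocess tab to inspect dispersion and spot
outliers. The Explorer has no standard boxplot view, so export
the data to another tool if boxplots are needed.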
e) Export Results: Export pre-processed data or statistical
outputs for further analysis.
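To try the steps above on the example dataset from these notes, the data first has to be in a format WEKA can load. The following Python sketch builds a minimal ARFF document with a single numeric attribute. The helper `to_arff`, the relation name `dispersion_demo`, and the attribute name `value` are all hypothetical, and the sketch assumes names without spaces or special characters.

```python
def to_arff(relation, attribute, values):
    """Build a minimal ARFF document with one numeric attribute."""
    lines = [
        f"@relation {relation}",
        "",
        f"@attribute {attribute} numeric",
        "",
        "@data",
    ]
    # One instance per line in the data section.
    lines.extend(str(v) for v in values)
    return "\n".join(lines) + "\n"


arff_text = to_arff("dispersion_demo", "value", [4, 8, 15, 16, 23, 42, 50])
print(arff_text)
```

Saving the printed text to a file with an `.arff` extension should give a dataset that can be opened from the Preprocess tab.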