
Data Dispersion Concepts Guide

Unit 2 covers data concepts related to dispersion, including range, quartiles, variance, and standard deviation, with formulas and examples provided. It emphasizes the importance of the interquartile range (IQR) for robust data analysis and outlier detection. Additionally, it introduces WEKA as a tool for data preprocessing and visualization, outlining steps for analyzing dispersion using the software.

Uploaded by

sakshiiiur9255
Copyright
© All Rights Reserved

Unit 2: Know Data Concepts

1.1. Dispersion:
1) Range: Range measures the spread of a dataset by
calculating the difference between the largest and smallest
values.

Formula:

Range = Maximum Value - Minimum Value

Example:

Dataset: 5, 10, 15, 20, 25

Range = 25 – 5 = 20
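
The range formula can be checked with a short Python sketch. The function name `value_range` is an illustrative choice, not part of the original notes.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    if not values:
        raise ValueError("values must not be empty")
    return max(values) - min(values)


data = [5, 10, 15, 20, 25]
print(value_range(data))  # 20
```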

2) Quartiles: Quartiles are the three values that divide a dataset,
sorted in ascending order, into four equal parts:
a) Q1 (1st Quartile): The median of the lower half of data.
b) Q2 (2nd Quartile): The median of the entire dataset.
c) Q3 (3rd Quartile): The median of the upper half of data.

Interquartile Range (IQR): Measures the spread of the
middle 50% of data, calculated as:

IQR = Q3 - Q1

Example:

Dataset: 4, 8, 15, 16, 23, 42, 50

Q1 = 8, Q2 (median) = 16, Q3 = 42

IQR = 42 – 8 = 34
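
A minimal Python sketch of the median-of-halves rule used in this example is shown here. It assumes that, for an odd number of values, the overall median is left out of both halves; other conventions (for instance, interpolation-based percentiles) can give slightly different quartiles. The helper name `quartiles` is hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    With an odd number of values, the overall median is excluded
    from both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [4, 8, 15, 16, 23, 42, 50]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # 8 16 42
print(q3 - q1)     # 34
```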

3) Variance: Variance is the average of the squared deviations
of the data points from the mean. The formula below divides by n,
so it gives the population variance.

Formula:
$$\text{Variance}(\sigma^2) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$

Where:

$x_i$ = individual data points,

$\bar{x}$ = mean,

n = number of data points.

Example:

Dataset: 2, 4, 6

$\bar{x} = 4$,

$$\text{Variance} = \frac{(2 - 4)^2 + (4 - 4)^2 + (6 - 4)^2}{3} = \frac{8}{3}$$

Variance ≈ 2.67.
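
The same calculation can be sketched in Python. The function `population_variance` is an illustrative name; the standard library's `statistics.pvariance` is used only as a cross-check.

```python
from statistics import pvariance


def population_variance(values):
    """Average of squared deviations from the mean (divides by n)."""
    if not values:
        raise ValueError("values must not be empty")
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


data = [2, 4, 6]
print(round(population_variance(data), 2))  # 2.67
print(round(pvariance(data), 2))            # 2.67
```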

4) Standard Deviation (SD): SD is the square root of variance.
It represents the typical distance of data points from the mean,
expressed in the original data units.

Formula:

SD(σ) = √Variance

Example:

Dataset: 2, 4, 6

Variance = 2.67,

Standard Deviation = √2.67 = 1.63.
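
As a rough check, the square-root step can be reproduced in Python. The helper `population_sd` is a hypothetical name. Note that `statistics.pstdev` divides by n, as the formula in these notes does, whereas `statistics.stdev` divides by n − 1 and so returns a larger value for the same data.

```python
import math
from statistics import pstdev, stdev


def population_sd(values):
    """Square root of the population variance (divides by n)."""
    if not values:
        raise ValueError("values must not be empty")
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)


data = [2, 4, 6]
print(round(population_sd(data), 2))  # 1.63
print(round(pstdev(data), 2))         # 1.63
print(round(stdev(data), 2))          # 2.0 (sample SD, divides by n - 1)
```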

5) Analysing Data Using Interquartile Range (IQR): IQR focuses on the
central portion of the data and ignores extreme values (outliers).
It is therefore robust and less sensitive to outliers than the range.

Detecting Outliers:

Outliers are values that lie below the lower bound or above the upper bound:

Lower Bound = Q1 − 1.5 × IQR

Upper Bound = Q3 + 1.5 × IQR

Example:

Dataset: 4, 8, 15, 16, 23, 42, 50

Q1 = 8, Q3 = 42,

IQR = 42 – 8 = 34.

Lower Bound = 8 - 1.5(34) = −43,

Upper Bound = 42 + 1.5(34) = 93.

All data points, including the extremes 4 and 50, lie within
the bounds −43 and 93, so there are no outliers.
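
The bound check above can be sketched in Python as follows. The function names `iqr_bounds` and `find_outliers` are illustrative, and the quartiles are passed in directly (Q1 = 8, Q3 = 42, as in the example) rather than recomputed. The value 120 in the last line is a made-up data point, added only to show a value being flagged against the same bounds.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(q1, q3, k)
    return [x for x in values if x < lower or x > upper]


data = [4, 8, 15, 16, 23, 42, 50]
print(iqr_bounds(8, 42))                   # (-43.0, 93.0)
print(find_outliers(data, 8, 42))          # []
print(find_outliers(data + [120], 8, 42))  # [120]
```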

6) WEKA: WEKA (Waikato Environment for Knowledge
Analysis) is a machine learning tool for data preprocessing,
visualization, and applying machine learning algorithms.

Steps to Analyse Dispersion in WEKA:

a) Load Dataset: Open WEKA and load your dataset in
ARFF, CSV, or other supported formats.
b) Explore Summary Statistics:
i. Go to the Preprocess tab.
ii. Click an attribute in the attribute list to see its summary
statistics, such as Min, Max, Mean, and Standard Deviation.
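c) Filter for Quartiles or IQR:
i. Use the Discretize filter with four equal-frequency bins to
bin numeric data into quartiles (NumericToNominal only changes
the attribute type; it does not bin values).
ii. Use the InterquartileRange filter to flag outliers and extreme
values, then remove or otherwise handle the flagged instances.
The filter's outlier factor is configurable; set it to 1.5 to
match the rule above.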
d) Visualize Data: Use the Visualize tab for scatter plots, together
with the attribute histogram in the Preprocess tab, to inspect
dispersion and spot potential outliers.
e) Export Results: Export pre-processed data or statistical
outputs for further analysis.
