Basic Statistics

Descriptive statistics summarize and organize data to reveal its main features, using measures such as central tendency (mean, median, mode) and dispersion (range, variance, standard deviation). It also includes concepts like skewness and kurtosis to describe data distribution, which is essential for exploratory data analysis. Understanding these statistics is crucial for effective data analysis and model selection.

Uploaded by

Hanish Manikanta
Copyright
© All Rights Reserved

1.3.1. Descriptive Statistics

Descriptive statistics is the branch of statistics that focuses on summarizing and organizing data in a
meaningful way. It helps us understand the main features of a dataset by providing simple
summaries about the sample and its observations. These summaries can be quantitative (e.g.,
numbers like mean or standard deviation) or visual (e.g., graphs and charts). The goal is to describe
the characteristics of a dataset without drawing conclusions beyond the data itself or making
inferences about the population.

Measures of Central Tendency

Measures of central tendency are single values that describe a dataset by identifying its central
position. They are often called "averages."

Median

The median is the middle value in a dataset when the values are arranged in ascending or
descending order. If the dataset has an even number of observations, the median is the average of
the two middle values.

• Use Case: Ideal for skewed distributions or data with outliers, as it is robust to extreme
values.

• Process:

1. Arrange data in order.

2. If n is odd, the median is the middle value.

3. If n is even, the median is the average of the two middle values.

Examples: For the dataset [10, 20, 30, 40, 50] (odd n): Ordered: [10, 20, 30, 40, 50]. Median = 30
(the middle value)

For the dataset [10, 20, 30, 40] (even n): Ordered: [10, 20, 30, 40]. Median = (20 + 30) / 2 = 25
(the average of the two middle values, 20 and 30)
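
The three-step process above can be sketched in Python. This is a minimal illustration, not a reference implementation; the function name `median` is chosen here for the example (Python's standard library also provides `statistics.median`).

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)              # Step 1: arrange data in order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:                        # Step 2: odd n -> the middle value
        return ordered[mid]
    # Step 3: even n -> average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([10, 20, 30, 40, 50]))  # 30
print(median([10, 20, 30, 40]))      # 25.0
```

Because the input is sorted first, the order in which the values are supplied does not matter.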

Mode

The mode is the value that appears most frequently in a dataset. A dataset can have one mode
(unimodal), multiple modes (multimodal), or no mode if all values appear with the same frequency.

• Use Case: Useful for categorical or discrete data; it can be used for numerical data as well.

• Limitation: Not always unique or informative for continuous data.

Examples: For the dataset [10, 20, 20, 30, 40]: Mode = 20 (appears twice)

For the dataset [10, 20, 20, 30, 30, 40]: Modes = 20 and 30 (bimodal)

For the dataset [10, 20, 30, 40]: No Mode (all values appear once)
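
The three cases above (one mode, several modes, no mode) can be reproduced with a short Python sketch. The function name `modes` is hypothetical, and the "no mode" rule follows the convention stated in this section: when every value appears equally often, an empty list is returned.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if all values are equally frequent."""
    counts = Counter(values)              # value -> number of occurrences
    if not counts:
        return []
    highest = max(counts.values())
    # Convention from the text: no mode when all distinct values share one frequency.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    # Values are reported in order of first appearance.
    return [value for value, count in counts.items() if count == highest]


print(modes([10, 20, 20, 30, 40]))      # [20]
print(modes([10, 20, 20, 30, 30, 40]))  # [20, 30]
print(modes([10, 20, 30, 40]))          # []
```

The same function works for categorical data such as strings, since it only counts occurrences.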

Measures of Dispersion

Measures of dispersion (or variability) describe how spread out or scattered the data points are
around the central tendency. They provide information about the range of the data, helping us
understand the diversity or consistency within a dataset.

Range

The range is the simplest measure of dispersion, calculated as the difference between the maximum
and minimum values in a dataset.

• Formula: Range = Maximum Value - Minimum Value

• Use Case: Quick and easy to calculate.

• Limitation: Highly sensitive to outliers; only considers the two extreme values.

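Example: For the dataset [10, 20, 30, 40, 50]: Range = 50 − 10 = 40

Variance

The variance is the average of the squared deviations of the data points from the mean.

• Use Case: Provides a measure of overall data dispersion.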

• Limitation: The units of variance are squared, which can make it hard to interpret in the
context of the original data.
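
No variance formula appears at this point in the text, so the block below gives the two usual textbook definitions as a reference. The symbols are assumptions introduced here: N and μ for the size and mean of a population, n and x̄ for the size and mean of a sample.

```latex
% Population variance: divide by N
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

% Sample variance: divide by n - 1
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

% Worked example for [10, 20, 30, 40, 50], mean = 30:
% squared deviations: 400 + 100 + 0 + 100 + 400 = 1000
\sigma^2 = \frac{1000}{5} = 200, \qquad s^2 = \frac{1000}{4} = 250
```

The result is in squared units, which is the interpretability limitation noted above.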

Standard Deviation

The standard deviation is the square root of the variance. It is the most commonly used measure of
dispersion because it is expressed in the same units as the original data, making it more
interpretable than variance.
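
The relationship between range, variance, and standard deviation can be illustrated with a small Python sketch. The function names and the `sample` flag are assumptions made for this example; whether to divide by n (population) or n - 1 (sample) depends on the context, which this section does not specify.

```python
import math


def value_range(data):
    """Range = maximum value - minimum value."""
    return max(data) - min(data)


def variance(data, sample=False):
    """Average squared deviation from the mean (n - 1 divisor if sample=True)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return squared_deviations / (n - 1 if sample else n)


def standard_deviation(data, sample=False):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(data, sample))


data = [10, 20, 30, 40, 50]
print(value_range(data))              # 40
print(variance(data))                 # 200.0
print(variance(data, sample=True))    # 250.0
print(standard_deviation(data))       # about 14.14
```

The standard deviation of about 14.14 is in the original units, whereas the variance of 200 is in squared units.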

Quartiles and Interquartile Range (IQR)

Quartiles divide a dataset into four equal parts.

• Q1 (First Quartile / Lower Quartile): The value below which 25% of the data falls (the
median of the lower half of the data).

• Q2 (Second Quartile / Median): The value below which 50% of the data falls (the overall
median).

• Q3 (Third Quartile / Upper Quartile): The value below which 75% of the data falls (the
median of the upper half of the data).

The Interquartile Range (IQR) is the range of the middle 50% of the data. It's the difference between
the third quartile (Q3) and the first quartile (Q1).

• Formula: IQR = Q3 - Q1

• Use Case: A robust measure of dispersion, less affected by outliers than the range, often
used for outlier detection.

• Visualization: Often represented using box plots.

Example: For the dataset [10, 15, 20, 25, 30, 35, 40]

1. Order the data: [10, 15, 20, 25, 30, 35, 40]

2. Median (Q2) = 25

3. Lower half: [10, 15, 20]. Q1 = 15

4. Upper half: [30, 35, 40]. Q3 = 35

5. IQR = 35 − 15 = 20
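
The worked example above can be reproduced with the following Python sketch, which uses the "median of each half" method described in this section (the overall median is excluded from both halves when n is odd). Library routines often use other interpolation rules and may return slightly different quartiles. The function names are hypothetical, and the 1.5 × IQR cutoff in `iqr_outliers` is a common convention assumed here, not something stated in the text.

```python
def _median_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]                                    # below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]   # above the median
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumed convention."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return [x for x in values if x < q1 - k * iqr or x > q3 + k * iqr]


q1, q2, q3 = quartiles([10, 15, 20, 25, 30, 35, 40])
print(q1, q2, q3, q3 - q1)   # 15 25 35 20
```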

Data Distribution

Understanding data distribution is crucial for choosing appropriate statistical methods and machine
learning models. The distribution describes the shape of the data when plotted.

Skewness

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random
variable about its mean. It indicates the direction and magnitude of a distribution's departure from
symmetry.

• Positive Skew (Right-Skewed):

o The "tail" of the distribution is longer on the right side.

o Typically, Mean > Median > Mode (a rule of thumb, not a guarantee).

o Common in datasets where there's a lower bound (e.g., incomes, house prices, test
scores where many people score low and a few score very high).

• Negative Skew (Left-Skewed):

o The "tail" of the distribution is longer on the left side.

o Typically, Mean < Median < Mode (a rule of thumb, not a guarantee).

o Common in datasets where there's an upper bound (e.g., exam scores where most
people score high, but a few score very low).

• Zero Skew (Symmetrical):

o The distribution is symmetrical around its mean.

o Mean = Median = Mode (e.g., Normal Distribution).

• Interpretation:

o Skewness = 0: Perfectly symmetrical.

o Skewness > 0: Positively skewed.

o Skewness < 0: Negatively skewed.

o Absolute value of skewness: indicates the degree of asymmetry. Generally, a skewness
value between -0.5 and 0.5 is considered roughly symmetrical.
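
The interpretation rules above can be tried out with a short Python sketch. No skewness formula is given in this section, so the code assumes one common definition: the third central moment divided by the second central moment raised to the power 1.5. Statistical libraries may apply additional sample-size corrections and give slightly different values. The function name `skewness` is chosen for this example.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


symmetric = [10, 20, 30, 40, 50]
right_tailed = [1, 1, 2, 2, 3, 10]     # one large value stretches the right tail
left_tailed = [1, 8, 9, 9, 10, 10]     # one small value stretches the left tail

print(skewness(symmetric))             # 0.0
print(skewness(right_tailed) > 0)      # True
print(skewness(left_tailed) < 0)       # True
```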

Kurtosis

Kurtosis measures the "tailedness" of the probability distribution of a real-valued random variable. It
describes the shape of the tails of the distribution relative to the tails of a normal distribution (which
has a kurtosis of 3, or an excess kurtosis of 0). It tells us about the presence of outliers.

• Leptokurtic (Positive Excess Kurtosis / High Kurtosis):

o Has fatter tails and a sharper peak than a normal distribution.

o Indicates more extreme outliers (more observations in the tails).

• Platykurtic (Negative Excess Kurtosis / Low Kurtosis):

o Has thinner tails and a flatter peak than a normal distribution.

o Indicates fewer or less extreme outliers.

• Mesokurtic (Kurtosis = 3 or Excess Kurtosis = 0):

o Has tails similar to a normal distribution.

o The normal distribution itself is mesokurtic.

• Interpretation (often using Excess Kurtosis, where Normal Distribution = 0):

o Excess Kurtosis = 0: Mesokurtic (like a normal distribution).

o Excess Kurtosis > 0: Leptokurtic (more pointed peak, fatter tails, more outliers).

o Excess Kurtosis < 0: Platykurtic (flatter peak, thinner tails, fewer outliers).
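
The excess kurtosis categories above can be illustrated with a Python sketch. It assumes the usual moment definition, the fourth central moment divided by the squared second central moment, minus 3 so that a normal distribution scores 0. Libraries may apply sample-size corrections, and the function names and the tolerance `tol` are assumptions made for this example.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2 - 3


def kurtosis_type(excess, tol=1e-9):
    """Map an excess kurtosis value to the category names used above."""
    if excess > tol:
        return "leptokurtic"
    if excess < -tol:
        return "platykurtic"
    return "mesokurtic"


flat = [10, 20, 30, 40, 50]                        # evenly spread, thin tails
heavy = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]          # mostly central, two extremes

print(kurtosis_type(excess_kurtosis(flat)))        # platykurtic (about -1.3)
print(kurtosis_type(excess_kurtosis(heavy)))       # leptokurtic (about 2.0)
```

Small samples like these only illustrate the sign of the measure; kurtosis estimates are noisy unless the dataset is reasonably large.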

Understanding these descriptive statistics is foundational for exploratory data analysis (EDA), helping
you quickly grasp the characteristics of your data before applying more advanced analytical
techniques or building machine learning models.
