The Data Analyst's Guide to
Data Types, Distributions, and Statistical
Tests.
ANDREW MADSON
DATA TYPES
WHY IT MATTERS
1. Appropriate Analysis: Different types of data require
different statistical tests. For example, nominal data can be
analyzed using a Chi-square test, while interval data can be
analyzed using a t-test or ANOVA. Using the wrong test can
lead to incorrect conclusions.
2. Data Visualization: Your data type determines the best
way to visualize it. For instance, categorical data might be
best represented in a bar chart, while continuous data
might be better suited for a histogram or scatter plot.
3. Data Transformation: Understanding your data type can
guide you in transforming your data, if necessary. For
example, ordinal data might be converted into interval data
under certain conditions, or continuous data might be
categorized into ordinal data.
4. Data Quality: Knowing your data type can help you identify potential errors or inconsistencies in your data. For instance, if you expect a variable to be continuous and find string values, this could indicate a data quality issue (a quick pandas check is sketched after this list).
5. Interpretation of Results: The type of data you have influences how you interpret your results. For example, if you have ordinal data, you can make statements about the order of values but not the difference between values.
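For point 4, a minimal pandas sketch of this kind of check; the column name and values are made up purely for illustration:

```python
# Minimal sketch: flagging unexpected strings in a column that should be continuous.
import pandas as pd

df = pd.DataFrame({"height_cm": ["172.4", "165.0", "n/a", "181.2"]})

print(df.dtypes)  # height_cm shows up as object (strings), not a numeric dtype

# Coerce to numeric; values that cannot be parsed become NaN,
# which surfaces the data quality issue described above.
df["height_cm"] = pd.to_numeric(df["height_cm"], errors="coerce")
print(df["height_cm"].isna().sum(), "value(s) could not be parsed")
```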
QUANTITATIVE
Numerical data that can be measured
or counted and can be represented
numerically, such as height, weight,
or temperature.
QUALITATIVE
Non-numerical data that consists of
descriptive information, such as
colors, tastes, textures, or any other
characteristics that cannot be
counted or measured.
QUANTITATIVE DATA TYPES
DISCRETE: Distinct and separate values with no intermediate values in between.
CONTINUOUS: Infinitely divisible and can take on any value within a certain range or interval. Encompasses both INTERVAL and RATIO data.
INTERVAL: Continuous data type; numerical data where the intervals between values are equal but no true zero point exists.
RATIO: Continuous data type; numerical data with a true zero point, allowing for meaningful ratios and comparisons between values.
QUALITATIVE DATA TYPES
CATEGORICAL: Distinct categories or groups with no inherent order or numerical significance.
ORDINAL: Data with a natural order or ranking among its categories, indicating relative differences or preferences.
BINARY: Categorical data that has only two possible outcomes or categories.
DISTRIBUTIONS
DISTRIBUTION TYPES
WHY IT MATTERS
1. Understanding the data: Understanding the distribution of
your data gives insight into the nature and behavior of the
variables you are studying. It helps you identify your data's
patterns, trends, and potential outliers.
2. Statistical assumptions: Many statistical tests and models
make assumptions about the distribution of the data. For
example, the t-test assumes that the data follows a normal
distribution. If these assumptions are violated, it can lead to
incorrect conclusions. Knowing the distribution of your data
helps you choose the appropriate statistical methods.
3. Predictive modeling: When building predictive models, the
distribution of the data can inform the selection of
algorithms or the model's configuration. Some machine
learning algorithms are more suited to certain types of
distributions.
4. Data transformation: If your data does not follow the distribution required by a particular statistical method, you may need to transform it. For example, if your data is skewed, you might apply a logarithmic transformation to make it more symmetrical. Understanding the distribution can guide these transformations (a minimal sketch follows this list).
5. Risk management: In fields like finance and insurance,
understanding data distribution is crucial for risk
assessment. For example, the distribution of returns on
investment can help determine the probability of a
significant loss.
6. Data quality: Examining data distribution can also be a way
to check data quality. If the data doesn't follow expected
distributions, it may indicate errors or bias in the data
collection process.
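For point 4 above, a minimal sketch of checking skewness and applying a log transform; the data are simulated purely for illustration:

```python
# Minimal sketch: measure skewness, then log-transform a right-skewed sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)  # strictly positive, right-skewed

print("skewness before:", stats.skew(x))
x_log = np.log(x)                                   # logarithmic transformation
print("skewness after: ", stats.skew(x_log))        # closer to 0, i.e. more symmetric
```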
PARAMETRIC
Assume that the data follows a
certain specific distribution pattern,
and the parameters of that
distribution are estimated from the
data.
NON-PARAMETRIC
Do not assume that the data follow
any specific distribution. They are
defined without the assumption of
underlying parameters.
PARAMETRIC DISTRIBUTIONS
NORMAL: Symmetric around the mean, showing that data near the mean are more frequent in occurrence than data far from the mean.
WEIBULL: Continuous probability distribution that models the time it takes for an event to occur; commonly used in reliability and survival analysis.
POISSON: Discrete probability distribution that models the number of events occurring in a fixed interval of time or space.
EXPONENTIAL: Continuous probability distribution that models the time between events in a Poisson process, where events occur independently and at a constant average rate.
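A minimal sketch of working with these parametric distributions in scipy.stats; the parameter values are illustrative assumptions, not recommendations:

```python
# Minimal sketch: sample from and fit the parametric distributions above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

normal_sample = stats.norm.rvs(loc=50, scale=5, size=500, random_state=rng)
weibull_sample = stats.weibull_min.rvs(c=1.5, scale=10, size=500, random_state=rng)
poisson_sample = stats.poisson.rvs(mu=3, size=500, random_state=rng)
expon_sample = stats.expon.rvs(scale=2.0, size=500, random_state=rng)

# Estimate the normal distribution's parameters from the observed sample.
mu_hat, sigma_hat = stats.norm.fit(normal_sample)
print(f"fitted normal: mean={mu_hat:.2f}, sd={sigma_hat:.2f}")
```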
NON-PARAMETRIC DISTRIBUTIONS
UNIFORM: Probability distribution where all outcomes or values within a given range have an equal probability of occurring.
EMPIRICAL: Based on observed data rather than being derived from a known mathematical formula.
BERNOULLI: Discrete probability distribution representing a random experiment with only two possible outcomes, typically denoted as success (1) or failure (0), each with a fixed probability.
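A minimal sketch of uniform and Bernoulli samples plus an empirical CDF built directly from observed data; all values are illustrative:

```python
# Minimal sketch: uniform and Bernoulli samples, and an empirical CDF.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

uniform_sample = stats.uniform.rvs(loc=0, scale=10, size=500, random_state=rng)
bernoulli_sample = stats.bernoulli.rvs(p=0.3, size=500, random_state=rng)

# Empirical CDF: at each sorted data point, the fraction of observations <= it.
data = np.sort(uniform_sample)
ecdf_y = np.arange(1, len(data) + 1) / len(data)

print("P(X <= 5) estimated from the ECDF:", ecdf_y[data <= 5][-1])
print("share of Bernoulli successes:", bernoulli_sample.mean())
```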
STATISTICAL TESTS
T-TEST
PURPOSE: Compares the means of two groups
WHEN TO USE IT: To compare two related groups
DISTRIBUTION: Normal
DATA TYPE: Continuous
WHAT IT SHOWS: Whether there is a significant difference between group means
T-TEST OUTPUT
Test Statistic: The t-value, calculated from the difference in means between the two groups and the variability within the groups.
Degrees of Freedom: The number of independent pieces of information available to estimate the population parameter.
p-value: Probability of obtaining the observed difference (or a more extreme difference) between the groups by chance alone, assuming that the null hypothesis is true (i.e., there is no difference between the groups).
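A minimal sketch of running a t-test with scipy.stats; the before/after scores are made-up illustrative data. ttest_rel handles two related (paired) groups, while ttest_ind would be the choice for two independent groups:

```python
# Minimal sketch: paired t-test on illustrative before/after scores.
from scipy import stats

before = [72, 75, 68, 80, 77, 74, 69, 81]
after = [75, 78, 70, 83, 80, 75, 72, 84]

t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value (e.g., < 0.05) suggests a significant difference in means.
```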
CHI-SQUARE
PURPOSE: Test for association between variables
WHEN TO USE IT: Assess the relationship between categorical variables
DISTRIBUTION: No strict distribution requirement
DATA TYPE: Categorical
WHAT IT SHOWS: Whether there are significant differences between observed and expected values
CHI-SQUARE OUTPUT
Chi-Square Value: Measures the discrepancy between the observed and expected frequencies.
Degrees of Freedom: The number of categories minus 1 for a goodness-of-fit test; (rows - 1) x (columns - 1) for a test of association between two categorical variables.
p-value: The probability associated with the test statistic. It indicates the level of statistical significance.
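A minimal sketch of a chi-square test of association with scipy.stats; the contingency table counts are made up for illustration:

```python
# Minimal sketch: chi-square test of association on a 2x2 contingency table.
from scipy.stats import chi2_contingency

observed = [[30, 10],   # e.g., group A: outcome yes / no
            [20, 25]]   # e.g., group B: outcome yes / no

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
# 'expected' holds the frequencies implied by the null hypothesis of no association.
```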
ANOVA
PURPOSE: Compare means of multiple groups
WHEN TO USE IT: Three or more groups
DISTRIBUTION: Normally distributed
DATA TYPE: Numerical
WHAT IT SHOWS: Significant differences between group means
ANOVA OUTPUT
Between Groups: Information about the variation between the different groups being compared.
Within Groups: Information about the variation within each group.
Total: Overall sum of squares and degrees of freedom for the entire dataset, combining the between- and within-group variations.
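A minimal sketch of a one-way ANOVA with scipy.stats; the three groups of measurements are made up for illustration:

```python
# Minimal sketch: one-way ANOVA across three illustrative groups.
from scipy import stats

group_a = [23, 25, 27, 22, 26]
group_b = [30, 29, 31, 32, 28]
group_c = [24, 26, 25, 27, 23]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# A small p-value suggests at least one group mean differs from the others.
```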
REGRESSION
PURPOSE: Examine relationships between variables
WHEN TO USE IT: Predict the value of a dependent variable
DISTRIBUTION: No strict distribution requirement
DATA TYPE: Numerical
WHAT IT SHOWS: The strength and significance of relationships
REGRESSION OUTPUT
Regression Equation: Y = 12.345 + 0.987 * X_Variable
Coefficients: The intercept (12.345) represents the estimated value of the dependent variable when the independent variable (X_Variable) is zero. The coefficient for X_Variable (0.987) represents the estimated change in the dependent variable for a one-unit increase in X_Variable.
R-Square: Proportion of the variance in the dependent variable that is explained by the independent variables.
p-value: Statistical significance of a coefficient.
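A minimal sketch of a simple one-predictor regression with scipy.stats.linregress, mirroring the equation form above; the x/y values are made up for illustration:

```python
# Minimal sketch: simple linear regression with one predictor.
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [13.2, 14.4, 15.1, 16.3, 17.2, 18.0, 19.3, 20.1]

result = stats.linregress(x, y)
print(f"Y = {result.intercept:.3f} + {result.slope:.3f} * X")
print(f"R-squared = {result.rvalue**2:.3f}, p = {result.pvalue:.4f}")
```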
Mann-Whitney U Test
PURPOSE: Compare distributions of two groups
WHEN TO USE IT: Compare distributions of two independent groups
DISTRIBUTION: No strict distribution requirement
DATA TYPE: Numerical/Ordinal
WHAT IT SHOWS: Significant differences in rank order
MANN-WHITNEY OUTPUT
U Statistic: Rank-based test statistic used in the Mann-Whitney U test. It quantifies the degree of difference between the two groups.
p-value: Statistical significance of the test. It indicates the probability of obtaining the observed difference between the groups if there were no true differences in the populations from which the samples were drawn.
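A minimal sketch of a Mann-Whitney U test with scipy.stats; the two groups of scores are made up for illustration:

```python
# Minimal sketch: Mann-Whitney U test on two independent illustrative groups.
from scipy.stats import mannwhitneyu

group_a = [3, 4, 2, 5, 4, 3, 5, 4]
group_b = [1, 2, 2, 3, 1, 2, 3, 2]

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```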
Kruskal-Wallis
PURPOSE: Compare distributions of multiple groups
WHEN TO USE IT: Compare distributions of three or more independent groups
DISTRIBUTION: No strict distribution requirement
DATA TYPE: Numerical/Ordinal
WHAT IT SHOWS: Significant differences in rank order
Kruskal-Wallis Output
H Statistic: Computed from the sums of ranks across all groups; used to assess the differences between the groups.
Degrees of Freedom: Number of groups minus 1.
p-value: Strength of evidence against the null hypothesis (the assumption that there are no differences between the groups).
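A minimal sketch of a Kruskal-Wallis test with scipy.stats; the three groups of ratings are made up for illustration:

```python
# Minimal sketch: Kruskal-Wallis test across three independent illustrative groups.
from scipy.stats import kruskal

group_a = [7, 8, 6, 9, 7]
group_b = [5, 4, 6, 5, 4]
group_c = [8, 9, 9, 7, 8]

h_stat, p_value = kruskal(group_a, group_b, group_c)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
# Degrees of freedom = number of groups - 1 = 2 here.
```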
Pearson's Correlation
PURPOSE: Measure the strength of a linear relationship
WHEN TO USE IT: Assess the strength and direction of a linear relationship
DISTRIBUTION: Normally distributed
DATA TYPE: Numerical
WHAT IT SHOWS: Correlation coefficient and its significance
Pearson's Correlation Output
Correlation Coefficient (r): Strength and direction of the linear relationship between the variables. It ranges from -1 to +1. A positive value indicates a positive correlation, a negative value indicates a negative correlation, and a value close to zero indicates a weak or no correlation.
p-value: Probability of observing the given correlation coefficient by chance.
Sample Size (n): Number of data points used to calculate the correlation coefficient.
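A minimal sketch of Pearson's correlation with scipy.stats; the x/y values are made up for illustration:

```python
# Minimal sketch: Pearson's correlation between two illustrative numerical variables.
from scipy.stats import pearsonr

x = [1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.1, 8.5]
y = [2.0, 2.9, 3.8, 5.1, 5.4, 6.8, 7.0, 8.9]

r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}, n = {len(x)}")
```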
Spearman's Correlation
PURPOSE: Measure the strength of a monotonic relationship
WHEN TO USE IT: Assess the strength and direction of a monotonic relationship
DISTRIBUTION: No strict distribution requirement
DATA TYPE: Numerical/Ordinal
WHAT IT SHOWS: Correlation coefficient and its significance
Spearman's Correlation Output
Correlation Coefficient (r): Strength and direction of the monotonic relationship between the variables. It ranges from -1 to +1. A positive value indicates a positive correlation, a negative value indicates a negative correlation, and a value close to zero indicates a weak or no correlation.
p-value: Probability of observing the given correlation coefficient by chance.
Sample Size (n): Number of data points used to calculate the correlation coefficient.
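A minimal sketch of Spearman's rank correlation with scipy.stats; the values are made up for illustration:

```python
# Minimal sketch: Spearman's rank correlation on illustrative ordinal-style data.
from scipy.stats import spearmanr

x = [1, 2, 3, 4, 5, 6, 7, 8]   # e.g., rank of study hours
y = [2, 1, 4, 3, 6, 5, 8, 7]   # e.g., rank of exam score

rho, p_value = spearmanr(x, y)
print(f"rho = {rho:.3f}, p = {p_value:.4f}, n = {len(x)}")
```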
One-Sample T-Test
PURPOSE: Compare a sample mean to a known population mean
WHEN TO USE IT: Compare a sample mean to a known value
DISTRIBUTION: Normally distributed
DATA TYPE: Numerical
WHAT IT SHOWS: Significant differences between the sample mean and the known population mean
One-Sample T-Test Output
t-statistic: Difference between the sample mean and the hypothesized population mean in terms of standard errors.
p-value: Probability of obtaining the observed difference (or a more extreme difference) between the sample and the hypothesized population by chance alone.
Sample Size (n): Number of data points used to calculate the test statistic.
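A minimal sketch of a one-sample t-test with scipy.stats; the sample values and the hypothesized population mean of 100 are made up for illustration:

```python
# Minimal sketch: one-sample t-test against a hypothesized population mean.
from scipy.stats import ttest_1samp

sample = [102, 98, 105, 101, 99, 103, 97, 104]

t_stat, p_value = ttest_1samp(sample, popmean=100)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, n = {len(sample)}")
```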
Wilcoxon Signed-Rank
PURPOSE: Compare paired samples
WHEN TO USE IT: Compare paired observations
DISTRIBUTION: No strict distribution requirement
DATA TYPE: Numerical/Ordinal
WHAT IT SHOWS: Significant differences between paired observations
Wilcoxon Signed-Rank Output
V (test statistic): Sum of the ranks of the positive differences; used to assess the statistical significance of the test.
p-value: Statistical significance of the test.
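A minimal sketch of a Wilcoxon signed-rank test with scipy.stats; the paired observations are made up for illustration, and note that scipy reports its own test statistic, which may not match the V value printed by R:

```python
# Minimal sketch: Wilcoxon signed-rank test on illustrative paired observations.
from scipy.stats import wilcoxon

before = [12, 15, 11, 18, 14, 13, 16, 17]
after = [14, 16, 12, 20, 15, 15, 18, 19]

stat, p_value = wilcoxon(before, after)
print(f"statistic = {stat:.1f}, p = {p_value:.4f}")
```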
HOORAY!
🥳
Save this post, and tag me
as you develop these data
analytics core skills.
HAPPY LEARNING!
🙌