
SML for DS [BAD702]

Statistical Machine Learning for Data Science Semester 7


Course Code BAD702 CIE Marks 50
Teaching Hours/Week (L:T:P: S) [Link] SEE Marks 50
Total Hours of Pedagogy 40 hours Theory + 8-10 Lab slots Total Marks 100
Credits 04 Exam Hours 3
Examination nature (SEE) Theory/practical
Course objectives:
 Understand Exploratory Data Analysis
 Explain Data and Sampling Distributions
 Analyze statistical experiments and perform significance testing
 Demonstrate how to perform regression analysis on data
 Explain Discriminant Analysis on data
MODULE-1
Exploratory Data Analysis: estimates of locations and variability, exploring data distributions, exploring binary and
categorical data, exploring two or more variables.
Textbook: Chapter 1
MODULE-2
Data and Sampling Distributions: Random sampling and bias, selection bias, sampling distribution of statistic,
bootstrap, confidence intervals, data distributions: normal, long tailed, student’s-t, binomial, Chi-square, F distribution,
Poisson and related distributions.
Textbook: Chapter 2
MODULE-3
Statistical Experiments and Significance Testing: A/B testing, hypothesis testing, resampling, statistical
significance & p-values, t-tests, multiple testing, degrees of freedom.
Textbook: Chapter 3
MODULE-4
Multi-arm bandit algorithm, power and sample size, factor variables in regression, interpreting the regression equation,
Regression diagnostics, Polynomial and Spline Regression.
Textbook: Chapter 3 & 4
MODULE-5
Discriminant Analysis: Covariance Matrix, Fisher’s Linear discriminant, Generalized Linear Models, Interpreting the
coefficients and odd ratios, Strategies for Imbalanced Data.
Textbook: Chapter 5
Course outcomes (Course Skill Set):
At the end of the course, the student will be able to:
● Analyse data sets using techniques to estimate variability, exploring distributions, and investigating
relationships between variables.
● Apply random sampling, confidence intervals, and recognize various data distributions on datasets.
● Perform significance testing and identify statistical significance.
● Apply regression analysis for prediction, interpret regression equations, and assess regression diagnostics.
● Perform discriminant analysis on the varieties of datasets.
Suggested Learning Resources:
Books
1. Peter Bruce, Andrew Bruce and Peter Gedeck, “Practical Statistics for Data Scientists”, 2nd edition, O’Reilly
Publications, 2020.
Web links and Video Lectures (e-Resources):
Statistical learning for Reliability Analysis: [Link]
Engineering Statistics: [Link]


Module 1
Exploratory Data Analysis
Syllabus: Estimates of locations and variability, exploring data distributions, exploring binary
and categorical data, exploring two or more variables.
Textbook: Peter Bruce, Andrew Bruce and Peter Gedeck, “Practical Statistics for Data Scientists”, 2nd
edition, O’Reilly Publications, 2020 – Chapter 1

1.1 Elements of Structured Data:


Data comes from diverse sources like sensors, text, images, and videos, much of which is
unstructured. The Internet of Things (IoT) contributes heavily to this data flow. Unstructured
data—like pixel-based images, formatted text, or user clickstreams—must be transformed into
structured formats (such as tables) to be useful. A key challenge in data science is converting
raw data into structured, actionable information for analysis.

Structured data is of two main types: Numeric and Categorical.

1. Numeric
Data that are expressed on a numeric scale.
 Continuous
o Data that can take on any value in an interval. (Synonyms: interval, float,
numeric). (e.g., wind speed, time duration)
 Discrete
o Data that can take on only integer values, such as counts. (Synonyms: integer,
count). (e.g., event counts)
2. Categorical
Data that can take on only a specific set of values representing a set of possible categories.
(Synonyms: enums, enumerated, factors, nominal)
 Nominal values
o The categories represent names or labels with no inherent order or ranking among
them. (e.g., TV types, state names)
 Binary
o A special case of categorical data with just two categories of values, e.g., 0/1,
yes/no, or true/false. (Synonyms: dichotomous, logical, indicator, boolean)


 Ordinal
o Categorical data that has an explicit ordering. (Synonym: ordered factor) (e.g.,
ordered categories, like ratings from 1 to 5)

Why is identifying data as categorical or ordinal useful in analytics?


1. Guides Statistical Procedures:
o Knowing a variable is categorical signals software to handle it differently in
models, charts, and summaries.
o For ordinal data, order is preserved (e.g., using an ordered factor in R or
OrdinalEncoder in Python's scikit-learn).
2. Optimizes Storage and Indexing:
o Categorical data can be stored more efficiently in databases, similar to how enums
are handled.
o Indexing such data is also more effective, improving query performance.
3. Enforces Allowed Values:
o Software can restrict the variable to a predefined set of values, reducing errors
and maintaining data consistency.
4. Improves Data Modeling:
o Categorical treatment allows correct encoding for machine learning algorithms
(e.g., one-hot, label, or ordinal encoding).
5. Clarifies Intent:
o Distinguishing categorical data from raw text helps avoid misinterpretation (e.g.,
"low", "medium", "high" as levels, not arbitrary strings).

What is a potential drawback of treating data as categorical, especially during import?


1. Automatic Conversion in R:
o In R, functions like read.csv() automatically convert text columns to factors
(categorical data types).
o This behavior restricts allowable values to only those seen during import.
2. Unintended Side Effect:
o If a new, unseen category is assigned to such a column later, R will raise a
warning and treat it as NA (missing value).
3. Python Handles Differently:


o In Python, the pandas library does not automatically convert text to categorical
when using read_csv().
o However, users can explicitly specify a column as categorical during or after
import if desired.
4. Implication:
o While categorical encoding improves performance and consistency, it can lead to
unexpected behavior if not managed carefully, especially in dynamic or
evolving datasets.
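
To make the pandas side concrete, here is a minimal sketch (the data, file name, and column names
are invented for illustration) showing a plain categorical, an ordered categorical, and one-hot encoding:

import pandas as pd

# hypothetical data; the point is the dtype handling, not the values
df = pd.DataFrame({'grade': ['A', 'B', 'A', 'C'],
                   'risk': ['low', 'high', 'medium', 'low']})

df['grade'] = df['grade'].astype('category')            # nominal categorical
df['risk'] = df['risk'].astype(
    pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True))  # ordinal: order preserved

dummies = pd.get_dummies(df['grade'], prefix='grade')   # one-hot encoding for modeling
print(df.dtypes)
print(dummies)
# pd.read_csv('file.csv', dtype={'grade': 'category'}) would mark a column as categorical at import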

1.2 Rectangular Data:


In data science, analysis typically uses rectangular data—a two-dimensional format with rows
as records and columns as features, known as a data frame in R and Python. Since data often
starts in unstructured forms like text, it must be processed into structured features. Similarly,
relational database data needs to be combined into a single table for effective analysis and
modelling.


Table 1-1 includes both measured data (like duration and price) and categorical data (like
category and currency). A special type of categorical variable is the binary variable (e.g., yes/no
or 0/1), such as the indicator showing whether an auction was competitive (had multiple bidders).
This binary variable can also serve as the outcome variable in predictive scenarios.

Data Frames and Indexes:


Traditional database tables use index columns to improve query efficiency. In Python (pandas),
the main data structure is a DataFrame, which uses an automatic integer index by default but also
supports multilevel/hierarchical indexes.
In R, the primary structure is a data.frame, which has an implicit row-based index. However, it
lacks native support for custom or multilevel indexes. To address this, the data.table and dplyr
packages are widely adopted in R, offering support for such indexes and improving performance.

Nonrectangular Data Structures


Besides rectangular data, other important data structures include:
 Time series data, which captures sequential measurements of a variable over time. It's
essential for forecasting and is commonly generated by IoT devices.
 Spatial data, used in mapping and location analytics, comes in two main forms:
o Object view: centers on objects (like houses) and their coordinates.
o Field view: focuses on spatial units (like pixels) and their measured values (e.g.,
brightness).
 Graph (or network) data structures represent relationships—physical, social, or
abstract.
o Examples include social networks (like Facebook), and physical networks (like
road-connected distribution hubs).


o Graphs are especially useful for problems like network optimization and
recommender systems.
1.3 Estimates of Location
Variables with measured or count data might have thousands of distinct values.
A basic step in exploring your data is getting a “typical value” for each feature (variable): an
estimate of where most of the data is located (i.e., its central tendency).


Estimates & Metrics:


In statistics, a calculated value from data is called an estimate, highlighting the uncertainty and
theoretical focus of the field.
In contrast, data scientists and business analysts use the term metric, emphasizing practical
measurement aligned with business goals. This reflects a key difference: statisticians focus on
uncertainty, while data scientists focus on actionable outcomes.

Mean
The mean (or average) is the simplest estimate of location, calculated by summing all values and
dividing by the number of observations: x̄ = (x₁ + x₂ + ... + xₙ) / n. For example, the mean of
{3, 5, 1, 2} is 11/4 = 2.75. The sample mean is often denoted as x̄ (x-bar). In formulas, n refers to
the sample size, while N may refer to the full population—though this distinction is less
emphasized in data science.

A trimmed mean is a variation where a fixed number of the smallest and largest values are
removed before computing the average. This helps reduce the influence of outliers. The formula
for a trimmed mean removes p values from each end of the sorted dataset and averages the
remaining n - 2p values.

A trimmed mean reduces the impact of extreme values by removing the highest and lowest
values before averaging. For instance, in international diving, judges’ top and bottom scores are
dropped to prevent bias. Trimmed means are often more reliable than regular means.
A weighted mean gives different importance to values based on assigned weights. It is calculated
by multiplying each value by its weight and dividing by the sum of the weights: x̄w = Σ wᵢxᵢ / Σ wᵢ.

Weighted means are useful when:


1. Some values are more variable, so less reliable data gets lower weight.
2. Data isn't representative, and weights adjust for underrepresented groups (e.g., in
online experiments).
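
As a small illustration (the sample values below are invented), the ordinary, trimmed, and weighted
means can be computed with NumPy and SciPy:

import numpy as np
from scipy.stats import trim_mean

x = np.array([3, 5, 1, 2, 100])        # small sample with one extreme value
print(x.mean())                        # ordinary mean = 22.2, pulled up by the outlier
print(trim_mean(x, 0.2))               # drops 20% from each end -> mean of [2, 3, 5]

w = np.array([1, 1, 1, 1, 0.1])        # down-weight the unreliable observation
print(np.average(x, weights=w))        # weighted mean = sum(w*x) / sum(w)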


Median and Robust Estimates


The median is the middle value in a sorted dataset. If the number of values is even, the median
is the average of the two central numbers. Unlike the mean, which uses all data points and is
sensitive to extreme values, the median focuses only on the center of the data, making it more
robust to outliers.
For example, when comparing household incomes in neighborhoods, the mean can be skewed
by extremely wealthy individuals (like Bill Gates), whereas the median gives a more accurate
picture of a "typical" income.
Neighborhood={45,50,55,60,1000}
 Mean = (45 + 50 + 55 + 60 + 1000) / 5 = 242
 Median = Middle value = 55

Outliers:
The median is referred to as a robust estimate of location since it is not influenced by outliers
(extreme cases) that could skew the results. An outlier is any value that is very distant from the
other values in a data set. The exact definition of an outlier is somewhat subjective, although
certain conventions are used in various data summaries and plots.
Being an outlier in itself does not make a data value invalid or erroneous (as in the previous
example with Bill Gates). Still, outliers are often the result of data errors such as mixing data of
different units (kilometers versus meters) or bad readings from a sensor. When outliers are the
result of bad data, the mean will result in a poor estimate of location, while the median will still
be valid. In any case, outliers should be identified and are usually worthy of further investigation.

The trimmed mean is another robust estimate of location, offering protection against
outliers by removing a fixed percentage of the lowest and highest values (e.g., 10% from each
end). It serves as a compromise between the mean and median—more resistant to extreme values
than the mean, while still utilizing more data than the median. This makes it especially useful for
reducing the impact of outliers in most datasets.

Example: Location Estimates of Population and Murder Rates


Table 1-2 shows the first few rows in the data set containing population and murder rates (in
units of murders per 100,000 people per year) for each US state (2010 Census).


Table 1-2. A few rows of the data.frame state of population and murder rate by state

To compute mean and median in Python we can use the pandas methods of the data frame. The
trimmed mean requires the trim_mean function in scipy.stats:
import pandas as pd
from scipy.stats import trim_mean

state = pd.read_csv('state.csv')
state['Population'].mean()            # 6162876
trim_mean(state['Population'], 0.1)   # 4783697
state['Population'].median()          # 4436370
The mean is bigger than the trimmed mean, which is bigger than the median. This is because the
trimmed mean excludes the largest and smallest five states (trim=0.1 drops 10% from each end).
If we want to compute the average murder rate for the country, we need to use a weighted mean
or median to account for different populations in the states.

1.4 Estimates of Variability


 Location is one dimension to summarize a feature (e.g., mean, median).
 Variability (or dispersion) is the second key dimension in summarizing data.
 It shows whether data values are clustered closely or spread out.
 Variability is central to statistics and involves:
o Measuring variability
o Reducing variability
o Distinguishing random variability from real variability
o Identifying sources of real variability
o Making decisions in the presence of variability


Standard Deviation and Related Estimates


Estimates of variation measure how data values deviate from a central location, like the mean or
median. For example, in the data set {1, 4, 4}, the mean is 3. The deviations from the mean (–2,
1, 1) indicate how spread out the values are around this central point.
To measure variability, we estimate a typical deviation from the mean. Since positive and
negative deviations cancel each other out, we use the mean absolute deviation (MAD) instead.
This involves averaging the absolute values of the deviations. For the data {1, 4, 4}, the
deviations from the mean (3) are {–2, 1, 1}, their absolute values are {2, 1, 1}, and the MAD is
(2 + 1 + 1) / 3 = 1.33.

Mean absolute deviation = (|x₁ − x̄| + |x₂ − x̄| + ... + |xₙ − x̄|) / n, where x̄ is the sample mean.


The best-known estimates of variability are the variance and the standard deviation, which are
based on squared deviations. The variance is an average of the squared deviations, and the
standard deviation is the square root of the variance:
Variance: s² = Σ(xᵢ − x̄)² / (n − 1)        Standard deviation: s = √(variance)

 Standard deviation is easier to interpret than variance because it is on the same scale
as the original data.
 Despite being less intuitive than the mean absolute deviation (MAD), standard
deviation is more commonly used.
 The preference for standard deviation comes from statistical theory—squared values are
easier to handle mathematically than absolute values.
 Squared deviations simplify calculations in statistical models, making standard
deviation more practical in theory and application.
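
The {1, 4, 4} example above can be reproduced in a few lines of NumPy (a sketch, not from the
textbook):

import numpy as np

x = np.array([1, 4, 4])
deviations = x - x.mean()              # [-2, 1, 1]
print(np.abs(deviations).mean())       # mean absolute deviation = 1.33
print(x.var(ddof=1))                   # sample variance (n - 1 denominator) = 3.0
print(x.std(ddof=1))                   # sample standard deviation ≈ 1.73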


Degrees of Freedom, and n or n – 1?


In statistics, variance is often calculated using n – 1 in the denominator instead of n, introducing the concept of
degrees of freedom. While the difference is minor when n is large, it's important for accurate estimation. Using
n tends to underestimate the population variance (a biased estimate), whereas dividing by n – 1 gives an
unbiased estimate, making it more accurate when working with samples.
The concept of degrees of freedom explains why we divide by n – 1 in variance calculations. Since the sample
mean is used in the formula, one value is constrained, leaving n – 1 independent values. This adjustment corrects
the bias that occurs when dividing by n. However, for most practical applications, data scientists typically don't
need to worry about degrees of freedom in detail.

Variance, standard deviation, and mean absolute deviation (MAD) are not robust to outliers
or extreme values. Variance and standard deviation are particularly sensitive because they rely
on squared deviations, which amplify the effect of outliers. A robust alternative is the median
absolute deviation from the median, also called MAD, which better resists the influence of
extreme values.
Median absolute deviation = median(|x₁ − m|, |x₂ − m|, ..., |xₙ − m|), where m is the median.

 The median absolute deviation (MAD) uses the median as the center and is not
affected by outliers, making it a robust measure of variability.
 A trimmed standard deviation can also be used for more robust estimation, similar to
the trimmed mean.
 Variance, standard deviation, mean absolute deviation, and median absolute
deviation are not equivalent, even for normally distributed data.
 Generally:
Standard deviation > Mean absolute deviation > Median absolute deviation
 To align MAD with the standard deviation for a normal distribution, it is multiplied by
a scaling factor, commonly 1.4826, ensuring comparability.
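
A quick simulation with synthetic normal data (a sketch) illustrates this ordering and the 1.4826
scaling factor:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=10, size=100_000)     # normal data with true sigma = 10

median_ad = np.median(np.abs(x - np.median(x)))   # raw median absolute deviation ≈ 6.7
print(x.std(ddof=1))                              # standard deviation ≈ 10
print(np.mean(np.abs(x - x.mean())))              # mean absolute deviation ≈ 8
print(median_ad * 1.4826)                         # scaled MAD ≈ 10, comparable to the std dev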

Estimates Based on Percentiles


An alternative way to estimate dispersion is by analyzing the spread of sorted (ranked) data,
known as order statistics. The simplest measure is the range—the difference between the
maximum and minimum values. While these extreme values help in detecting outliers, the
range is highly sensitive to outliers and is not considered a reliable general measure of
variability.


To reduce sensitivity to outliers, variability can be estimated using percentiles, which are values
based on the ranked position of data. The Pth percentile is a value below which P%
of the data falls. For example, the median is the 50th percentile, and the .8 quantile is the same
as the 80th percentile.
A common robust measure of spread is the interquartile range (IQR), calculated as the
difference between the 75th and 25th percentiles.
For example, in the sorted data set {1,2,3,3,5,6,7,9}, the 25th percentile is 2.5, the 75th
percentile is 6.5, and the IQR is 4.
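The same quartiles can be reproduced with pandas; note that software offers several interpolation
conventions, and 'midpoint' matches the values quoted above (a sketch, not the textbook's code):

import pandas as pd

x = pd.Series([1, 2, 3, 3, 5, 6, 7, 9])
q25 = x.quantile(0.25, interpolation='midpoint')  # 2.5
q75 = x.quantile(0.75, interpolation='midpoint')  # 6.5
print(q75 - q25)                                  # IQR = 4.0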
For large datasets, exact percentile calculation (which requires sorting) can be slow.
Instead, approximation algorithms (e.g., Zhang-Wang-2007) are used in software to compute
percentiles efficiently and accurately.

Example: Variability Estimates of State Population


Table 1-3. A few rows of the data.frame state of population and murder rate by state:

Using R’s built-in functions for the standard deviation, the interquartile range (IQR), and the
median absolute deviation from the median (MAD), we can compute estimates of variability for
the state population data:
> sd(state[['Population']])
[1] 6848235
> IQR(state[['Population']])
[1] 4847308
> mad(state[['Population']])
[1] 3849870

The pandas data frame provides methods for calculating standard deviation and quantiles. Using
the quantiles, we can easily determine the IQR. For the robust MAD, we use the function
robust.scale.mad from the statsmodels package:
from statsmodels import robust

state['Population'].std()
state['Population'].quantile(0.75) - state['Population'].quantile(0.25)
robust.scale.mad(state['Population'])


1.5 Exploring the Data Distribution


 Boxplot
A plot introduced by Tukey as a quick way to visualize the distribution of data.
Synonym: box and whiskers plot
 Frequency table
A tally of the count of numeric data values that fall into a set of intervals (bins).
 Histogram
A plot of the frequency table with the bins on the x-axis and the count (or proportion) on the
y-axis.
 Density plot
A smoothed version of the histogram, often based on a kernel density estimate

Percentiles and Boxplots


Percentiles not only help measure data spread but also summarize the entire distribution,
including extreme values (tails). Commonly reported percentiles include:
 Quartiles: 25th, 50th (median), and 75th percentiles
 Deciles: 10th, 20th, ..., 90th percentiles
Percentiles are especially useful for describing tail behavior, as seen in terms like "one-
percenters", which refer to individuals in the top 1% (99th percentile) of wealth.
In practice:
 In R, percentiles can be computed with:
quantile(state[['Murder.Rate']], p=c(.05, .25, .5, .75, .95))
 In Python (pandas):
state['Murder.Rate'].quantile([0.05, 0.25, 0.5, 0.75, 0.95])
Example output of murder rate percentiles:

The median murder rate across states is 4 per 100,000 people, but there's considerable
variability—from 1.6 (5th percentile) to 6.51 (95th percentile).
To visualize distribution and spread, especially using percentiles, we use boxplots,
introduced by Tukey (1977).


Boxplots provide a compact summary of data, showing the median, quartiles, and potential
outliers.
A boxplot provides a clear visual summary of a dataset's distribution, including its central
tendency, spread, and outliers. For example, in the boxplot of state populations:
 The median is around 5 million.
 Half of the states have populations between ~2 million and ~7 million.
 Some states are high-population outliers.

Boxplot components:
 Box edges = 25th and 75th percentiles (Q1 and Q3).
 Line inside the box = median (50th percentile).
 Whiskers = extend up to 1.5 × IQR beyond the box, stopping at the most extreme non-outlier
values.
 Points outside whiskers = outliers, shown as individual dots or circles.
This structure helps identify skewness, spread, and extremes in the data.
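
In Python, a boxplot of the state populations can be drawn directly from the data frame; a minimal
sketch assuming the state.csv file used earlier in these notes:

import pandas as pd
import matplotlib.pyplot as plt

state = pd.read_csv('state.csv')
ax = (state['Population'] / 1_000_000).plot.box()   # scale to millions for readability
ax.set_ylabel('Population (millions)')
plt.show()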

Frequency Tables and Histograms


A frequency table divides a variable's range into equal-sized intervals and counts how many
data values fall into each. This helps visualize the distribution of values across the range.
binnedPopulation = pd.cut(state['Population'], 10)
binnedPopulation.value_counts()
These steps create 10 equal-width bins and count the number of states in each bin. This table
provides a discrete summary of how the population is distributed across states.

When creating a frequency table, the range of the data (in this case, U.S. state populations from
563,626 to 37,253,956) is divided into equal-sized bins—e.g., 10 bins, each about 3.67 million
wide.
 Example:
o First bin: 563,626 to 4,232,658
o Top bin: 33,584,923 to 37,253,956 (contains only California)


o Two bins below California are empty, until we reach Texas


Key insights:
 Empty bins are informative and should be displayed—they indicate gaps in the data.
 Bin size matters:
o Too large → important distribution details may be hidden.
o Too small → the view becomes overly detailed and less interpretable.
 It's often helpful to experiment with different bin sizes to strike a balance between
detail and clarity.

A histogram visualizes the distribution of numerical data by grouping values into equal-width
bins and plotting the frequency (count) on the y-axis.
Key Characteristics of Histograms:
 Bins are of equal width.
 Empty bins are shown to reflect gaps in the data.
 Number of bins is user-defined and affects the clarity/detail of the visualization.
 Bars are contiguous—they touch each other unless a bin has no data.

Figure: Histogram of state populations


Statistical Moments:
In statistics, moments describe key characteristics of a data distribution:
 1st moment (Location): Measures central tendency (e.g., mean).
 2nd moment (Variability): Measures spread (e.g., variance, standard deviation).
 3rd moment (Skewness): Indicates if the distribution is asymmetric, skewed toward
larger or smaller values.
 4th moment (Kurtosis): Reflects the likelihood of extreme values or heaviness of tails.


Density Plots and Estimates


A density plot is a smoothed version of a histogram that shows the distribution of data as a
continuous curve. Unlike histograms, density plots are created using kernel density estimation
(KDE), which directly computes a smooth approximation of the data distribution.
Key Points:
 Density plots help visualize underlying patterns without the rigidity of bin edges.
 They are useful for identifying peaks, spread, and skewness in the data.
 Often, a density curve is overlaid on a histogram for comparison.

A key difference between a density plot and a histogram is the y-axis scale:
 Histogram: y-axis shows counts (number of data points in each bin).
 Density plot: y-axis shows density values rather than counts, scaled so that the total area
under the curve equals 1.
 In a density plot, the area between two x-values represents the proportion of data in
that interval.
 This makes density plots better for comparing distributions, as they are normalized,
regardless of sample size.

Density plots provide a more nuanced view of distribution shape while maintaining
proportional accuracy.
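
A histogram with an overlaid density curve can be produced from a pandas column; a minimal sketch
assuming the state.csv data and its Murder.Rate column (the density method requires SciPy):

import pandas as pd
import matplotlib.pyplot as plt

state = pd.read_csv('state.csv')
ax = state['Murder.Rate'].plot.hist(density=True, bins=10)   # histogram on a density scale
state['Murder.Rate'].plot.density(ax=ax)                     # kernel density estimate on top
ax.set_xlabel('Murder rate (per 100,000)')
plt.show()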


1.6 Exploring Binary and Categorical Data


Simple proportions or percentages tell the story of the data.

Getting a summary of a binary variable or a categorical variable with a few categories is a fairly
easy matter: we just figure out the proportion of 1s, or the proportions of the important categories.
For example, Table 1-6 shows the percentage of delayed flights by the cause of delay at
Dallas/Fort Worth Airport since 2010. Delays are categorized as being due to factors under
carrier control, air traffic control (ATC) system delays, weather, security, or a late inbound
aircraft.

Bar charts, seen often in the popular press, are a common visual tool for displaying a single
categorical variable. Categories are listed on the x-axis, and frequencies or proportions on the
y-axis. The figure shows the airport delays per year by cause for Dallas/Fort Worth (DFW).

Bar charts display separate categories with spaced bars, while histograms show continuous
numeric data with adjacent bars. Pie charts are often avoided by experts for being less effective
than bar charts.
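
A bar chart of delay causes can be drawn from a simple pandas Series; the numbers below are
illustrative placeholders, not the actual Table 1-6 figures:

import pandas as pd
import matplotlib.pyplot as plt

delays = pd.Series({'Carrier': 23, 'ATC': 30, 'Weather': 4, 'Security': 1, 'Inbound': 42})
ax = delays.plot.bar(legend=False)
ax.set_ylabel('Share of delays (%)')
plt.show()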
Mode
 The mode is the most frequently occurring value in a dataset.
 It is mainly used for summarizing categorical data, not typically numeric data.


Expected Value
A special type of categorical data involves categories that can be mapped to discrete values on
the same scale.
For instance, a cloud technology marketer offers two service tiers: $300/month and $50/month.
From webinar leads, 5% subscribe to the $300 plan, 15% to the $50 plan, and 80% opt out.
This data can be financially summarized using expected value—a weighted mean where
outcomes are multiplied by their probabilities and summed.
In this case, the expected value per attendee is:
EV = (0.05 × 300) + (0.15 × 50) + (0.80 × 0) = $22.50/month
Expected value reflects future outcomes weighted by likelihood and is key in business valuation
and capital budgeting, such as estimating future profits or cost savings.
Probability
The concept of probability is commonly encountered in everyday contexts like weather forecasts
and sports, often expressed as odds. These odds can be converted into probabilities (e.g., 2 to 1
odds equals a 2/3 probability).
While defining probability can lead to deep philosophical debates, for practical purposes, it can
be understood as the proportion of times an event would occur if repeated infinitely—a useful,
operational view of probability.

1.7 Correlation
In exploratory data analysis, a key step is examining correlations between variables. Two
variables are positively correlated if high values of one tend to align with high values of the
other, and low with low. They are negatively correlated if high values of one align with low
values of the other. This helps in understanding relationships among predictors and the target
variable in modeling projects.
Consider these two variables, perfectly correlated in the sense that each goes from low to high:
v1: {1, 2, 3}
v2: {4, 5, 6}


When two variables are perfectly correlated (e.g., both increasing), their sum of products (like
1·4 + 2·5 + 3·6 = 32) reaches a maximum. Shuffling one variable lowers this sum, forming the
basis of a permutation test. However, this raw sum isn’t very interpretable.
A better standardized measure is Pearson’s correlation coefficient (r), which compares the
variables’ deviations from their means, scaled by their standard deviations. This value ranges
from –1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear correlation. It’s
worth noting that nonlinear associations may not be well captured by this metric.

r = Σ (xᵢ − x̄)(yᵢ − ȳ) / [(n − 1) · sx · sy], where sx and sy are the standard deviations of x and y.
The formula uses n – 1 in the denominator (instead of n) to account for degrees of freedom,
which ensures an unbiased estimate when using sample data.
The correlation coefficient measures only linear relationships, so it may not accurately
reflect associations that are nonlinear. For example, the relationship between tax rates and tax
revenue is nonlinear: revenue rises with increasing tax rates initially, but beyond a certain point,
higher rates lead to increased tax avoidance and lower revenue. In such cases, the correlation
coefficient can be misleading.

The correlation matrix table displays the relationships between daily returns of
telecommunication stocks from July 2012 to June 2015. Verizon (VZ) and AT&T (T) show the
highest correlation, while Level 3 (LVLT) has the lowest correlation with others. The diagonal
contains 1s, indicating each stock's perfect correlation with itself, and the matrix is symmetric,
with redundant values above and below the diagonal.
Table: Correlation between telecommunication stock returns

A table of correlations is commonly plotted to visually display the relationship between multiple
variables. Figure 1-6 shows the correlation between the daily returns for major exchange-traded
funds (ETFs).
Python supports the visualization of correlation matrices using heatmaps. The following code
demonstrates this using the seaborn package:
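A minimal sketch of such a heatmap, using randomly generated returns as a stand-in for the real
ETF data:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# placeholder for a table of daily returns, one column per ticker
rng = np.random.default_rng(1)
etfs = pd.DataFrame(rng.normal(size=(500, 4)), columns=['SPY', 'DIA', 'QQQ', 'GLD'])

sns.heatmap(etfs.corr(), vmin=-1, vmax=1,
            cmap=sns.diverging_palette(20, 220, as_cmap=True))
plt.show()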


The ETFs for the S&P 500 (SPY) and the Dow Jones Index (DIA) have a high correlation.
Similarly, the QQQ and the XLK, composed mostly of technology companies, are positively
correlated.
Defensive ETFs, such as those tracking gold prices (GLD), oil prices (USO), or market volatility
(VXX), tend to be weakly or negatively correlated with the other ETFs.

The ellipse’s orientation shows the direction of correlation—top right for positive, top left for
negative—while its shape and shading reflect strength: thinner and darker ellipses indicate
stronger associations.
Note: Like the mean and standard deviation, the correlation coefficient is sensitive to outliers. To address this,
software packages provide robust alternatives. For example, the R package robust uses covRob for robust
correlation estimates, and scikit-learn's sklearn.covariance module offers several such methods in Python.

Other correlation estimates: like Spearman’s rho and Kendall’s tau use ranked data,
making them robust to outliers and suitable for nonlinear relationships. While useful in small
datasets or specific hypothesis tests, Pearson’s correlation and its robust alternatives are
typically preferred for exploratory data analysis in larger datasets.


Scatterplots
The standard method to visualize the relationship between two variables is a scatterplot, where
each point represents a record with one variable on the x-axis and the other on the y-axis. For
example, plotting ATT vs. Verizon daily returns in R or Python shows a positive correlation, as
most points fall in the upper-right and lower-left quadrants, indicating the stocks often move
together. However, with 754 points, it's hard to see patterns in dense areas. Techniques like
transparency, hexagonal binning, and density plots can reveal more structure.
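
Such a scatterplot can be produced with the pandas plot.scatter method; here synthetic correlated
returns stand in for the real AT&T/Verizon data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# synthetic stand-in for a table of daily returns with columns 'T' and 'VZ'
rng = np.random.default_rng(2)
t = rng.normal(0, 0.01, 754)
telecom = pd.DataFrame({'T': t, 'VZ': 0.7 * t + rng.normal(0, 0.007, 754)})

ax = telecom.plot.scatter(x='T', y='VZ', alpha=0.4)   # transparency helps in dense regions
ax.axhline(0, color='grey', lw=1)
ax.axvline(0, color='grey', lw=1)
plt.show()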

1.8 Exploring Two or More Variables


Familiar estimators like mean and variance look at variables one at a time (univariate analysis).
Correlation analysis is an important method that compares two variables (bivariate analysis). In
this section we look at additional estimates and plots, and at more than two variables
(multivariate analysis). The appropriate type of bivariate or multivariate analysis depends on the
nature of the data: numeric versus categorical.
Hexagonal Binning and Contours (Plotting Numeric Versus Numeric Data)
Scatterplots work well for small datasets (e.g., ~750 points), such as the stock returns in above
Figure. However, for large datasets with hundreds of thousands or millions of records,
scatterplots become too dense to be useful. In such cases, alternative visualization techniques are
needed.
For example, in analyzing the kc_tax dataset (tax-assessed residential property values in King
County, Washington), outliers like extremely expensive or unusually sized homes are removed
using the subset function to better focus on the core data distribution. In pandas, we filter the
data set as follows:
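
A sketch of that filtering step, assuming the data sits in a file named kc_tax.csv.gz with columns
TaxAssessedValue and SqFtTotLiving (names follow the textbook's dataset):

import pandas as pd

kc_tax = pd.read_csv('kc_tax.csv.gz')
# drop very expensive and unusually small/large residences to focus on the core of the data
kc_tax0 = kc_tax.loc[(kc_tax.TaxAssessedValue < 750000) &
                     (kc_tax.SqFtTotLiving > 100) &
                     (kc_tax.SqFtTotLiving < 3500), :]
print(kc_tax0.shape)   # number of remaining records and columns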


The figure uses a hexagonal binning plot to show the relationship between finished square feet
and tax-assessed value of homes in King County. Unlike scatterplots, which become unreadable
with dense data, this method groups data into hexagons and uses color to represent density. The
plot clearly shows a positive correlation between square footage and value. Notably, it also
reveals subtle patterns—such as bands indicating homes with similar square footage but higher
tax values.
Fig: Hexagonal binning for tax-assessed value versus finished square feet
This visualization was created using the ggplot2 package in R, a powerful tool for advanced
exploratory data analysis.
In Python, hexagonal binning plots are readily available using the pandas data frame method
hexbin:
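
For example, continuing with the filtered kc_tax0 frame and the column names assumed above:

ax = kc_tax0.plot.hexbin(x='SqFtTotLiving', y='TaxAssessedValue', gridsize=30)
ax.set_xlabel('Finished Square Feet')
ax.set_ylabel('Tax-Assessed Value')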

The figure overlays contours on a scatterplot to show data density between two numeric
variables, like a topographic map. Each contour band indicates increasing point density toward a
peak. Similar to the hexagonal binning figure, it reveals a main cluster and a secondary peak,
highlighting areas of concentrated data.
Fig: Contour plot for tax-assessed value versus finished square feet
The seaborn kdeplot function in Python creates a contour plot:
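
For example, again assuming the kc_tax0 frame and column names from the earlier sketch (the
data=/x=/y= form requires seaborn 0.11 or later):

import seaborn as sns

ax = sns.kdeplot(data=kc_tax0, x='SqFtTotLiving', y='TaxAssessedValue')
ax.set_xlabel('Finished Square Feet')
ax.set_ylabel('Tax-Assessed Value')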

Charts like heat maps, hexbin plots, and contour plots help show how two numeric variables
relate by displaying data density. They are similar to histograms and density plots but work in
two dimensions.


Two Categorical Variables


In Statistical Machine Learning, analyzing two categorical variables is often done to
understand relationships or associations between them. A contingency table (also called a cross-
tabulation) is a common tool used to summarize and analyze such relationships.

Example: Loan Approval and Employment Type


Let’s say we’re studying the relationship between:
 Loan Approval Status (Approved, Rejected)
 Employment Type (Salaried, Self-Employed, Unemployed)
Contingency Table

Employment Type    Approved   Rejected   Total
Salaried              120         30       150
Self-Employed          50         40        90
Unemployed             10         50        60
Total                 180        120       300

A contingency table is a useful tool to summarize the relationship between two categorical
variables. In the textbook's example, it shows the distribution of personal loan grades (from A to
G) against loan outcomes (fully paid, current, late, or charged off), based on Lending Club data.
The table includes counts and row percentages, revealing that high-grade loans (e.g., A) have
significantly lower rates of late payments or charge-offs compared to lower-grade loans.
Contingency tables can display simple counts or also include column percentages and overall
totals for deeper insights. They are commonly created using pivot tables in Excel.
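
In Python, a contingency table like the employment example above can be built with pd.crosstab;
the record-level data here is reconstructed only to match the counts in that table:

import pandas as pd

records = pd.DataFrame({
    'employment': ['Salaried'] * 150 + ['Self-Employed'] * 90 + ['Unemployed'] * 60,
    'loan_status': (['Approved'] * 120 + ['Rejected'] * 30 +
                    ['Approved'] * 50 + ['Rejected'] * 40 +
                    ['Approved'] * 10 + ['Rejected'] * 50),
})
table = pd.crosstab(records['employment'], records['loan_status'], margins=True)
print(table)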


Categorical and Numeric Data



Boxplots are a straightforward way to visually compare the distribution of a numeric variable
across categories of a categorical variable. For instance, to examine how the percentage of flight
delays (within the airline's control) varies by airline, a boxplot can be used. In R, this can be
done with:
boxplot(pct_carrier_delay ~ airline, data=airline_stats, ylim=c(0, 50))


In Python (using pandas), a similar plot is created with:
import matplotlib.pyplot as plt

ax = airline_stats.boxplot(by='airline', column='pct_carrier_delay')
ax.set_xlabel('')
ax.set_ylabel('Daily % of Delayed Flights')
plt.suptitle('')

Alaska stands out as having the fewest delays, while American has the most delays: the lower
quartile for American is higher than the upper quartile for Alaska.


A violin plot, introduced by Hintze and Nelson (1998), enhances the traditional boxplot by
including a mirrored density plot, creating a violin-like shape. This allows it to display detailed
distribution patterns that boxplots might miss. While violin plots reveal nuances in data density,
boxplots are better at clearly identifying outliers.

In R:
ggplot(data=airline_stats, aes(airline, pct_carrier_delay)) +
  ylim(0, 50) +
  geom_violin() +
  labs(x='', y='Daily % of Delayed Flights')

In Python (using seaborn):
import seaborn as sns

ax = sns.violinplot(airline_stats.airline, airline_stats.pct_carrier_delay,
                    inner='quartile', color='white')
ax.set_xlabel('')
ax.set_ylabel('Daily % of Delayed Flights')

For example, violin plots reveal a strong concentration of low delays for Alaska Airlines, which
is less apparent in the boxplot. You can combine both plots (e.g., using geom_boxplot() with
geom_violin()) to get the benefits of both visualizations, especially with the help of color for
clarity.

Visualizing Multiple Variables


Charts like scatterplots, hexagonal binning, and boxplots can be extended to more than two
variables using conditioning—plotting subsets of data based on a third variable. For example,
in Figure 1-8, a scatterplot showed clusters in the relationship between finished square feet and
tax-assessed home values. By conditioning on zip code (as in Figure below), the data reveals
that higher values per square foot occur in specific zip codes (e.g., 98105, 98126), while others
(e.g., 98108, 98188) have lower values. This explains the clustering seen earlier and highlights
how location influences property assessments.
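
A conditioned plot of this kind can be sketched with seaborn's FacetGrid, following the textbook's
approach and assuming the kc_tax0 frame from earlier with a ZipCode column:

import seaborn as sns
import matplotlib.pyplot as plt

zips = [98188, 98105, 98108, 98126]                       # zip codes discussed above
kc_tax_zip = kc_tax0.loc[kc_tax0.ZipCode.isin(zips), :]

def hexbin(x, y, color, **kwargs):
    cmap = sns.light_palette(color, as_cmap=True)
    plt.hexbin(x, y, gridsize=25, cmap=cmap, **kwargs)

g = sns.FacetGrid(kc_tax_zip, col='ZipCode', col_wrap=2)  # one panel per zip code
g.map(hexbin, 'SqFtTotLiving', 'TaxAssessedValue')
plt.show()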


Extra (for your reference): Understanding Correlation


1. Sum of products (raw idea of correlation)

o If you have two variables, say X = [1, 2, 3] and Y = [4, 5, 6], and both move together
(when one increases, the other also increases), then multiplying corresponding values
and adding them up gives a big number:

1⋅4+2⋅5+3⋅6=32

o This number is large because the patterns of the two variables match well.

2. Shuffling breaks the pattern

o If you randomly shuffle Y, like Y = [6, 4, 5], then:

1⋅6+2⋅4+3⋅5=29

o The sum is smaller now because the perfect alignment is broken.

o This idea—comparing the actual sum to sums from shuffled data—is the basis of a
permutation test (used in statistics to test whether correlation is meaningful or just by
chance).

3. Problem with raw sums

o The sum itself (like 32 or 29) doesn’t mean much because it depends on the scale of
the numbers. For bigger numbers, the sum is automatically bigger, even if the
relationship is the same.

4. Pearson’s correlation coefficient (r)

o To make the measure standardized and comparable, we calculate Pearson’s r.

o Instead of raw values, it looks at how much each value deviates from its mean, and
then scales by their standard deviations.


o Formula (conceptually):

r = (sum of standardized products) / (number of data points)

o This normalization makes the result fall between –1 and +1:

 +1 → perfect positive linear relationship (both go up together).

 –1 → perfect negative linear relationship (one goes up, the other goes down).

 0 → no linear relationship.

5. Limitation

o Pearson’s r only measures linear relationships. If the data follows a curved pattern
(like a U-shape), r might be close to 0 even though there is a strong nonlinear
association.

 Left plot: X and Y rise together perfectly, so Pearson's correlation r = 1.00.

 Right plot: After shuffling Y, the pattern breaks, and the correlation drops close to 0.

This shows why the raw sum of products decreases after shuffling, and why Pearson’s r is a better
standardized way to measure correlation.
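
The whole argument fits in a few lines of NumPy (a sketch):

import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(np.sum(x * y))               # 32: aligned values give the largest sum of products

rng = np.random.default_rng(0)
shuffled = [np.sum(x * rng.permutation(y)) for _ in range(1000)]
print(np.mean(shuffled))           # shuffled sums are smaller on average (permutation-test idea)
print(np.corrcoef(x, y)[0, 1])     # Pearson's r = 1.0, the standardized version of the same idea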

1. Pearson's correlation

 Measures linear relationships.

 Sensitive to outliers (a single extreme point can drastically change r).

 Works best when data is continuous and roughly linear.

But what if the relationship is nonlinear or if outliers are present?


That’s where other measures come in.


2. Spearman’s rho (ρ)

 Instead of using raw data, it converts the values into ranks.


Example: [10,20,30] → ranks [1,2,3].

 Then, Pearson’s correlation formula is applied on these ranks.

 Advantage:

o Robust to outliers (since only rank matters, not actual value).

o Captures monotonic relationships (as long as one variable goes up when the
other goes up, even if it’s curved).

Example: If Y increases with X but nonlinearly (say Y = X³), Pearson's r will be less than 1, but
Spearman's ρ will still be 1 because bigger X always means bigger Y.

3. Kendall’s tau (τ)

 Also rank-based, but works differently:

o Looks at pairs of data points.

o Counts how many pairs are in the same order (concordant) vs. in opposite order
(discordant).

 Advantage:

o More interpretable (as a probability of agreement between rankings).

o Very robust in small datasets.

4. When to use which?

 Pearson’s r → Default choice for linear, continuous data (large datasets).

 Spearman's rho → Better when the relationship is monotonic but nonlinear, or when outliers
exist.

 Kendall’s tau → Good for small samples or when data is ordinal (rank-based by nature).
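
A short comparison of the three coefficients on a monotonic but nonlinear relationship (a sketch
using SciPy and pandas):

import pandas as pd
from scipy.stats import kendalltau, pearsonr, spearmanr

x = pd.Series(range(1, 11))
y = x ** 3                          # monotonic but nonlinear

print(pearsonr(x, y)[0])            # below 1: a straight line is not a perfect fit
print(spearmanr(x, y)[0])           # 1.0: the ranks agree exactly
print(kendalltau(x, y)[0])          # 1.0: every pair of points is concordant
print(x.corr(y, method='spearman')) # the same rank correlation via pandas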
