Data Analysis
The purpose of data analysis is to:
•Produce descriptive statistics to summarize the data.
•Create graphics that help to visualize the data.
•Use inferential statistics to distinguish between significant and non-significant effects.
•Create predictive models that can be used to predict future results within a given experimental domain.
Descriptive statistics can be categorized into two groups:
1. Measures of centrality.
2. Measures of variation.
Measure of centrality | Advantage | Disadvantage | Formula
Arithmetic mean | Can be used for inferential statistics | Sensitive to outliers | (x1 + x2 + ... + xn) / n
Geometric mean | Damps the effect of outliers; useful for changing data | Cannot be used for inferential statistics | (x1 * x2 * ... * xn)^(1/n)
Harmonic mean | Damps the effect of outliers; useful for rates and ratios | Cannot be used for inferential statistics | n / (1/x1 + 1/x2 + ... + 1/xn)
Median | Insensitive to outliers | Insensitive to the distribution of data | Exact center of the distribution
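As a rough illustration of how the four measures of centrality react to an outlier, the sketch below uses Python's standard `statistics` module. The data values are hypothetical and chosen only so that the last point is an obvious outlier.

```python
import statistics

# Hypothetical measurements; the last value is a deliberate outlier.
data = [10.0, 11.0, 12.0, 13.0, 100.0]

summary = {
    "arithmetic mean": statistics.mean(data),
    "geometric mean": statistics.geometric_mean(data),
    "harmonic mean": statistics.harmonic_mean(data),
    "median": statistics.median(data),
}

# The arithmetic mean is pulled furthest towards the outlier,
# the geometric and harmonic means less so, and the median not at all.
for name, value in summary.items():
    print(f"{name:16s} {value:8.2f}")
```

With this data set the ordering is median < harmonic mean < geometric mean < arithmetic mean, which matches the sensitivity to outliers described in the table.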
Measure of dispersion | Advantage | Disadvantage | Formula
Standard deviation (s) | Very useful parameter; properties are well known | Not additive | sqrt( sum((xi - mean)^2) / (n - 1) )
Variance | Useful parameter; the variance is additive | Squares the true dispersion (expressed in squared units) | s^2
Relative STD | Useful when comparing dissimilar data sets | Cannot be used for statistical inference | s / mean
Standard error | Used when calculating uncertainties | Not additive | s / sqrt(n)
Range | Simple to calculate | Based on only two data points | max - min
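The measures of dispersion can be computed in a few lines. The following is a minimal sketch using only the Python standard library. The replicate values and the two component standard deviations are hypothetical, and the sample (n - 1) definitions are assumed.

```python
import math
import statistics

# Hypothetical replicate measurements.
data = [4.8, 5.1, 5.0, 4.9, 5.2]

n = len(data)
mean = statistics.mean(data)
std = statistics.stdev(data)          # sample standard deviation (n - 1)
variance = statistics.variance(data)  # sample variance, equal to std ** 2
rsd_percent = 100 * std / mean        # relative STD, here as a percentage
std_error = std / math.sqrt(n)        # standard error of the mean
data_range = max(data) - min(data)

# Variances of independent error sources add; standard deviations do not.
# Hypothetical components with standard deviations 3 and 4:
combined_std = math.sqrt(3.0 ** 2 + 4.0 ** 2)  # 5.0, not 3 + 4 = 7

print(f"std={std:.4f} var={variance:.4f} RSD={rsd_percent:.2f}% "
      f"SE={std_error:.4f} range={data_range:.2f}")
```

The last calculation shows why the variance is called additive while the standard deviation is not: independent contributions are combined by summing variances and taking the square root at the end.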
Types of Variables
Type of variable | Definition | Examples
Continuous | Variable that can take on any value between two specified limits | Mass, concentration, temperature
Nominal | Categorical variable in which there is no order | Type of catalyst, method of analysis, binary variable (pass/fail)
Ordinal | Ordered categorical variable | Rating scale, diagnosis
For many methods of data analysis, it is important to identify the independent variables (factors) and the dependent variable (response).
Exploratory Data Analysis (EDA)
EDA is used for the following purposes:
•To help the researcher formulate relevant hypotheses.
•To suggest the appropriate statistical tools to analyze the data.
Many EDA techniques involve graphical displays of the data such as:
•Histograms,
•Box and whisker plots,
•Pareto charts,
•Stem-and-leaf plots,
•Multi-vari charts.
Example
[Figure: histogram of Yield (g) against No of obs, and box plot of Yield (g)]
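The numbers behind a histogram and a box plot can be sketched without any plotting library. The example below assumes hypothetical yield values, a bin width of 1 g, and the "inclusive" quartile convention of Python's `statistics.quantiles`. Other software may use a slightly different quartile definition.

```python
import statistics
from collections import Counter

# Hypothetical yields in grams.
yields = [2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.3, 5.9, 6.4, 7.7, 9.5]

# Five-number summary: the quantities a box plot displays.
q1, med, q3 = statistics.quantiles(yields, n=4, method="inclusive")
five_number = (min(yields), q1, med, q3, max(yields))
print("min, Q1, median, Q3, max:", five_number)

# Histogram counts with bins of width 1 g, drawn as text.
counts = Counter(int(y) for y in yields)
for lower in sorted(counts):
    print(f"{lower}-{lower + 1} g: {'#' * counts[lower]}")
```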
Example 2: Box plots and correlation matrix of IQ and 4 test marks (2000 students)
[Figure: box-and-whisker plot of IQ, T1, T2, T3 and T4]
Correlation matrix
   | T1   | T2   | T3   | T4
IQ | 0.51 | 0.82 | 0.02 | 0.52
T1 |      | 0.42 | 0.03 | 0.60
T2 |      |      | 0.04 | 0.55
T3 |      |      |      | 0.02
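A correlation matrix like the one above is built from pairwise Pearson correlation coefficients. The sketch below shows the calculation on a small hypothetical data set (five students, three variables). It does not reproduce the 2000-student data, which is not given here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores; names follow the slide, values are illustrative only.
scores = {
    "IQ": [95, 100, 105, 110, 120],
    "T1": [50, 55, 60, 58, 70],
    "T2": [40, 48, 52, 60, 71],
}

names = list(scores)
matrix = {(a, b): pearson(scores[a], scores[b]) for a in names for b in names}

# Print the upper triangle only, as in the correlation matrix above.
for i, a in enumerate(names):
    cells = [f"{matrix[(a, b)]:.2f}" for b in names[i + 1:]]
    if cells:
        print(a, *cells)
```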
Other EDA techniques:
• Cluster Analysis: collects “similar” variables in clusters.
• Principal Component Analysis: reduces the number of independent variables to the essential variables.
• Factor Analysis: used to detect the relationship between variables.
• Discriminant Analysis: used to detect variables which discriminate between naturally occurring groups.
• Categorical Data Analysis: studies the relationship between nominal and ordinal variables.
Example: Cluster Analysis
[Figure: cluster diagram of four tests, shown in the order Test 1, Test 4, Test 2, Test 3; horizontal axis: linkage distance, 400 to 1400]
Example: Categorical Data Analysis
Contingency table (columns: diagnosis)
Treatment | No improvement | Little improvement | Good improvement
A | 12* | 25 | 30
B | 4 | 7 | 8
C | 34 | 35 | 36
* The numbers in the cells are patient counts.
From this contingency table, we can determine, by
performing a chi-squared test, whether there is a significant
difference between the treatments.
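The chi-squared test on this contingency table can be sketched as follows, using the patient counts from the table and only the Python standard library. The helper `chi2_sf_even_dof` is a hypothetical name introduced here. It relies on a closed form of the chi-squared tail probability that holds only for an even number of degrees of freedom, which is the case for a 3 x 3 table.

```python
import math

# Patient counts from the contingency table:
# rows = treatments A, B, C; columns = no / little / good improvement.
observed = [
    [12, 25, 30],
    [4, 7, 8],
    [34, 35, 36],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)

def chi2_sf_even_dof(x, dof):
    """Upper-tail chi-squared probability; valid for even dof only."""
    if dof % 2 != 0:
        raise ValueError("closed form used here requires an even dof")
    half = x / 2
    return math.exp(-half) * sum(
        half ** k / math.factorial(k) for k in range(dof // 2)
    )

p_value = chi2_sf_even_dof(chi2, dof)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2f}")
```

For these counts the statistic is about 4.9 with 4 degrees of freedom, below the 5 % critical value of 9.488, so this sketch would not indicate a significant difference between the treatments. Note that one expected count is just under 5, so the chi-squared approximation should be read with some caution.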
Statistical Inference:
Estimating the parameters of a population from the
statistics of a representative sample.
Examples:
Statistic (from sample) | Parameter (of population)
Sample mean: X̄ | Population mean: μ
Sample STD: S | Population STD: σ
Sample proportion: p | Population proportion: ρ
The following statement always applies:
Measurement = Parameter ± Experimental error
• Parameters can only be estimated within a calculated uncertainty.
• Whenever an estimated parameter is given, the uncertainty associated with it must be given as well.
• The actual calculation of the uncertainty depends on the distribution of the data.
• The uncertainty can be visualized by using error bars.
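One common way to report an estimate together with its uncertainty is the sample mean plus or minus a confidence half-width. The sketch below assumes normally distributed data and uses a normal (z) quantile from the standard library. For small samples a Student t quantile would give a somewhat wider interval. The measurement values are hypothetical.

```python
import math
import statistics

# Hypothetical repeated measurements of one quantity.
measurements = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7]

n = len(measurements)
mean = statistics.mean(measurements)
std_error = statistics.stdev(measurements) / math.sqrt(n)

# Assumption: normal approximation, two-sided 95 % interval.
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * std_error
lower, upper = mean - half_width, mean + half_width

# The half-width is also the length of the error bar on a plot.
print(f"estimate: {mean:.2f} ± {half_width:.2f}  ({lower:.2f} to {upper:.2f})")
```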
Analysis wanted | Methods available
Compare 2 independent samples | t-test for normal data; Mann-Whitney test for non-normal data
Compare 2 related samples | Paired t-test
Compare n (n > 2) independent samples | ANOVA for normal data; Kruskal-Wallis ANOVA for non-normal data (Friedman ANOVA applies to related samples)
Compare trends | Regression with indicator variables
Detect the effects of factors on a response | Multiple regression
Find the levels of the factors for which maximum and/or minimum responses are achieved | Response surface modeling
Definition: Significant effect = an effect that cannot be attributed to experimental error alone.
Whether or not an effect is significant is decided by using p-values.
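As an illustration of the first row of the methods table, the sketch below runs a two-sample t-test on two independent samples. It assumes normal data with equal variances (the pooled form of the test) and uses hypothetical yield values. Because the standard library has no Student t distribution, the decision is made against the tabulated two-sided 5 % critical value for 8 degrees of freedom, which is equivalent to checking whether p < 0.05.

```python
import math
import statistics

# Hypothetical yields (g) from two independent samples.
sample_a = [12.1, 12.0, 12.5, 12.9, 12.3]
sample_b = [11.2, 11.5, 10.9, 11.9, 11.4]

n_a, n_b = len(sample_a), len(sample_b)
mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)

# Pooled variance: assumes both samples are normal with equal variances.
pooled_var = (
    (n_a - 1) * statistics.variance(sample_a)
    + (n_b - 1) * statistics.variance(sample_b)
) / (n_a + n_b - 2)

t_stat = mean_diff / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
dof = n_a + n_b - 2

# Tabulated two-sided 5 % critical value of Student's t for 8 dof.
T_CRIT_8_DOF = 2.306
significant = abs(t_stat) > T_CRIT_8_DOF  # same decision as p < 0.05

print(f"t = {t_stat:.2f}, dof = {dof}, significant at 5 %: {significant}")
```

The critical value is hard-coded for these sample sizes. With other sample sizes, the value for the corresponding degrees of freedom would have to be looked up or computed with a statistics package.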