INTRODUCTION TO STATISTICS
Fabio Morellini
Behavioral Biology Group, ZMNH
(and not the Biostatistics Department!)
Structure of the course
• What is statistics? Why statistics?
• Designing an experiment
• Type I and type II errors, power
• Statistical software
• GraphPad Prism
• (Statistica)
I do not need statistics
I use statistics when I have data to analyze
I constantly use statistics for my research
I know what these terms indicate:
Nominal, ordinal, interval, ratio measurements
Median
Percentile
Parametric
Null hypothesis
Mann-Whitney test
"P" value
Structure of the course
• What is statistics? Why statistics?
"Two men sit in a tavern. One eats a whole knuckle of veal, the
other drinks two liters of beer. From a statistical point of
view, each had a liter of beer and half a knuckle, but one has
overeaten and the other is totally drunk." Franz Josef Strauß
"A few years ago, every tenth driver was speeding;
today it is 'only' every fifth."
✓Collecting data
✓Summarizing the data
✓Representing the data in meaningful ways
✓Determining whether our data show a pattern different
from chance or not
✓Interpreting the data
STATISTICS provides conventional expressions for an
international audience
Reproducible inference
Even if you can't find a source
of demonstrable bias, allow
yourself some degree of
skepticism about the results
as long as there is a possibility
of bias somewhere.
There always is.
SCIENTIFIC (inductive) METHOD
1. Observe some aspects of the universe, "free from bias"
9. Go to step 3.
1. Statistics gives no information about the real
world.
2. Statistics "only" describes, summarizes and
eventually interprets observations
3. Statistics is as good (or as bad) as the
observations
4. Statistics is useful only if correctly used and
described
Which of the following are part of statistics?
O Numerical calculations
O Graphs
• Descriptive statistics
Used to summarize and represent results
• Central tendency
• Variability
• Inferential statistics
Used to determine whether relationships or differences
within and between samples are not caused by chance
(“statistically significant”)
DIFFERENT TYPES OF DATA
Four Levels of Measurement
• Nominal level: eye color, gender, religious affiliation
• Ordinal level: grades (1, 2, 3, 4, 5, 6); steak: rare, mid-rare, medium, well
• Interval level: temperature on the Fahrenheit scale, intelligence quotient (IQ)
• Ratio level: body weight, length
✓Histograms
✓Central tendency
✓Level of dispersion
✓Normal distribution
CENTRAL TENDENCY AND LEVEL OF DISPERSION
CENTRAL TENDENCY
– Mean – average
MEAN
Simply add up all of the scores and divide by the number in
the sample.
Sum of scores / number of scores
15 + 12 + 23 + 30 + 22 + 5 = 107; 107 / 6 = 17.83
Advantages
– Summarizes data in a way that is easy to understand
Disadvantages
– Affected by extreme values
• Example: Average salary at a company:
12,000; 12,000; 12,000; 12,000; 12,000; 12,000; 12,000;
12,000; 12,000; 12,000; 20,000; 390,000 → Mean = 44,167
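The effect of the extreme value in the salary example can be checked in a few lines. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean

# Salaries from the example: ten at 12,000, one at 20,000, one at 390,000
salaries = [12_000] * 10 + [20_000, 390_000]

mean_all = mean(salaries)                   # includes the extreme value
mean_without_outlier = mean(salaries[:-1])  # drops the 390,000 salary

print(f"Mean with the 390,000 salary:    {mean_all:,.0f}")              # 44,167
print(f"Mean without the 390,000 salary: {mean_without_outlier:,.0f}")  # 12,727
```

A single salary moves the mean from about 12,700 to about 44,200, although eleven of the twelve employees earn 20,000 or less.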
MEDIAN
The middle score in the data: half the scores are above
it, half of the scores are below it.
50 56 66 68 70 72 76 76 76 78 78 78 78 80 80 86 86 86 88 96 98 100 100
Median = 78
-If there is an even number of scores, the median is the average of the
two middle scores.
Example: 10, 10, 9, 9 Median = 9.5
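The rule for the median (middle score, or the average of the two middle scores) can be written out as a short sketch. The function below is an illustration written for these notes; Python's standard library also offers `statistics.median`.

```python
def median(scores):
    """Return the middle score; average the two middle scores if the count is even."""
    ordered = sorted(scores)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

scores = [50, 56, 66, 68, 70, 72, 76, 76, 76, 78, 78, 78, 78, 80, 80,
          86, 86, 86, 88, 96, 98, 100, 100]

print(median(scores))          # 78
print(median([10, 10, 9, 9]))  # 9.5
```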
Advantages
– Not affected by extreme values
– Easy to compute
Disadvantages
– Doesn't use all of the data values
10, 20, 30, 40, 60, 70, 80, 90 → MEAN (and MEDIAN) = 50
50, 50, 50, 50, 50, 50, 50, 50 → MEAN (and MEDIAN) = 50
LEVEL OF DISPERSION
– Range
– Variance (variability around the mean)
– Standard deviation (average variability)
• Range
– the simplest variability statistic = highest score – lowest score.
VARIANCE
10, 20, 30, 40, 60, 70, 80, 90 → MEAN = 50
$$s^2 = \frac{(10-50)^2 + (20-50)^2 + \dots + (90-50)^2}{8-1} = \frac{6000}{7} \approx 857.1$$
VARIANCE & STANDARD DEVIATION
VARIANCE
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Standard Deviation
$$SD = \sqrt{s^2}$$
$$SEM = \frac{SD}{\sqrt{n}}$$
IF THE SD IS UNCHANGED,
THE HIGHER THE SAMPLE SIZE, THE LOWER THE SEM
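The dispersion measures can be computed step by step. This is a minimal sketch that assumes the sample variance with an n − 1 denominator (as in the 8 − 1 example) and SEM = SD / √n; the function name `describe_dispersion` is hypothetical.

```python
import math

def describe_dispersion(scores):
    """Range, sample variance (n - 1 denominator), SD and SEM of a list of scores."""
    n = len(scores)
    mean = sum(scores) / n
    value_range = max(scores) - min(scores)
    variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
    sd = math.sqrt(variance)
    sem = sd / math.sqrt(n)
    return {"mean": mean, "range": value_range,
            "variance": variance, "sd": sd, "sem": sem}

spread = describe_dispersion([10, 20, 30, 40, 60, 70, 80, 90])
flat = describe_dispersion([50] * 8)

for name, result in (("spread", spread), ("flat", flat)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Both data sets have a mean of 50, but only the first shows any dispersion (variance ≈ 857.1, SD ≈ 29.3, SEM ≈ 10.4); for the second, all dispersion measures are 0.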
PERCENTILES
• A percentile reflects the percentage of scores that are below your data
point of interest.
[Figure: Neurons (n), with the 0, 25th, 50th (median), 75th and 100th percentiles marked]
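The definition of a percentile as "the percentage of scores below the data point of interest" translates directly into code. The sketch below follows that definition literally; the neuron counts are hypothetical values made up for illustration, and statistical packages may use slightly different interpolation conventions.

```python
def percentile_rank(scores, value):
    """Percentage of scores that lie strictly below `value`."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)

# Hypothetical neuron counts, for illustration only
neurons = [100, 150, 200, 250, 300, 350, 400, 450]

print(percentile_rank(neurons, 300))  # 50.0
print(percentile_rank(neurons, 100))  # 0.0
```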
Histogram
A frequency distribution in graphical form
Bar graph
NORMAL (GAUSSIAN) DISTRIBUTION
[Figure: normal distribution curve, annotated with mean and median (central tendency) and level of dispersion]
• The best graph is the one that makes the data clear
and exhaustive.
GRAPHICAL REPRESENTATIONS
OF CENTRAL TENDENCY AND DISPERSION
[Figure: four plots of Neurons (n)]
• With small sample size or non-normal distribution (non-parametric tests)
• Large sample size or normal distribution (parametric tests)
[Figure: Axon length (m) in groups A and B, shown in several plot styles]
Keys to making figures
The most general standards for charting data:
•Present meaningful data
•Define the data unambiguously
•Present the data efficiently
•Do not distort the data
KEEP IT SIMPLE!
DON’T “LIE”!
Two basic divisions of statistics are
• (Designing an experiment)
• Type I and type II errors, power
INFERENTIAL STATISTICS
[Diagram: the sample is a subset of the population; INFERENCE goes from the sample back to the population]
NULL HYPOTHESIS
Translation to English:
P= 1.0 → 100%
P= 0.1 → 10%
P= 0.05 → 5%
P= 0.01 → 1%
THE P VALUE
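[Figure: axon length in groups A and B, p = 0.025]
p = 0.025 means that, if the axons in groups A and B did not really differ
(null hypothesis), a difference at least as large as the one observed would
arise by chance with a probability of 2.5%. Only in this sense does it
quantify the risk of a mistake when you conclude that the length of axons
is shorter in group B than in group A.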
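One way to make the P value concrete is an exact permutation test: assume the null hypothesis (the group labels do not matter), try every possible relabeling, and count how often the difference in means is at least as large as the one observed. This is a swapped-in illustration, not the test used in the slides, and the axon lengths below are hypothetical values made up for the example.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation P value for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    at_least_as_extreme = 0
    total = 0
    # Every way of choosing which scores are labeled "group A"
    for indices in combinations(range(len(pooled)), n_a):
        chosen = set(indices)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total

# Hypothetical axon lengths, for illustration only
group_a = [150, 148, 155, 160, 152]
group_b = [130, 135, 128, 140, 133]

p = permutation_p_value(group_a, group_b)
print(f"p = {p:.4f}")  # p = 0.0079
```

Here only 2 of the 252 possible relabelings give a difference as large as the observed one, so p = 2/252 ≈ 0.008.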
– BETWEEN GROUPS
PARAMETRIC vs NON-PARAMETRIC
• Nominal level: eye color, gender, religious affiliation
• Ordinal level: grades (1, 2, 3, 4, 5, 6); steak: rare, mid-rare, medium, well
• Interval level: temperature on the Fahrenheit scale, intelligence quotient (IQ)
• Ratio level: body weight, length
PARAMETRIC vs NON-PARAMETRIC
PARAMETRIC
ASSUMPTIONS
• The variables must be measured at interval or ratio scale
• A normally distributed population
• Equal variances among the populations
ADVANTAGES
• Powerful
• Multi-factorial
• Good interpretation of the results
DISADVANTAGES
• Assumptions must be met
• Requires a relatively large sample size
NON-PARAMETRIC
ASSUMPTIONS
• Nonparametric (distribution-free) techniques make no assumptions about the shape of the population distribution
ADVANTAGES
• May be the only test when the sample size is small
• Require fewer assumptions
• The only choice when the measurement scales are nominal or ordinal (e.g. using categories or rankings)
DISADVANTAGES
• They are less powerful
• Do not allow multifactorial analysis
• They are unfamiliar to many researchers and editors
How can I test whether my data are
normally distributed with equal
variances among the population?
NORMAL (GAUSSIAN) DISTRIBUTION
[Figure: normal distribution with mean and median]
Kolmogorov-Smirnov
Lilliefors
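The idea behind the Kolmogorov-Smirnov and Lilliefors tests is to measure the largest gap between the cumulative distribution of the sample and that of a normal distribution. The sketch below computes only this distance, fitting the normal curve with the sample mean and SD (the Lilliefors variant); it does not compute a P value, which requires tables or a statistics package (for example `scipy.stats.kstest`). The two samples are artificial and built only for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def ks_distance_to_normal(sample):
    """Largest gap between the sample's empirical CDF and a normal CDF
    fitted with the sample mean and SD. No P value is computed."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    gaps = []
    for i, x in enumerate(ordered, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x
        gaps.append(max(i / n - cdf, cdf - (i - 1) / n))
    return max(gaps)

n = 20
# Artificial samples: evenly spaced quantiles of a normal and of a skewed
# (exponential) distribution
bell_shaped = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
skewed = [-math.log(1 - (i - 0.5) / n) for i in range(1, n + 1)]

d_bell = ks_distance_to_normal(bell_shaped)
d_skewed = ks_distance_to_normal(skewed)
print(f"bell-shaped sample: D = {d_bell:.3f}")
print(f"skewed sample:      D = {d_skewed:.3f}")
```

The larger the distance D, the less the data resemble a normal distribution; the skewed sample gives a clearly larger D than the bell-shaped one.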
RELATIONSHIPS BETWEEN VARIABLES
PARAMETRIC Pearson r
NON-PARAMETRIC Spearman r
• Correlation simply measures relationships!
(not causality!)
Cohen's rule of thumb for r values:
.10 = no relationship
.30 = weak relationship
.50 = moderate relationship
> .60 = strong relationship
[Figure: Rearing in the open field (n) vs. ChAT+ neurons in medial septal / diagonal band of Broca complex (n); Spearman r = -0.72 *; by genotype: WT r = -0.72 *, KO r = -0.78 *]
[Figure: Jumping (n) vs. ChAT+ neurons in medial septal / diagonal band of Broca complex (n), WT and KO; Spearman r = 0.57 *]
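The difference between the parametric Pearson r and the non-parametric Spearman r can be sketched as follows: Spearman r is simply Pearson r computed on the ranks of the data. The functions and the example data are illustrative and written for these notes, not taken from the slides.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result

def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Illustrative data: y always increases with x, but not along a straight line
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print(f"Pearson r  = {pearson_r(x, y):.3f}")   # 0.981
print(f"Spearman r = {spearman_r(x, y):.3f}")  # 1.000
```

Because Spearman r only uses the order of the values, it equals 1 for any strictly increasing relationship, while Pearson r is below 1 unless the relationship is linear.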
Parametric and non-parametric tests
[Figures: Rearing in the open field (n); groups A and B]
DIFFERENCES BETWEEN GROUPS
[Figure: two comparisons: groups A vs. B, and BEFORE vs. AFTER]
DIFFERENCES BETWEEN GROUPS
PARAMETRIC tests:
• 2 groups, unpaired (independent): unpaired t-test
• 2 groups, paired (dependent): paired t-test
• > 2 groups, unpaired (independent): 1-way ANOVA
• > 2 groups, paired (dependent): 1-way ANOVA for repeated measurements
• More than one factor: MULTIFACTORIAL ANOVA
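Why the paired/unpaired distinction matters can be seen by computing both t statistics on the same numbers. This is a minimal sketch, assuming Student's unpaired t-test with pooled (equal) variances; the scores are hypothetical, and the P value, which requires the t distribution, is not computed here.

```python
import math
from statistics import mean, variance

def unpaired_t(x, y):
    """Student's t for two independent groups, assuming equal variances.
    Returns (t, degrees of freedom)."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    se = math.sqrt(pooled_var * (1 / n_x + 1 / n_y))
    return (mean(x) - mean(y)) / se, n_x + n_y - 2

def paired_t(x, y):
    """t for paired measurements, based on the within-pair differences.
    Returns (t, degrees of freedom)."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    se = math.sqrt(variance(diffs) / n)
    return mean(diffs) / se, n - 1

# Hypothetical scores of the same five animals before and after a treatment
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 16]

t_unpaired, df_unpaired = unpaired_t(after, before)
t_paired, df_paired = paired_t(after, before)
print(f"unpaired: t = {t_unpaired:.2f}, df = {df_unpaired}")  # t = 1.63, df = 8
print(f"paired:   t = {t_paired:.2f}, df = {df_paired}")      # t = 6.32, df = 4
```

With these numbers the paired analysis gives a much larger t because it removes the variability between animals; it is only valid when the two measurements really come from the same subjects.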
DIFFERENCES BETWEEN GROUPS
PARAMETRIC One-sample t-test
[Figure: Preference index for NaCl (n = 8) and Anisomycin (n = 9)]
[Figure: Mice being able to climb down (%), yes / no; wt n = 16, ko n = 18; **]
Our data come from ____, but we really care most about
_____.
O populations; samples
O samples; populations
A researcher believes that students in Hamburg will score
higher than students in Munich in a mathematical test:
Which of the following is the null hypothesis?
O 0
O 1
O 0.5
A researcher hypothesizes that there is a correlation
between blood testosterone concentration and
aggressive behavior. The null hypothesis is that the
correlation is equal to?
O 0
O 1
O 0.5
Which of the following probability values gives you the
most confidence that the null hypothesis is false?
O p = .28
O p = .05
O p = .042
O p = .003
You have just analyzed the results from your experiment
and you calculated p = 0.13.
Which conclusion can you make?
O True
O False
The goal of statistics is to prove that the null hypothesis
is true
O True
O False