Birmingham International Academy
Advanced Study Skills in
Biological Sciences:
Data Handling, Statistics &
Describing data.
Richard Banks
[email protected]
Statistics
• Statistics is a collection of mathematical techniques
that help to analyse and present data
• vital to the scientific method
- used to confirm or reject a hypothesis
• Classified into
‘Descriptive statistics’ and ‘Inferential statistics’
Descriptive Statistics
Used to summarise the basic features of a data set
• measures of central tendency
mean, mode, median
• measures of spread
range, standard deviation, standard error
• measures of distribution
skewness
Variability in ‘biological’ data
Biological data often has a ‘Normal distribution’
i.e. A frequency distribution with the most frequent number near
the middle: central tendency
Frequency distribution.
• Number of times an observation occurs in the data set
• Often presented in a table or a histogram
• % Frequency can be calculated:
% frequency = (frequency of an observation ÷ total number of observations) x 100
Result    Frequency
0         2
1         9
2         26
3         25
4         10
5         3
e.g. % frequency of 0 = 2 / (2+9+26+25+10+3) x 100 = 2.67%
     % frequency of 2 = 26 / (2+9+26+25+10+3) x 100 = 34.67%
• % Frequency can then be used to create a distribution histogram
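The % frequency calculation for the whole table can be sketched in a few lines of Python. This is only an illustration; the variable names are assumptions and not part of the original slides.

```python
# Frequency table from the slide: result -> number of times it was observed
frequency = {0: 2, 1: 9, 2: 26, 3: 25, 4: 10, 5: 3}

total = sum(frequency.values())  # total number of observations

# % frequency = frequency of an observation / total number of observations x 100
percent_frequency = {
    result: round(count / total * 100, 2)
    for result, count in frequency.items()
}

for result, pct in percent_frequency.items():
    print(f"{result}: {pct}%")
```

The rounded percentages may not add up to exactly 100 because each value is rounded to two decimal places.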
Task:
• Calculate the % Frequency of the data set.
• Produce a sketch diagram of the percentage
distribution graph of the table from the
previous slide (also below):
Normal distribution: data with central tendency
Biological data often has a normal distribution
i.e. has a frequency distribution with the most frequent number
near the middle, i.e. central tendency
Therefore, measuring the "middle" value of the data set is useful
Measures of central tendency
• Mean: x̄ = Σx / n   (sum of observations ÷ number of observations)
• Median: equal number of values above and
below (=Middle)
• Mode: Value with the highest frequency (=Most)
• A data set can be bimodal or even multimodal, with 2 or
more values being equally frequent.
*sample mean used as an estimate of the population mean
The Mean
• The mean (or average) is calculated by adding up all the
individual values and dividing the total by the number of values
The Median
• The median value is identified by putting all of the individual
values in size order (smallest to largest) to find the middle value
(if there is an even number of individual values, take the mean of the two
middle values)
The Mode
• The mode is the value that occurs most often in the data set
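The median rule above (sort the values, then take the middle one, or the mean of the two middle ones) can be sketched in Python. The function name and the two short lists are hypothetical, chosen only to show the odd and even cases.

```python
def median(values):
    """Return the middle value of the data, following the rule on the slide."""
    ordered = sorted(values)          # smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]        # odd count: single middle value
    # even count: mean of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = [7, 1, 5]        # hypothetical values
even_example = [7, 1, 5, 3]    # hypothetical values
print(median(odd_example), median(even_example))
```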
Task
• Make a start on
completing the
questions 1-5 on the
worksheet.
• 10 minutes
An example ‘data set’:
2 , 4 , 2 , 0 , 40 , 2 , 4 , 3 , 6
Calculate the mean, median and mode
Mean: x̄ = Σx / n     Σx =        n =        x̄ =
Median: sort data
Mode: (Occurs the most times)
Which is most representative of the centre of the data?
An example ‘data set’:
2 , 4 , 2 , 0 , 40 , 2 , 4 , 3 , 6
Calculate the mean, median and mode
Mean: x̄ = Σx / n     Σx = 63, n = 9, x̄ = 63 / 9 = 7
Is the value 40 an error or a ‘real’ data point?
Median: sort data 0 2 2 2 3 4 4 6 40     Median = 3 (the middle, 5th value)
Mode: 2 (Occurs the most times)
Which is most representative of the centre of the data?
What happens if we exclude the ‘outlier’ ?
2 , 4 , 2 , 0 , [40 excluded] , 2 , 4 , 3 , 6
New data set:
2 4 2 0 2 4 3 6
Mean: x̄ = Σx / n     Σx = 23, n = 8, x̄ = 23 / 8 = 2.875     It was 7
Median: sort data 0 2 2 2 3 4 4 6     Median = (2 + 3) / 2 = 2.5     It was 3
Mode: 2 - occurs the most times It was 2
An ‘outlier’ can have a disproportionate effect on the mean
Median is a reasonably typical value (resistant to outliers)
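The effect of the outlier can be checked with Python's standard `statistics` module. This is a minimal sketch using the example data set from the slides; the variable names are illustrative.

```python
import statistics

data = [2, 4, 2, 0, 40, 2, 4, 3, 6]             # example data set from the slides
without_outlier = [x for x in data if x != 40]  # exclude the suspected outlier

for label, values in [("with 40", data), ("without 40", without_outlier)]:
    print(
        label,
        "| mean =", statistics.mean(values),
        "| median =", statistics.median(values),
        "| mode =", statistics.mode(values),
    )
```

Removing one value changes the mean far more than the median, while the mode does not change at all.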
Range: Difference between the maximum & minimum
• An estimate of the spread of the data (= ‘dispersion’)
e.g. experimental data of weight of lab rats
320 , 367, 423, 471, 480 grams
Range is calculated as maximum - minimum: 480 - 320 = 160 g
• Useful, BUT some data can be very different from other data
points – outliers
e.g. a small baby rat added to the data set
150, 320, 367, 423, 471, 480 g
Range: 480 - 150 = 330 g
• So, not always an accurate description of the overall data set
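As a quick check of the two range calculations, here is a small Python sketch. The helper name `data_range` is an assumption made for this example.

```python
def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


weights = [320, 367, 423, 471, 480]   # lab rat weights in grams (from the slide)
with_baby_rat = [150] + weights       # same data with the small baby rat added

print(data_range(weights), "g")
print(data_range(with_baby_rat), "g")
```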
Data with outliers: the mean and the range are altered by outliers
to a greater extent than the median and the mode.
‘Normal’ distribution
Symmetrical
Mean = Median = Mode
‘Skewed’ distributions
- caused by ‘outliers’
Positive (right) skew
Long upper tail (high values)
Mean > Median > Mode
Mean is moved in the direction of the skew
Negative (left) skew
Long lower tail (low values)
Mean < Median < Mode
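The relationship between the mean and the median gives a rough hint about the direction of skew. The sketch below applies that rule of thumb to the earlier example data set (2, 4, 2, 0, 40, 2, 4, 3, 6). The function name is hypothetical, and this is not a formal skewness test.

```python
import statistics


def skew_direction(values):
    """Rule of thumb: the mean is pulled towards the long tail."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive (right) skew"
    if mean < median:
        return "negative (left) skew"
    return "roughly symmetrical"


data = [2, 4, 2, 0, 40, 2, 4, 3, 6]   # example data set from the earlier slide
print(skew_direction(data))
```

For this data set the mean (7) is larger than the median (3), which is larger than the mode (2), matching the pattern for a positive skew.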
Task
• Make a start on
completing the
questions 6-8 on the
worksheet.
• 5 minutes.
Break
Standard deviation (SD)
A measure of how data is distributed about the mean
• Standard deviation is a measure of the typical distance of
individual values from the overall sample mean
• Allows us to quantify the variability within the data
• Expressed as Mean ± SD
• The lower the standard deviation, the less uncertainty
[or] More confidence in the experimental result
Standard deviation (SD)
A measure of how data is distributed about the mean
• Mean ± SD
• E.g. 55.3 ± 3.3
• For normally distributed data, about 68% of the values in the data set lie
within 1 SD (3.3) of the mean value, i.e. from 52.0 to 58.6
• About 95% of the values fall within 2 SD (6.6), i.e. from 48.7 to 61.9
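The ranges quoted above can be checked with a few lines of Python: one SD either side of the mean for the 68% band, and two SD either side for the 95% band. The variable names are illustrative.

```python
mean, sd = 55.3, 3.3   # example values from the slide

# Round to one decimal place to match the precision of the example
one_sd = (round(mean - sd, 1), round(mean + sd, 1))          # about 68% of values
two_sd = (round(mean - 2 * sd, 1), round(mean + 2 * sd, 1))  # about 95% of values

print("within 1 SD:", one_sd)
print("within 2 SD:", two_sd)
```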
Standard deviation (SD)
a measure of how data is distributed about the mean
• Less spread of data around the
mean = small standard deviation
• More confidence in the data set.
• High spread around the mean =
higher standard deviation
• Lower confidence in the data set.
Standard Deviation and Variance
Sample Variance (s²) = Σ(x - x̄)² / (n - 1)
x = each score/value
x̄ = mean (average)
n = number of scores/values
Σ = sum of…
Standard Deviation = √Variance
Standard deviation (SD):
Calculation Task:
•In a learning behaviour
study, rats had to press a
lever to gain a food
reward.
•The number of lever presses
before each rat gave up trying to
access the food reward is
given on the next slide.
•Can you work out the
standard deviation of the
data set?
Standard deviation (SD)
calculation
Task 10 minutes:
Repetition of lab rat lever pressing in a reward experiment:
Number of lever presses: 9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4,
10, 9, 6, 9, 4.
n=20
To calculate SD
Step 1: Calculate the mean, x̄. Add up all the numbers and divide by
the total number of data points. x̄ = 140 / 20 = 7
Step 2: Subtract the mean from each data point and then square
each value.
Step 3: Calculate the sum of the squared values.
Step 4: To calculate the variance, divide the sum of the squared
values by n-1.
Step 5: The standard deviation is the square root of the variance.
Use a calculator to obtain this number.
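The five steps above can be followed line by line in Python. This is a sketch for checking a hand calculation, using the lever-press data from this slide; the variable names are assumptions.

```python
import math

presses = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4,
           10, 9, 6, 9, 4]
n = len(presses)

mean = sum(presses) / n                              # Step 1: the mean
squared_diffs = [(x - mean) ** 2 for x in presses]   # Step 2: (x - mean) squared
sum_of_squares = sum(squared_diffs)                  # Step 3: sum of squared values
variance = sum_of_squares / (n - 1)                  # Step 4: divide by n - 1
sd = math.sqrt(variance)                             # Step 5: square root

print(f"mean = {mean}, sum of squares = {sum_of_squares}")
print(f"variance = {variance:.3f}, SD = {sd:.2f}")
```

The standard library function `statistics.stdev` uses the same n - 1 formula, so it can be used to confirm the result.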
Usefulness of Standard Deviation
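gives a ‘reliability’ measure: the range Mean ± 2 x SD
= range above and below the mean within which about 95% of
the measurements lie (for normally distributed data)
Note: a 95% confidence interval (CI) for the mean itself is based on
the standard error (approximately Mean ± 2 x SE)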
Expressing data points with SD error bars
Symbol or bar that indicates Mean value
Vertical line representing size of standard deviation
Standard Error (SE)
• SE is related to, but is not the same as, the
standard deviation (SD)
• SE = SD/√N
N = sample size
• expressed as Mean ± SE, N = (sample size)
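As an illustration of SE = SD/√N, the sketch below reuses the lever-press data from the standard deviation task. It relies on `statistics.stdev`, which calculates the sample SD with n - 1 in the denominator; the variable names are assumptions.

```python
import math
import statistics

presses = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4,
           10, 9, 6, 9, 4]

n = len(presses)                   # sample size N
sd = statistics.stdev(presses)     # sample standard deviation
se = sd / math.sqrt(n)             # SE = SD / sqrt(N)

print(f"{statistics.mean(presses)} ± {se:.2f}, N = {n}")
```

Because the SD is divided by √N, the SE becomes smaller as the sample size increases, even if the spread of the data stays the same.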
Overlap between error bars
If SE bars do not overlap, this suggests that the difference between the
means may be meaningful
• Requires an appropriate Statistical Test to confirm
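The overlap check can be written as a small Python function. The function name and the group values are hypothetical, for illustration only, and a non-overlap result still needs a statistical test to confirm it.

```python
def se_bars_overlap(mean_a, se_a, mean_b, se_b):
    """Return True if the intervals mean ± SE of two groups overlap."""
    low_a, high_a = mean_a - se_a, mean_a + se_a
    low_b, high_b = mean_b - se_b, mean_b + se_b
    return low_a <= high_b and low_b <= high_a


# Hypothetical group summaries (mean, SE)
far_apart = se_bars_overlap(7.0, 0.7, 10.0, 0.7)   # bars 6.3-7.7 and 9.3-10.7
close = se_bars_overlap(7.0, 0.7, 8.0, 0.7)        # bars 6.3-7.7 and 7.3-8.7
print(far_apart, close)
```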
Task
• Make a start on
completing the
questions 9-13 on the
worksheet.
• 10 minutes.
Statistical Analyses
- A hypothesis can be confirmed by statistical approaches if sufficient
data has been collected.
- A statistical test confirms whether the difference between data sets is
statistically significant.
- The test used depends on whether the data collected is independent
or matched/paired & on the level of data collected (nominal, categorical,
ordinal, quantitative).
Statistical Analyses
T-Test
Can be used to test a hypothesis to
determine whether there is a significant
difference between the means of two
data sets that are normally distributed.
If the difference between the two data
sets is significant, then the null
hypothesis can be rejected and the
alternative hypothesis, which states
that there is a significant difference
between the sets of data, can be
accepted (see lecture 1, The Scientific
Method).
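To make the idea concrete, here is a minimal sketch of a t statistic for two independent samples, assuming equal variances (a pooled, Student's t-test). The slides do not specify which form of the t-test is used, so this choice is an assumption, and the two groups are hypothetical values invented for the example. The critical value 2.306 is the standard two-tailed table value for p = 0.05 with 8 degrees of freedom.

```python
import math
import statistics


def t_statistic(a, b):
    """Student's t for two independent samples, assuming equal variances."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    se_diff = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / se_diff


group_a = [5, 6, 7, 8, 9]       # hypothetical measurements
group_b = [8, 9, 10, 11, 12]    # hypothetical measurements

t = t_statistic(group_a, group_b)
critical_t = 2.306              # two-tailed, p = 0.05, df = 5 + 5 - 2 = 8

print(f"t = {t:.2f}")
print("reject null hypothesis:", abs(t) > critical_t)
```

In practice a statistics package is normally used to obtain the p-value directly instead of looking up a critical value in a table.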
Statistical Analyses
Chi-Squared Test
Is used to determine whether the observed set of data obtained
from an investigation is statistically significantly different
from that which was originally expected and stated in the
hypothesis.
The null hypothesis will always state there will be no
difference between the observed and expected values.
This test is often used in inheritance studies to see if
observable characteristics follow Mendelian ratios
(e.g. 3:1 or 9:3:3:1)
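As a sketch of how the chi-squared statistic is calculated for a 3:1 ratio, the example below sums (observed - expected)² / expected over the categories. The observed counts are hypothetical, invented for illustration. The critical value 3.841 is the standard table value for p = 0.05 with 1 degree of freedom.

```python
def chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [290, 110]                       # hypothetical counts of two phenotypes
total = sum(observed)
expected = [total * 3 / 4, total * 1 / 4]   # counts expected for a 3:1 ratio

chi2 = chi_squared(observed, expected)
critical_value = 3.841                      # p = 0.05, df = 2 categories - 1 = 1

print(f"chi-squared = {chi2:.2f}")
print("reject null hypothesis:", chi2 > critical_value)
```

Here the statistic is below the critical value, so the null hypothesis is not rejected: the hypothetical counts are consistent with a 3:1 ratio.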
Task
• Complete the questions
on the worksheet (14-
16), so you have them
ready for revision.
• Answer sheets will be
made available on
Canvas after the
session.
Useful links for more information
• University of Birmingham Academic Skills Gateway
http://libguides.bham.ac.uk/asg
• http://www.stats.gla.ac.uk/steps/glossary/presenting_data.html
• http://explorable.com/statistics-tutorial
• http://www.engageinresearch.ac.uk/section_4/step_by_step_statistics.shtml
• http://www.statstutor.ac.uk/topics/
• https://www.bmj.com/about-bmj/resources-readers/public