Biostatistics A4
TRAINING SOLUTIONS
All Pharmacy Training Materials and Consultancy Services
P. O. BOX 51, Fort Portal - Uganda
Copyright ©2021: By Mwesigwa Wilson
Title: Biostatistics
All rights reserved
No part of this book may be reproduced or transmitted in any
form or by any means, electronic or mechanical, including
photocopying or recording, without written permission from the
author, except for use as quotation
ISBN 978-9913-619-47-9
Disclaimer
The information in this handbook has been researched to ensure
it is accurate. However, always refer to standard available
textbooks; the writer and editor may not be held responsible for
any errors or omissions in the handbook
BIOSTATISTICS
A simplified handbook
2021 Edition
AUTHOR
Dr. Mwesigwa Wilson (B Pharm)
Bachelors Degree of Pharmacy – KIU
Postgraduate Diploma in Monitoring & Evaluation – MMU
Certificate in Medical Informatics
Presented by
MPHARMA
TRAINING SOLUTIONS
All Pharmacy Training Materials and Consultancy Services
P. O. BOX 51, Fort Portal - Uganda
TABLE OF CONTENTS
Introduction to Biostatistics
Sampling
Sampling methods
Sample size determination
INTRODUCTION TO BIOSTATISTICS
Inferential statistics
Inferential statistics is the branch of statistics concerned with using sample data to
make inferences about the population from which the sample was drawn. It is used to
generalize findings from samples to populations, to test hypotheses and to determine
whether there is an association between variables
VARIABLES
DATA
Data are the values of observations recorded for variables on one or more
observational units. Data can be quantitative or qualitative
Types of data
▪ Categorical (Qualitative) data
▪ Numerical (Quantitative) data - Discrete or continuous
▪ Based on their mathematical properties, data are divided into four groups:
Nominal, Ordinal, Interval, Ratio
Categorical (Qualitative) data
Qualitative data comprise characteristics which cannot be expressed
numerically, such as gender or religion. They are subdivided into three types
Binary data: The variables or characteristics are divided into mutually exclusive
categories such as gender (male or female), diseased/not diseased, alive /dead
Nominal data: nominal means "in name only". These are variables with more than two
mutually exclusive categories where order does not matter; the values are alphabetic,
or numerical in name only. The categories have no order or direction, so their use is
restricted to keeping track of people, objects and events. Nominal data are the least
powerful level of measurement, with no arithmetic origin, order, direction or
distance relationship
Examples include blood groups (O, A, B and AB), marital status (single, married,
divorced, widowed), employment status (self-employed, public employee, unemployed) etc
Ordinal data: the variables are ordered or ranked, as in rankings or scales. Examples
include responses on a Likert scale (such as agree, neither agree nor disagree,
disagree), level of knowledge (excellent, good, average, poor), quality rating of a
service or product, etc
Quantitative or numerical data
These are data that can be expressed numerically, such as age, temperature or height.
They are classified into two types
Discrete data can take only certain values, usually whole numbers, such as the number
of students in a class. Discrete data "jump" from one value to the next and do not
take any intermediate value between them
Continuous data can take any numerical value within a range, for example weight or
height. There are no gaps between the possible values of the variable, so the number
of possible values is unlimited
Note:
▪ Interval data: the intervals between successive values are equal, for example on
the Fahrenheit temperature scale
▪ Ratio data: The data values in ratio data do have meaningful ratios such as
20 -30
Data sources are broadly classified into: Primary and secondary data sources
Primary data
Primary data are original data collected by the investigator for the purpose of the
study. The data are original in character, are mostly generated by surveys, laboratory
and experimental methods, and have not yet been published. They are generally more
reliable and accurate since they are first-hand information gathered by the investigator.
Merits
▪ The investigator collects data specific to the objective under study
▪ Data interpretation is better since the data collected target the
characteristics of interest
▪ The quality of the data collected is high
Demerits
▪ High cost in obtaining the data
▪ Time consuming, since it involves collecting the data, compiling it and then
analysing it
▪ There may be bias in data collection
Secondary data
This is when the investigator uses data that have already been collected and are
readily available from other sources. These can be divided into internal sources
(within the organization, such as reports) and external sources (outside the
organization). Examples include data obtained from journals, reports, publications,
and compilations from computerized databases and information systems.
Merits: Less expensive and saves time
Demerits
▪ Missing data can affect the results
▪ There may be errors
▪ Sample size may be inadequate
Data collection allows us to systematically gather data from the population about a
characteristic under study
Importance of data and data collection
▪ Data are one of the most important aspects of any research study. Research
conducted in different fields may differ in methodology, but all research is
based on data, which are analyzed and interpreted to get information.
▪ Data is the basic unit in statistical studies.
▪ Statistical information like census, population variables, health statistics, and
road accidents records are all developed from data
Data collection methods
Activity
✓ Explain the factors considered when selecting a data collection method (cost,
human resources, analysis, type of data collected etc.)
Mailed questionnaires
The questionnaires are prepared by the investigator and sent by post or electronically
to the respondents, who fill them in and send them back
Observations
This is a technique that involves systematically selecting, watching and recording
the behaviour of subjects, or any other characteristic under study, for the purpose of
gaining specified information. Different methods are used, such as simple visual
observation or the use of a camera or other equipment (radiographic, x-ray,
microscope). Observation may be done with the subjects knowing that they are being
observed or without their knowledge
Merits: Provides more accurate data on behaviour
Demerits
▪ The observer's own bias or desires can affect the quality of the data
▪ Needs more resources such as skilled labour
▪ Expensive if sophisticated equipment is used
Experiments
They are performed in controlled settings under controlled conditions, such as
laboratories. The data are collected according to the specific objectives and later analyzed
Use of available data sources
▪ Records from hospitals
▪ International Publications like Publications by WHO, World Bank, UNICEF
▪ Publication of Ministry of Health and Other Ministries
▪ Published printed sources: There are varieties of published printed sources.
Their credibility depends on many factors. For example, on the writer,
publishing company and time and date when published.
▪ Books: books are available today on almost any topic that you want to
research. Once a topic has been selected, books provide insight into how much
work has already been done on it and help you prepare your literature
review.
▪ Journals/periodicals: journals provide up-to-date information which books at
times cannot, and they can give information on the very specific topic you
are researching rather than on more general topics.
▪ Magazines/Newspapers: Magazines are also effective but not very reliable.
▪ General Websites (Google); generally, websites may or may not contain
very reliable information.
Common barriers in data collection
DATA PRESENTATION
Graphical or drawings
Data can be presented in the form of graphs, diagrams or pictorial representations.
Graphs are used for quantitative data to give an at-a-glance view of the data
for easier interpretation
Types of graphs include
▪ Histogram
▪ Line chart or graph
▪ Scatter or dot diagram.
Presentation of qualitative, discrete or counted data is through diagrams such as
▪ Bar diagram
▪ Pie chart or sector diagram
▪ Map diagram or spot map; show geographical distribution of data on the map
of the location where data was collected
▪ Pictogram or picture diagram
Histogram
The histogram represents the frequency distribution of quantitative data. The
different groups (class intervals of the variable) are marked on the horizontal line
(x-axis) while the frequency (number of observations) is marked on the vertical line
(y-axis). It comprises adjacent bars in the form of columns or rectangles, and the
height of each rectangle indicates the frequency
Conditions
Bar graph
This is used to represent categorical data and comprises non-adjacent bars, which
can be drawn vertically or horizontally. The length of each bar indicates the
frequency of a category. For example, the marital status of respondents
was as follows
Status      Number   Percentage (%)
Single      50       31.1
Married     86       53.4
Divorced    10       6.2
Widowed     15       9.3
Note: there can be multiple bars on the chart depending on the data being plotted
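The percentage column in the marital status table can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and not part of the handbook.

```python
# Frequency table from the marital status example above.
counts = {"Single": 50, "Married": 86, "Divorced": 10, "Widowed": 15}

total = sum(counts.values())  # total number of respondents

# Percentage for each category, rounded to one decimal place.
percentages = {status: round(100 * n / total, 1) for status, n in counts.items()}

for status, n in counts.items():
    print(f"{status:<10}{n:>5}{percentages[status]:>8.1f}")
```

These percentages are the values that set the bar lengths in a percentage bar graph or the sector sizes in a pie chart.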
Pie charts
A pie chart depicts the frequency distribution of categorical data in a circle (the
“pie”), with the sectors of the circle proportional in size to the frequencies in the
respective categories. Pie charts can be made highly attractive by using color and
three-dimensional design enhancements, but become cumbersome if there are too many
categories
Scatter plot
This is a graphic presentation made to show the nature of the correlation between two
variables measured in the same sample or person, such as hours studied and exam score,
or weight and height. An example is the relationship between hours spent studying and
score in biostatistics
Pictogram
This represents quantity by presenting stylized pictures or icons of the variable
being depicted – the number or size of the icon being proportional to the frequency.
When comparing between groups using a pictogram, it is preferable that same-sized
icons be used across groups (with their numbers varying) – otherwise the picture
may be misleading. Pictograms are more often used in mass media presentations
than in serious biomedical literature.
Stem-and-leaf plot or stem plot
This is a sort of mixture of a diagram and a table. It has been devised to depict the
frequency distribution as well as the individual values of numerical data, and is a
simple way to order and display the data
The data values are examined to determine their leading digits (the “stem”), written
on the left, and their last significant digit (the “leaf”), written on the right
The stem items are usually arranged in ascending or descending order vertically, and
a vertical line is usually drawn to separate the stem from the leaf. The number of
leaf items should total up to the number of observations. However, it becomes
cumbersome with large data sets.
Question 1
The following shows a set of data on scores in biostatistics. 56, 55, 48, 78, 82, 90,
93, 66, 67, 69, 74, 79, 64, 92, 88, 66, 45, 74, 64, 58, 73, 40, 83, 84, 77, 88.
Construct a stem and leaf
Step 1: In order to create a stem and leaf plot, arrange the data in ascending order
and organize it into groups
40, 45, 48
55, 56, 58
64, 64, 66, 66, 67, 69
73, 74, 74, 77, 78, 79
82, 83, 84, 88, 88
90, 92, 93
Step 2: Create the plot with the stems as the tens and the leaves as the ones.
Stem │ Leaves
4 │ 0, 5, 8
5 │ 5, 6, 8
6 │ 4, 4, 6, 6, 7, 9
7 │ 3, 4, 4, 7, 8, 9
8 │ 2, 3, 4, 8, 8
9 │ 0, 2, 3
KEY: 4│0 = 40%
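The two steps can also be sketched in Python as a cross-check of the plot. The function name `stem_and_leaf` is hypothetical, and the sketch assumes two-digit whole-number scores so that the tens digit is the stem and the ones digit is the leaf.

```python
from collections import defaultdict

# Scores from Question 1.
scores = [56, 55, 48, 78, 82, 90, 93, 66, 67, 69, 74, 79, 64,
          92, 88, 66, 45, 74, 64, 58, 73, 40, 83, 84, 77, 88]


def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem) and ones digit (leaf)."""
    plot = defaultdict(list)
    for value in sorted(values):      # Step 1: order the data
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)       # Step 2: attach each leaf to its stem
    return dict(plot)


plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(f"{stem} | {', '.join(str(leaf) for leaf in plot[stem])}")

# The number of leaves should equal the number of observations.
leaf_count = sum(len(leaves) for leaves in plot.values())
```

Comparing `leaf_count` with `len(scores)` is a quick way to catch a value dropped while building the plot by hand.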
Question two: You are provided with the following data. Draw the stem and leaf
diagram:
3.7, 5.4, 4.2, 0.5, 3.2, 0.7, 3.6, 1.3, 5.3, 3.9, 3.1, 2.5, 3.6, 1.6, 1.9, 3.6, 0.6, 0.5, 4.6,
4.9
Box plot
A box plot, also referred to as a box-and-whisker plot, displays the five-number
summary of a set of data: the minimum, first quartile, median, third quartile and
maximum. The box is drawn from the first quartile to the third quartile. A vertical
line goes through the box at the median. The whiskers go from each quartile to the
minimum or maximum.
MEASURES OF CENTRAL TENDENCY
THE MEAN
The mean is the arithmetic average: the sum of the values of all observations in a
dataset divided by the number of observations. It is the most commonly used measure
in statistical methods
Properties of mean
▪ It is very easy to compute
▪ It is easily affected by extreme scores
▪ It may not be an actual observation in the data set.
▪ Sum of each score’s distance from the mean is zero.
▪ It can be applied to interval level of measurement.
▪ It measures stability. Mean is the most stable among other measures of
central tendency because every score contributes to the value of the mean.
Limitations of the mean
▪ As the mean includes every value in the distribution, it is influenced
by outliers (extreme values)
▪ The mean cannot be calculated for categorical data, as the values cannot be
summed.
Calculations of mean for ungrouped data
A series of observations is indicated by the letter X and the individual
observations by X1, X2, …, Xn, where n is the number of observations
$$\text{Mean} = \frac{\text{Total sum of the observations}}{\text{Number of observations}} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum X}{n}$$
1 Find the mean of the following scores (X): 25, 20, 18, 17, 15, 14, 13, 15, 16, 12, 10.
$$\text{Mean} = \frac{\sum X}{n} = \frac{25+20+18+17+15+14+13+15+16+12+10}{11} = \frac{175}{11} = 15.9$$
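The same calculation can be sketched in Python. The second part illustrates the property listed earlier that the deviations from the mean sum to zero; the variable names are illustrative only.

```python
# Scores from the worked example.
scores = [25, 20, 18, 17, 15, 14, 13, 15, 16, 12, 10]

# Mean = sum of the observations divided by the number of observations.
mean = sum(scores) / len(scores)
print(round(mean, 1))

# Property of the mean: deviations from the mean add up to zero
# (apart from tiny floating-point rounding error).
total_deviation = sum(x - mean for x in scores)
```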
THE MEDIAN
The median is the middle value in a distribution when the values are arranged in
ascending or descending order
The median is therefore a better indicator of central value when one or more of the
lowest or the highest observations are wide apart or not evenly distributed
It is used when the exact midpoint of the score distribution is desired and
when there are extreme scores in the distribution.
Properties of the Median
▪ It may not be an actual observation in the data set; especially if two figures
lie in the middle
▪ It is not affected by extreme values because median is a positional measure.
▪ It can be applied in ordinal level.
Median of ungrouped data
▪ Arrange the scores (from lowest to highest or highest to lowest)
▪ Take the middle score if n is an odd number, or the average of the two
middle scores if n is an even number.
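A minimal Python sketch of these two steps follows. The function name `median_ungrouped` is hypothetical; the odd-sized data set reuses the scores from the mean example, and the even-sized one simply drops the last score for illustration.

```python
def median_ungrouped(values):
    """Median of raw data: middle value, or mean of the two middle values."""
    ordered = sorted(values)          # arrange the scores
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd n: single middle score
        return ordered[middle]
    # even n: average of the two middle scores
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_scores = [25, 20, 18, 17, 15, 14, 13, 15, 16, 12, 10]   # n = 11
even_scores = odd_scores[:-1]                                # n = 10 (illustrative)

median_odd = median_ungrouped(odd_scores)
median_even = median_ungrouped(even_scores)
```

With the even-sized set the median (15.5) is not an actual observation, which is the first property listed above.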
Median of grouped data
$$\text{Median} = LB + \left(\frac{\frac{n}{2} - cf_p}{f_m}\right) \times c.i$$
Where,
LB = lower boundary of the median class (MC)
n = total number of observations
cfp = cumulative frequency before the median class if the scores are arranged from
lowest to highest value
fm = frequency of the median class
c.i = size of the class interval
THE MODE
The mode is the most commonly occurring value in a data set. Distributions are
classified as unimodal, bimodal, trimodal or multimodal.
▪ Unimodal is a distribution of scores that consists of only one mode.
▪ Bimodal is a distribution of scores that consists of two modes
▪ Trimodal is a distribution of scores that consists of three modes
▪ Multimodal is a distribution of scores that consists of more than two modes.
Limitations of the mode
▪ In some distributions, the mode may not reflect the centre of the distribution
very well
▪ There can be more than one mode for the same distribution of data
▪ For continuous data, the distribution may have no mode at all (i.e. if all
values are different).
For ungrouped data
Consider this dataset showing the retirement age of 12 people, in whole years: 54,
55, 54, 56, 56, 57, 57, 58, 57, 58, 60, 60. The most commonly occurring value is 57
(it occurs three times), therefore the mode of this distribution is 57 years
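Counting how often each value occurs makes the mode easy to verify. The sketch below uses Python's `collections.Counter`; the helper name `modes` and the small bimodal list are illustrative assumptions, not data from the handbook.

```python
from collections import Counter

# Retirement ages as listed in the example.
ages = [54, 55, 54, 56, 56, 57, 57, 58, 57, 58, 60, 60]


def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


age_modes = modes(ages)                    # a single mode: unimodal
bimodal_example = modes([1, 1, 2, 2, 3])   # hypothetical data with two modes
```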
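For grouped data
$$\text{Mode} = LB + \left(\frac{d_1}{d_1 + d_2}\right) \times c.i$$
Where,
LB = lower boundary of the modal class. The modal class (MC) is the category
containing the highest frequency
d1 = difference between the frequency of the modal class and the frequency of the
class above it (the preceding class), when the scores are arranged from lowest to highest.
d2 = difference between the frequency of the modal class and the frequency of the
class below it (the following class), when the scores are arranged from lowest to highest.
c.i = size of the class interval
Summary of the formulas for grouped data, where Xm is the midpoint of each class
$$\text{Mean} = \frac{\sum f X_m}{n}$$
$$\text{Median} = LB + \left(\frac{\frac{n}{2} - cf_p}{f_m}\right) \times c.i$$
$$\text{Mode} = LB + \left(\frac{d_1}{d_1 + d_2}\right) \times c.i$$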
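The three grouped-data formulas can be sketched in Python as follows. The frequencies are borrowed from the score table in the interquartile range section of this handbook; the variable names, and the assumption that class limits are whole numbers (so each lower boundary is the lower limit minus 0.5), are mine.

```python
# (lower limit, upper limit, frequency) for each class.
classes = [(20, 29, 4), (30, 39, 8), (40, 49, 20),
           (50, 59, 16), (60, 69, 9), (70, 79, 3)]

n = sum(f for _, _, f in classes)
width = classes[1][0] - classes[0][0]      # class interval (c.i), here 10

# Mean: sum of f * midpoint, divided by n.
mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

# Median: find the class holding the (n/2)th value, then interpolate.
cumulative = 0                             # cfp: cumulative frequency so far
for lo, hi, f in classes:
    if cumulative + f >= n / 2:
        median = (lo - 0.5) + (n / 2 - cumulative) / f * width
        break
    cumulative += f

# Mode: interpolate inside the class with the highest frequency.
freqs = [f for _, _, f in classes]
i = freqs.index(max(freqs))
d1 = freqs[i] - (freqs[i - 1] if i > 0 else 0)               # vs preceding class
d2 = freqs[i] - (freqs[i + 1] if i < len(freqs) - 1 else 0)  # vs following class
mode = (classes[i][0] - 0.5) + d1 / (d1 + d2) * width

print(mean, median, mode)
```

The same structure can be reused for the exercise that follows by replacing the `classes` list.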
3 The results show the age of patients in years attended to at the pharmacy.
Age Frequency
0-9 15
10 - 19 26
20 - 29 23
30 - 39 13
40 - 49 17
50 - 59 34
60 - 69 19
70 - 79 8
80 - 89 11
a) Calculate the mean, median and mode
b) Draw a frequency distribution table
MEASURES OF LOCATION
PERCENTILES
A percentile is the value below which a certain percentage of observations lie.
Percentiles split the data, arranged in ascending order of magnitude, into 100 equal
parts, i.e., hundredths
For instance, the 40th percentile splits the data into the lower 40% of the values and
the upper 60% of the values. Percentiles are one way of describing the variability
within a data set of continuous data such as height or age. The 50th percentile
corresponds to the median
To determine the percentile rank of x in a data set
$$\text{Percentile rank of } x = \frac{\text{number of values below } x}{n} \times 100$$
where n is the number of values in the data set
To determine the value existing at a certain percentile, first find its position in
the ordered data
$$\text{Position} = \left(\frac{\text{percentile} \times (n+1)}{100}\right)\text{th value}$$
1. You are provided with the following data set: 2, 3, 4, 6, 7, 8, 9, 10, 11 and 12.
What is the percentile ranking of 7?
Note: If the data set is not arranged in ascending order, first arrange values.
Number of values below 7 = 4, n = 10
$$\text{Percentile ranking of 7} = \frac{4}{10} \times 100 = 40\%$$
2. From the same data set, what value exists at percentile ranking 60%?
$$\text{Position of the value at 60\%} = \frac{60 \times (10+1)}{100} = 6.6\text{th number}$$
2, 3, 4, 6, 7, 8, 9, 10, 11, 12
Value at 60% = 6th + 0.6 (7th – 6th); 6th number = 8, 7th number = 9
Value at 60% = 8 + 0.6 (9 – 8) = 8 + 0.6 = 8.6
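Both calculations can be sketched in Python. The function names are hypothetical, and the second function follows the (n + 1) position rule used in this handbook; statistical packages often use other interpolation rules and may give slightly different answers.

```python
def percentile_rank(values, x):
    """Percentage of values in the data set that lie below x."""
    below = sum(1 for v in values if v < x)
    return below * 100 / len(values)


def value_at_percentile(values, p):
    """Value at the p-th percentile using the p(n + 1)/100 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100           # 1-based, may be fractional
    position = min(max(position, 1), n)    # keep the position inside the data
    whole = int(position)
    fraction = position - whole
    if whole == n:                         # already at the largest value
        return ordered[-1]
    # Interpolate between the whole-th and (whole + 1)-th ordered values.
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


data = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]
rank_of_7 = percentile_rank(data, 7)
value_60 = value_at_percentile(data, 60)
```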
3. You are provided with the following data set. 20, 5, 8, 5, 9, 17, 15, 11, 9, 10, 3
and 6. Find the value at percentile 40, 65 and 75
DECILES
Deciles are a form of percentiles that split the data up into groups of 10%, meaning
every decile contains 10% of the data.
The first, second, …, ninth deciles are denoted by D1, D2, …, D9 respectively. The
fifth decile corresponds to the median
The second, fourth, sixth and eighth deciles, which collectively divide the data into
five equal parts, are called quintiles
Decile for ungrouped data
$$D_1 = \text{value of the } \frac{n+1}{10}\text{th item}$$
$$D_2 = \text{value of the } \frac{2(n+1)}{10}\text{th item}$$
$$D_4 = \text{value of the } \frac{4(n+1)}{10}\text{th item}$$
$$D_9 = \text{value of the } \frac{9(n+1)}{10}\text{th item}$$
1 For the data set, calculate the third and sixth deciles: 54 39 53 42 55 82 58
81 61 67 74 93 68 70
Order: 39 42 53 54 55 58 61 67 68 70 74 81 82 93 (n = 14)
$$D_3 = \frac{3(n+1)}{10}\text{th} = \frac{3(14+1)}{10} = \frac{45}{10} = 4.5\text{th item}$$
4.5th = 4th + 0.5 (5th – 4th) = 54 + 0.5 (55 – 54) = 54.5
$$D_6 = \frac{6(14+1)}{10} = \frac{90}{10} = 9\text{th item} = 68$$
QUARTILE
Similar to deciles, quartiles are a form of percentiles which split the data into
quarters at the 25th, 50th and 75th percentiles
They divide data into four parts. The first quartile, Q1, is referred to as the lower
quartile and the third quartile, Q3, is known as the upper quartile. Q1 splits the
data into the lower 25% of the values and the upper 75% of the values. The upper
quartile subdivides the data into the lower 75% of the values and the upper 25%
The difference between the upper quartile and the lower quartile is known as
the inter-quartile range, which indicates the spread of the middle 50% of the data.
Inter-quartile range = Q3 – Q1
$$Q_1 = \text{value of the } \frac{n+1}{4}\text{th item}$$
$$Q_2 = \text{value of the } \frac{2(n+1)}{4}\text{th item}$$
$$Q_3 = \text{value of the } \frac{3(n+1)}{4}\text{th item}$$
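Percentiles and deciles for grouped data
Percentiles: the percentile class of $P_i$ is the class containing the $\frac{i \times n}{100}$th value
$$P_i = L + \left(\frac{\frac{i \times n}{100} - cf}{f}\right) \times \text{class interval}, \quad i = 1, 2, \dots, 99$$
Deciles: the decile class of $D_i$ is the class containing the $\frac{i \times n}{10}$th value
$$D_i = L + \left(\frac{\frac{i \times n}{10} - cf}{f}\right) \times \text{class interval}, \quad i = 1, 2, \dots, 9$$
where L is the lower boundary of the percentile or decile class, cf is the cumulative
frequency before that class and f is the frequency of that class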
Question
MEASURES OF DISPERSION
The measures of central tendency are not adequate to describe data on their own. Two
data sets can have the same mean and yet be entirely different; thus, to describe
data, one needs to know the extent of variability, given by measures of variation
Biological data, quantitative or qualitative, collected by measurement or counting
are very variable (they vary from person to person and group to group), for example
intelligence quotient, behavior and physical characteristics.
The variability in a sample can be biological or experimental.
Height, weight, blood pressure, WBC count, blood sugar, urea, cholesterol etc vary
depending on age, sex, social status or nature of work
Experimental variability may be due to errors by the observer, the instruments or
equipment used, and sampling errors or bias. Measures of variability include
▪ Range
▪ Interquartile range
▪ Mean deviation
▪ Standard deviation
▪ Coefficient of variation
Other measures of variability of samples include: standard error of mean, standard
error of difference between two means, standard error of proportion, standard error
of difference between two proportions, standard error of correlation coefficient and
standard deviation of regression coefficient.
RANGE
Range is the simplest measure of dispersion: it is the difference between the largest
and the smallest observation in the data. It is commonly quoted for biological
characteristics and defines the normal limits of measures such as blood pressure,
blood sugar levels, menstrual cycle days and bilirubin levels
The prime advantage of this measure of dispersion is that it is easy to calculate. On
the other hand, it has a lot of disadvantages: it is very sensitive to outliers and
does not use all the observations in a data set.
INTERQUARTILE RANGE
Interquartile range is defined as the difference between the 25th and 75th percentile
(also called the first and third quartile). Hence the interquartile range describes the
middle 50% of observations. If the interquartile range is large it means that the
middle 50% of observations are spaced wide apart.
The important advantage of interquartile range is that it can be used as a measure of
variability if the extreme values are not being recorded exactly and is also not
affected by extreme values.
1. Given the following data of scores on a mid-term test, find the interquartile
range: 7, 7, 5, 11, 2, 8, 6, 3, 10, 11, 4
Step 1: Arrange in ascending order: 2, 3, 4, 5, 6, 7, 7, 8, 10, 11, 11
Step 2: Median = 7 (the 6th of the 11 values)
Step 3: Split the data at the median: 2, 3, 4, 5, 6 │7│ 7, 8, 10, 11, 11
Q1 = median of the lower half (2, 3, 4, 5, 6) = 4
Q3 = median of the upper half (7, 8, 10, 11, 11) = 10
Step 4: IQR = Q3 – Q1 = 10 – 4 = 6
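As a cross-check, the quartiles can also be located with the (n + 1)/4 and 3(n + 1)/4 position formulas from the quartile section. The Python sketch below uses a hypothetical helper, `value_at_position`, and assumes the position lies between 1 and n; for this data set it gives the same quartiles as splitting the data at the median.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    # Interpolate between neighbouring ordered values.
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


scores = [7, 7, 5, 11, 2, 8, 6, 3, 10, 11, 4]
ordered = sorted(scores)
n = len(ordered)

q1 = value_at_position(ordered, (n + 1) / 4)       # 3rd value
q3 = value_at_position(ordered, 3 * (n + 1) / 4)   # 9th value
iqr = q3 - q1
```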
1. Find IQR
Score    F    Cf   Class boundaries
20-29    4    4    19.5-29.5
30-39    8    12   29.5-39.5
40-49    20   32   39.5-49.5   (Q1)
50-59    16   48   49.5-59.5   (Q3)
60-69    9    57   59.5-69.5
70-79    3    60   69.5-79.5
$$\text{Position of } Q_1 = \frac{1}{4} \times 60 = 15\text{th score (in the class 39.5-49.5)}$$
$$\text{Position of } Q_3 = \frac{3}{4} \times 60 = 45\text{th score (in the class 49.5-59.5)}$$
$$Q_1 = 39.5 + \left(\frac{15 - 12}{20}\right) \times 10 = 41$$
$$Q_3 = 49.5 + \left(\frac{45 - 32}{16}\right) \times 10 = 57.6$$
IQR = Q3 – Q1 = 57.6 – 41 = 16.6
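The grouped calculation follows the same pattern for any quartile, so it can be sketched as one Python function. The name `grouped_quantile` is hypothetical; the class boundaries and frequencies are those of the table above.

```python
# (lower class boundary, frequency) for each class in the table.
classes = [(19.5, 4), (29.5, 8), (39.5, 20), (49.5, 16), (59.5, 9), (69.5, 3)]
width = 10                                  # class interval
n = sum(f for _, f in classes)              # 60 scores


def grouped_quantile(target):
    """Value of the target-th score, interpolated inside its class."""
    cumulative = 0                          # cumulative frequency before the class
    for lower_boundary, f in classes:
        if cumulative + f >= target:
            return lower_boundary + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("target exceeds the number of observations")


q1 = grouped_quantile(n / 4)        # 15th score
q3 = grouped_quantile(3 * n / 4)    # 45th score
iqr = q3 - q1
print(q1, q3, iqr)
```

The unrounded Q3 is 57.625, so the unrounded IQR is 16.625, which the worked answer reports as 16.6.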
2 Find the IQR
Interval frequency
10-19 5
20-29 4
30-39 15
40-49 13
50-59 12
60-69 9
70-79 11
MEAN DEVIATION
This defines how far, on average, all values are from the middle (the mean).
Procedure for calculation
1. Find the mean of the values (X) provided: $\bar{X}$
2. Find the distance of each value from the mean by subtracting the mean from each
value. Ignore all the negative signs and take the differences as positive: $|X - \bar{X}|$
3. Add all the distances to get $\sum |X - \bar{X}|$
4. $\text{Mean deviation} = \frac{\sum |X - \bar{X}|}{n}$, where n is the number of values provided
Note: This is not commonly used in statistical analysis
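The four steps can be sketched in Python. For illustration the data are the respiratory rates used in the standard deviation example later in this chapter; the variable names are assumptions.

```python
# Respiratory rates per minute (from the standard deviation example).
rates = [18, 15, 20, 21, 19, 17, 23, 16, 25, 24, 26, 14]

n = len(rates)
mean = sum(rates) / n                                 # step 1
distances = [abs(x - mean) for x in rates]            # step 2: ignore the signs
total_distance = sum(distances)                       # step 3
mean_deviation = total_distance / n                   # step 4

print(round(mean_deviation, 2))
```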
STANDARD DEVIATION
This is the most commonly used measure of dispersion; it measures the spread of data
about the mean. For a sample, SD is the square root of the sum of squared deviations
from the mean divided by the number of observations minus one. Procedure for calculation
1. Calculate the mean: $\bar{X} = \frac{\sum X}{n}$
2. Find the difference of each observation from the mean: $(X - \bar{X})$
3. Square the differences of the observations from the mean: $(X - \bar{X})^2$
4. Add the squared values to get the sum of squares of the deviations: $\sum (X - \bar{X})^2$
5. Divide this sum by the number of observations minus one to get the mean-squared
deviation, called the variance
$$\text{Variance } (s^2) = \frac{\sum (X - \bar{X})^2}{n - 1}$$
6. Find the square root of this variance to get the root-mean-squared deviation,
called the standard deviation.
Note: Because the deviations were squared, taking the square root reverses that step
and returns the result to the original units.
$$SD = \sqrt{\text{Variance}}$$
The same result can be computed directly from the sums of the data
$$SD = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}}$$
Where ∑X = sum of the data provided, ∑X² = sum of the squares of the data and n =
number of data provided
A large SD shows that the measurements of the frequency distribution are widely
spread out from the mean. A small SD means the observations are closely clustered
around the mean
SD summarizes the deviations of a large distribution from the mean in one figure,
which is used as a unit of variation and in finding a suitable sample size
1 The following are respiratory rates per minute at a certain hospital: 18, 15,
20, 21, 19, 17, 23, 16, 25, 24, 26 and 14. Find the mean respiratory rate per
minute and the standard deviation
X        (X - x̄)    (X - x̄)²
18       -1.8       3.24
15       -4.8       23.04
20       0.2        0.04
21       1.2        1.44
19       -0.8       0.64
17       -2.8       7.84
23       3.2        10.24
16       -3.8       14.44
25       5.2        27.04
24       4.2        17.64
26       6.2        38.44
14       -5.8       33.64
∑=238               ∑=177.68
$$\text{Mean} = \frac{\sum X}{n} = \frac{238}{12} = 19.8 \text{ breaths per minute}$$
$$SD = \sqrt{\text{Variance}} = \sqrt{\frac{177.68}{12 - 1}} = \sqrt{\frac{177.68}{11}} = \sqrt{16.15} = 4.02$$
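The worked example can be reproduced in Python as a check. The sketch follows the six-step procedure and then compares the result with `statistics.stdev` from the standard library, which also divides by n − 1. Because the code keeps the unrounded mean (19.83…), its sum of squares differs slightly from the table's 177.68, but the SD agrees to two decimal places.

```python
import math
import statistics

rates = [18, 15, 20, 21, 19, 17, 23, 16, 25, 24, 26, 14]

n = len(rates)
mean = sum(rates) / n                                   # step 1
squared_deviations = [(x - mean) ** 2 for x in rates]   # steps 2 and 3
sum_of_squares = sum(squared_deviations)                # step 4
variance = sum_of_squares / (n - 1)                     # step 5 (sample variance)
sd = math.sqrt(variance)                                # step 6

# Cross-check against the standard library's sample standard deviation.
library_sd = statistics.stdev(rates)

print(round(mean, 1), round(sd, 2))
```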
1. Find SD
Interval   f     m     fm     fm²
5-9        9     7     63     441
10-14      7     12    84     1008
15-19      10    17    170    2890
20-24      13    22    286    6292
25-29      7     27    189    5103
30-34      14    32    448    14336
35-39      8     37    296    10952
40-44      5     42    210    8820
45-49      6     47    282    13254
           ∑=79        ∑=2028 ∑=63096
$$SD = \sqrt{\frac{63096 - \frac{2028 \times 2028}{79}}{79 - 1}} = 11.89$$
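The columns of the table and the final formula can be sketched in Python. The class limits and frequencies are those of the table; the variable names are illustrative, and m is taken as the midpoint of each class.

```python
import math

# (lower limit, upper limit, frequency) for each class in the table.
classes = [(5, 9, 9), (10, 14, 7), (15, 19, 10), (20, 24, 13), (25, 29, 7),
           (30, 34, 14), (35, 39, 8), (40, 44, 5), (45, 49, 6)]

n = sum(f for _, _, f in classes)                               # sum of f
sum_fm = sum(f * (lo + hi) / 2 for lo, hi, f in classes)        # sum of f * m
sum_fm2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)  # sum of f * m^2

# Computational formula with n - 1 in the denominator.
variance = (sum_fm2 - sum_fm ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(round(sd, 2))
```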
Interval f
10-19 5
20-29 4
30-39 15
40-49 13
50-59 12
3. The table below shows the daily commuting times (in minutes) of workers.
Calculate the variance and standard deviation
Times Number of workers
5-14 25
15-24 14
25-34 32
35-44 17
45-54 10
COEFFICIENT OF VARIATION
This is the ratio of the standard deviation to the mean. The value is expressed as a
percentage and is used to compare the variability of one characteristic in two
different groups whose values differ in magnitude, such as height in adults and
children
$$\text{Coefficient of Variation} = \frac{SD}{\text{Mean}} \times 100$$
1. In a study, the following data were obtained about mean weight and SD in
children and adults. Calculate the coefficient of variation and state which series
shows greater variation
           Mean weight   SD
Children   23 kg         5
Adult      55 kg         8
$$CV_{\text{children}} = \frac{5}{23} \times 100 = 21.7\%$$
$$CV_{\text{adult}} = \frac{8}{55} \times 100 = 14.5\%$$
The weight in children shows greater variation
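The comparison can be checked with a short Python sketch; the function name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


cv_children = coefficient_of_variation(sd=5, mean=23)
cv_adults = coefficient_of_variation(sd=8, mean=55)

# The group with the larger CV shows the greater relative variation.
more_variable = "children" if cv_children > cv_adults else "adults"
print(round(cv_children, 1), round(cv_adults, 1), more_variable)
```

Although the adults have the larger SD in kilograms, the children's SD is larger relative to their mean weight, which is why the CV is the appropriate comparison here.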
2. Provided with following data
Range (grams) Frequency
90-99 5
100-109 4
110-119 8
120-129 17
130-139 12
140-149 5
150-159 8
160-169 6
Calculate the Coefficient of Variation and Draw a frequency polygon
MEASURES OF FREQUENCY
These are measures used in studies of a population (a group of people with some
common characteristic, such as age, race, gender, or place of residence).
There are two classifications
Measures of disease frequency in mathematical quantity
▪ Counts
▪ Proportion (percentage)
▪ Rate
▪ Ratio
Measures of disease frequency in epidemiology
▪ Prevalence
▪ Incidence
Counts
This is the simplest and most basic measure, defined as the absolute number of
persons who have the characteristic of interest or disease
It is an important basic measure of disease frequency that is essential to detecting
trends or the sudden occurrence of a problem, such as an epidemic. Examples include
▪ 530 cases of Covid-19 in Kampala
▪ 60 confirmed Ebola cases reported in Kivu – DRC
▪ 136 cases of acute diarrhea reported in a camp
▪ 120 cases of cholera in IDP camps
Simple counts of the number of diseased people are also
▪ Important to public health planners and policy makers for assessing the need
and allocation for resources in a population
▪ Used in surveillance of infectious disease for early detection of outbreaks
However, counts usually depend on the size of the population or risk area: the bigger
the population, the higher the number of cases tends to be, although this is not
always true. The duration of observation also affects the frequency of cases; the
longer the observation period, the more cases can occur.
Ratio
This is a value obtained by dividing one number by another
(either related or unrelated). A ratio doesn't necessarily imply any particular
relationship between the numerator and the denominator; it is obtained by
dividing the number of one event by the number of another event.
For example, a study included 125 men and 175 women. The ratio of men : women =
125 : 175 = 5 : 7
Proportion
A type of ratio that relates a part to a whole; often expressed as a percentage (%)
Of 1216 females surveyed, only 243 reported using contraceptives.
$$\text{Proportion of women who use contraceptives} = \frac{243}{1216} \times 100 = 20\%$$
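The ratio and proportion examples can be sketched in Python. Reducing the ratio with the greatest common divisor is my choice of method for the illustration; the handbook only states the simplified result.

```python
from math import gcd

# Ratio: one count divided by another (men : women).
men, women = 125, 175
divisor = gcd(men, women)                        # largest number dividing both
ratio = (men // divisor, women // divisor)       # simplified ratio

# Proportion: a part divided by the whole, as a percentage.
users, surveyed = 243, 1216
proportion = users / surveyed * 100

print(f"{ratio[0]}:{ratio[1]}", round(proportion))
```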
Rate
This is the frequency of events that occur in a defined time period, divided by the
average population at risk. In the general population, examples of rates are the birth
malformation rate, crude death rate and smoking rate; in reality, all these measures
are proportions. For example, the smoking rate among adults is actually the number of
adults in a population who smoke divided by the total number of adults in the population
$$\text{Smoking rate} = \frac{\text{Number of adults in the population who smoke}}{\text{Total number of adults in the population}}$$
Prevalence
Prevalence measures the proportion of individuals in a defined population that have
a disease or other health outcome of interest at a specified point in time (point
prevalence) or during a specified period of time (period prevalence).
$$\text{Prevalence} = \frac{\text{No. of cases observed at a point in time in a defined population}}{\text{Total number of persons in the defined population at the same point in time}}$$
Uses
▪ Prevalence is a useful measure for quantifying the burden of disease in a
population at a given point in time
▪ Useful in planning health services
Of 20,563 residents of Kampala on 1st June 2021, 657 tested positive for
corona virus.
$$\text{Prevalence of corona virus} = \frac{657}{20563} \times 100 = 3.2\%$$
Incidence
This is a measure of the number of new cases of a disease or any other health
outcome of interest that develops in a population at risk during a specified time
period. There are two main measures of incidence: Cumulative incidence and
incidence rate
Cumulative incidence (CI)
This is related to the population at risk at the beginning of the study period. It is
the proportion of individuals in a population (initially free of disease) who develop
the disease within a specified time interval. Cumulative incidence is expressed as a
percentage (or, if small, per 1000 persons).
▪ This is also referred to as incidence risk, or simply risk
$$CI = \frac{\text{No. of new cases of a disease in a specified period of time}}{\text{Number of disease-free persons at the beginning of that period}}$$
The denominator may also be given as the number of individuals in the population at
the beginning of the period
1. At the beginning of May 2021, there were 237 patients in a certain hospital.
By 15th May 2021, 39 patients had tested positive for corona virus. What is
the cumulative incidence of corona virus?
$$CI = \frac{39}{237} \times 100 = 16.5\%$$
2. The population of Kyenjojo Village was 18,922 in Feb 2021. Among them,
1253 had contracted malaria by Apr 2021. What is the cumulative
incidence of malaria?
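Prevalence and cumulative incidence are both proportions, so one small helper covers the two worked examples. The sketch below is illustrative only; the function name `as_percentage` is hypothetical.

```python
def as_percentage(cases, population):
    """Cases as a percentage of the population they are drawn from."""
    return cases / population * 100


# Point prevalence: existing cases among all residents on 1st June 2021.
prevalence = as_percentage(cases=657, population=20563)

# Cumulative incidence: new cases among patients present at the start of May 2021.
cumulative_incidence = as_percentage(cases=39, population=237)

print(round(prevalence, 1), round(cumulative_incidence, 1))
```

The arithmetic is identical; what differs is the numerator (existing cases versus new cases) and the denominator (everyone at one point in time versus those at risk at the start of the period).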
Incidence rate
This is the measure of the frequency of new cases of disease in a population and
takes into account the sum of the time that each person remained under observation
and at risk of developing the outcome under investigation. Has dimensions, unit is
time usually persons-year
𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑛𝑒𝑤 𝑐𝑎𝑠𝑒 𝑜𝑓 𝑑𝑖𝑠𝑒𝑎𝑠𝑒 𝑖𝑛 𝑎 𝑔𝑖𝑣𝑒𝑛 𝑡𝑖𝑚𝑒 𝑝𝑒𝑟𝑖𝑜𝑑
Incidence rate =
𝑇𝑜𝑡𝑎𝑙 𝑝𝑒𝑟𝑠𝑜𝑛−𝑡𝑖𝑚𝑒 𝑎𝑡 𝑟𝑖𝑠𝑘 𝑑𝑢𝑟𝑖𝑛𝑔 𝑡ℎ𝑒 𝑓𝑜𝑙𝑙𝑜𝑤−𝑢𝑝 𝑝𝑒𝑟𝑖𝑜𝑑
The incidence rate is the rate of contracting the disease among those still at risk.
When a study subject develops the disease, dies or leaves the study they are no
longer at risk and will no longer contribute person-time units at risk.
Person-time at risk is obtained by summing, over all study subjects, the length of
time each person remained under observation and at risk during the given time period.
1. In 2020, there were 35 new cases of HIV among the youth aged 18-25 years in
Kyegegwa. The person-years at risk in that group were 3,525. Calculate the
incidence rate
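A minimal Python sketch of the incidence rate calculation follows. It assumes the rate is reported per 1,000 person-years (a common convention, not stated in the exercise); the function name `incidence_rate` and the list of follow-up times are hypothetical and serve only to show how person-time is summed.

```python
def incidence_rate(new_cases: int, person_time: float, per: int = 1000) -> float:
    """New cases divided by total person-time at risk, scaled to `per` units."""
    if person_time <= 0:
        raise ValueError("person_time must be positive")
    return new_cases / person_time * per


# Person-time is the sum of the time each subject spent at risk.
# Hypothetical follow-up times (in years) for four subjects:
follow_up_years = [2.0, 1.5, 3.0, 0.5]
total_person_years = sum(follow_up_years)

# Figures from the exercise: 35 new cases over 3,525 person-years.
rate = incidence_rate(35, 3525)
print(f"Incidence rate: {rate:.2f} per 1,000 person-years")
```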
Prevalence vs incidence
The proportion of the population that has a disease at a point in time (prevalence)
and the rate of occurrence of new disease during a period of time (incidence) are
closely related
Prevalence depends on:
▪ The incidence rate
▪ The duration of disease
For example, if the incidence of a disease is low but the duration of disease (i.e. until
recovery or death) is long, the prevalence will be high relative to the incidence.
For example, diseases like tuberculosis tend to persist for a long duration, from
months to years, hence the prevalence (old and new cases) will be higher than the
incidence.
Conversely, if the incidence of a disease is high and the duration of the disease is
short, as with diarrhea, the prevalence will be low relative to the incidence.
∴ Prevalence ≈ Incidence × Duration (when incidence and duration are stable over time)
Other measures include
Morbidity: This is the state of having a specific illness or disease condition.
Morbidity is used to describe the health of large populations.
$\text{Morbidity rate} = \frac{\text{Number of persons with the disease condition}}{\text{Number of persons in the population at risk}} \times 1000$
2. Out of 2863 births in western Uganda, only 134 mothers died during
childbirth. Calculate the maternal mortality rate
SAMPLING
In research, it is rarely possible to collect data from every member of the population
(usually populations are so large that a researcher cannot examine the entire group);
instead, a sample is selected from the group to act as representative
Sampling
This is a technique of selecting individual members or a subset of the population to
make statistical inferences from them and estimate characteristics of the whole
population.
Population: The entire group of individuals
Sample
This is the selected group or items to represent the population in a research study or
specific group of individuals or items for data collection. The goal is to use the
results obtained from the sample to help answer questions about the population.
Sampling frame
This is the actual list of individuals that the sample will be drawn from. For
example, if you are doing research on working conditions at a pharmaceutical
manufacturing plant, the population is all 1152 employees of the plant and the
sampling frame is the plant’s HR database
For example, if a drug manufacturer would like to research the adverse side effects
of a drug on the country’s population, it is almost impossible to conduct a research
study that involves everyone. In this case, the researcher selects a sample of people
from each demographic group and studies them, obtaining indicative
feedback on the drug’s behavior
Sample size: the actual number of individuals to be included in the study
Purpose of sampling
▪ Sampling makes possible the study of a large, heterogeneous population (one
with different characteristics).
▪ Sampling saves resources
▪ Sampling saves time compared with studying the entire population.
SAMPLING METHODS
▪ Probability sampling.
▪ Non-probability sampling
PROBABILITY SAMPLING
In this sampling method, every member of the population has a known, non-zero
chance of being part of the sample (an equal chance in simple random sampling).
The researcher sets a few selection criteria and then chooses members of the
population randomly. The method is based on the theory of probability
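As a rough illustration of probability sampling, the sketch below draws a simple random sample from a sampling frame using Python's standard `random` module. The frame of employee IDs 1 to 1152 mirrors the HR database example; the sample size of 50 and the seed are arbitrary assumptions made only for this example.

```python
import random

# Hypothetical sampling frame: employee IDs 1..1152 from the HR database example.
sampling_frame = list(range(1, 1153))

# A fixed seed makes the draw reproducible; any seed (or none) would do.
rng = random.Random(2021)

# Simple random sampling without replacement: every employee has the same
# chance of selection. The sample size of 50 is an illustrative assumption.
sample = rng.sample(sampling_frame, k=50)

print(sorted(sample)[:10])
```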
NON-PROBABILITY SAMPLING
In this sampling method, not every member has an equal chance of being selected or
included in the sample.
Types include
▪ Judgmental or purposive sampling
▪ Convenience sampling
▪ Snowball sampling
▪ Quota sampling
▪ Consecutive sampling
Convenience sampling
A convenience sample simply includes the individuals who happen to be most
accessible to the researcher, such as patients at a certain hospital or passers-by
on a busy street. Researchers have almost no control over the selection of sample
elements; selection is based purely on proximity and not on representativeness.
Convenience sampling is most useful for pilot testing; it may also be
referred to as accidental, opportunity, or grab sampling
Data gathering with this method comes from people that are the easiest to reach or
contact. No criteria are in place for this sampling method beyond the willingness
and availability of people to participate in the work.
Advantages
▪ The sampling method is easier
▪ Affordable or cheap way to gather data.
▪ Saves time when gathering data
▪ Useful as an intervention to correct dissatisfaction.
▪ Provides qualitative information.
Disadvantages
▪ Sample doesn’t provide a representative result of the entire population
▪ May provide false data
▪ Results obtained may not be replicated in other settings
▪ Potential for bias during data collection
Judgmental or purposive sampling
This involves the researcher using their expertise to select a sample that is most
useful to the purposes of the research. It is often used in qualitative research, where
the researcher wants to gain detailed knowledge about a specific phenomenon rather
than make statistical inferences, or where the population is very small and specific.
An effective purposive sample must have clear criteria and rationale for inclusion.
For instance, suppose researchers want to understand the thought process of people
interested in studying for their master’s degree. The selection criterion will be: “Are
you interested in doing your masters in …?” and those who respond with a “No” are
excluded from the sample.
Snowball sampling
This is a method used if the population is hard to access. Snowball sampling can be
used to recruit participants via other participants.
The technique is used extensively in qualitative research on populations that are
hard to locate. One participant identifies other people in the same situation as
themselves, informs them about the benefits of the study and reassures them of
confidentiality. Examples include studies carried out among drug abusers,
sex workers, homeless people and HIV patients.
Advantages
▪ Allows for studies to be carried out in situations where there may be limited
participants or hard-to-trace participants
▪ Cost effective as the referrals are obtained from a primary data source
▪ Quicker to find participants as they come from reliable sources
Disadvantages:
▪ Some participants may refuse to participate in the study
▪ The sample obtained may be small hence potential sampling bias and margin
of error; and the study may provide inconclusive results.
Consecutive sampling
This is similar to convenience sampling, where participants are selected at the ease of
the researcher. The research is conducted over a period of time: data is collected
from one sample, and then the researcher moves on to another sample.
Quota sampling
In this method, the sample size is determined first and then quota is fixed for various
categories of population, which is followed while selecting the sample. In this case,
as a sample is formed based on specific attributes, the created sample will have the
same qualities found in total population
The method reduces cost of preparing sample and field work, since ultimate units
can be selected so that they are close together. However, there may be bias in
sample selection
Activity
✓ Explain the factors which determine the sample size for the study
Where:
▪ e is the desired level of precision (the margin of error),
▪ p is the (estimated) proportion of the population which has the attribute in
question,
▪ q is 1 – p.
The Z-value is found in a Z table; each confidence level corresponds to a Z-score:
Confidence level    Z-score
90%                 1.645
95%                 1.96
99%                 2.576
Calculate the sample size given p=0.5, precision of 5% and confidence level of 95%
Z = 1.96, p = 0.5, q = 1 − 0.5 = 0.5, e = 5% = 0.05
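The sample size formula itself does not appear in this section, so the sketch below assumes the standard formula for estimating a proportion, n = Z²pq / e², which uses exactly the symbols defined above. The function name `sample_size` is hypothetical, and rounding up to the next whole person is a conventional choice rather than something stated in the text.

```python
import math


def sample_size(z: float, p: float, e: float) -> int:
    """Assumed formula n = z^2 * p * q / e^2, rounded up to a whole person."""
    q = 1 - p
    n = (z ** 2) * p * q / (e ** 2)
    return math.ceil(n)


# Values from the example: 95% confidence (Z = 1.96), p = 0.5, e = 0.05.
n = sample_size(z=1.96, p=0.5, e=0.05)
print(n)
```

Under this assumed formula the unrounded value is 384.16, which is rounded up because a fraction of a participant cannot be recruited.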
STUDY DESIGNS
This is a plan for conducting research that allows a conceptual hypothesis to be
translated into an operational hypothesis
Some study designs are methods of collecting data on specific variables in a population with
respect to time, exposure and outcomes, while others involve studying interventions
or exposures applied to specific groups
Activity
✓ Explain the factors to consider when choosing a study design
A case series informs patients and physicians about history and prognosis, is
inexpensive to carry out and is also used in hypothesis generation. However, the cases
may not be representative of the population
Cross-sectional study
A study that examines the relationship between diseases (or other health-related
characteristics) and other variables of interest as they exist in a defined population at
one particular time (i.e exposure and outcomes are both measured at the same time).
Best for quantifying the prevalence of a disease or risk factor, and for quantifying
the accuracy of a diagnostic test.
In a cross-sectional study, data is collected at one point in time
Advantages:
▪ Easy or simple to perform
▪ It takes less time to perform
▪ Cheap or inexpensive as compared to analytical studies
▪ Ethically safe
▪ Hypothesis can be easily generated
▪ Useful in determining the prevalence of the disease
Disadvantages:
▪ Establishes association at most, not causality or cause and effect
▪ There may be bias
ANALYTICAL OBSERVATIONAL STUDY DESIGNS
These study designs are useful in testing etiological hypotheses, such as
▪ Whether there is any association between exposure (or risk factors) and the
outcome or disease, and whether the association is not due to chance
▪ The strength of association
Sub-types include
▪ Case control studies
▪ Cohort studies
Case-control study
This is a form of observational study that aims to identify risk factors for developing
the outcome of interest. Subjects with the outcome (cases) and without the outcome
(controls) are selected and risk factor exposure measurements are collected
retrospectively in both groups either from the subject or from any available records.
Cohort study
Cohort studies evaluate a possible association between exposure and outcome by
following a group of exposed individuals over a period of time (often years) to see
whether they develop the disease or outcome of interest. A cohort is a group of
individuals who share a common characteristic, and may be chosen based on a
population definition, or based on a particular exposure
A true experimental design involves a pre-test of the study group to establish baseline
information. The study group is then randomly assigned to either the test or experimental
group (given the treatment) or the control group (without treatment) under the same conditions.
A post-test is carried out on the members of the two groups to determine the effect
of the intervention or treatment, or sometimes the cause-and-effect relationship
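Random assignment to the experimental and control groups can be sketched as below. This is a simple shuffle-and-split illustration, not a full randomization protocol; the participant IDs, the seed and the function name `randomize` are hypothetical.

```python
import random


def randomize(participants, seed=None):
    """Shuffle the participants and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical study group of 20 participants.
participants = [f"P{i:02d}" for i in range(1, 21)]
treatment, control = randomize(participants, seed=7)

print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```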
Merits
▪ Unbiased distribution of confounders
▪ Blinding more likely
▪ Randomization facilitates statistical analysis
Disadvantages:
▪ Expensive to conduct
▪ It takes time
▪ Ethically problematic at times.
Quasi-experimental design
The prefix quasi means “resembling.” Thus quasi-experimental research is research
that resembles experimental research but is not true experimental research.
Quasi-experimental design aims to establish a cause-and-effect relationship between
an independent and dependent variable.
However, unlike a true experiment, a quasi-experiment does not rely on random
assignment. Instead, subjects are assigned to groups based on non-random criteria.
BLINDING
Different factors may affect the outcome or results between two variables under
study
Bias
A systematic error in the design, recruitment, data collection or analysis that results
in a mistaken estimation of the true effect of the exposure on the outcome.
Bias limits validity (the ability to measure the truth within the study design) and
generalizability (the ability to confidently apply the results to a larger population) of
study results. Groups or categories of bias include: selection and information bias
Information bias
This is error due to inaccurate measurement or the way data is obtained from
different study groups. Errors in measurement are also known as misclassification.
Sub-types include
Observer bias: results from the investigator’s prior knowledge of the hypothesis under
investigation or knowledge of an individual's exposure or disease status. Such
information may influence the way information is collected, measured or
interpreted by the investigator for each of the study groups.
For example, in a trial of a new medication to treat blood sugar, if the investigator is
aware which treatment arm participants were allocated to, this may influence their
reading of blood sugar levels. Observers may underestimate the blood sugar levels
in those who have been treated, and overestimate it in those in the control group.
Interviewer bias: occurs where an interviewer asks leading questions that may
systematically influence the responses given by interviewees
Recall (or response) bias: this is common in case-control studies, where data on
exposure is collected retrospectively. Recall bias may occur when the accuracy of the
information provided on exposure differs between cases and controls, or when
individuals cannot remember exposures accurately and therefore provide
inaccurate information
Missing data bias: this arises when certain information is not recorded for some participants
Social desirability bias: occurs where participants in a study answer in a manner they
feel will be seen as favorable by others or give answers the investigator wants to
hear
Detection bias: occurs due to different techniques being used to measure the outcome in
the two groups
Reporting bias: individuals selectively suppress or reveal information such as
smoking history
Instrument bias: this is where an inadequately calibrated measuring instrument
systematically over/ under-estimates measurement.
Minimizing bias
▪ Where possible, observers should be blinded to the exposure and disease
status of the individual
▪ Use of standardized questionnaires or pretested questionnaires
▪ Use calibrated instruments, such as sphygmomanometers.
▪ Training of interviewers.
▪ All data should be entered
▪ Development of standard tools for the collection, measurement and interpretation of
information.
Selection bias
This occurs when the study population is not representative of the target population,
so that the measurements obtained do not accurately represent the target population to
which conclusions are being extended.
Selection bias is a potential problem wherever individuals are selected for inclusion
in a study because of the presence or absence of certain characteristics.
Confounding
A situation in which the effect or association between an exposure and outcome is
distorted by the presence of another variable. A confounder is an extraneous variable
that wholly or partially accounts for the observed effect of a risk factor on disease
status. The presence of a confounder can lead to inaccurate results.
▪ Positive confounding is when the observed association is biased away from
the null. In other words, it overestimates the effect.
▪ Negative confounding is when the observed association is biased towards
the null. In other words, it underestimates the effect.
Effect modification
An effect modifier is a variable that differentially (positively or negatively) modifies
the observed effect of a risk factor on disease status. Different groups have different risk
estimates when effect modification is present
NORMAL DISTRIBUTION CURVE
1. At a certain institution, the mean weight of the students was 55 kg with a standard
deviation of 5 kg. Determine the lower and upper limits containing 68%, 95% and 99.7%
of the observations.
Mean ± 1 SD = 55 ± 5 = 50 to 60 kg, which contains about 68% of all the observations.
Mean ± 2 SD = 55 ± (5 × 2) = 55 ± 10 = 45 to 65 kg, which includes about 95% of
observations
Mean ± 3 SD = 55 ± (5 × 3) = 55 ± 15 = 40 to 70 kg, which includes about 99.7% of
observations
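The limits above can be checked with a short Python sketch. It uses `statistics.NormalDist` from the standard library (Python 3.8 or later) and assumes the weights are exactly normally distributed, which real data only approximate; the variable and function names are illustrative.

```python
from statistics import NormalDist

mean_kg, sd_kg = 55, 5
weights = NormalDist(mu=mean_kg, sigma=sd_kg)

# Lower and upper limits for mean ± 1, 2 and 3 standard deviations.
limits = {k: (mean_kg - k * sd_kg, mean_kg + k * sd_kg) for k in (1, 2, 3)}


def coverage(k: int) -> float:
    """Proportion of a normal distribution lying within mean ± k SD."""
    lower, upper = limits[k]
    return weights.cdf(upper) - weights.cdf(lower)


for k, (lower, upper) in limits.items():
    print(f"Mean ± {k} SD: {lower} to {upper} kg covers {coverage(k):.1%}")
```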
Positive Skewness
This is when the tail on the right side of the distribution is longer or fatter. A
positively skewed (or right-skewed) distribution is one in which
most of the values are clustered on the left side of the distribution while the right
tail is longer. The mean lies to the right of the peak value. The mean
and median will be greater than the mode.
Negative Skewness
This is when the tail of the left side of the distribution is longer or fatter than the tail
on the right side. The mean and median will be less than the mode
Kurtosis
This is a measure of whether the data are heavy-tailed or light-tailed relative to a
normal distribution. In other words, kurtosis identifies whether the tails of a given
distribution contain extreme values.
That is, data sets with high kurtosis tend to have heavy tails, or more outliers
than the normal distribution (i.e. values 3 or more standard deviations from the mean).
Data sets with low kurtosis tend to have light tails, or fewer outliers than the
normal distribution
The standard normal distribution has a kurtosis of 3
Types of kurtosis: the excess kurtosis (kurtosis minus 3) defines the type of kurtosis. The types include
Mesokurtic
Data that shows an excess kurtosis of zero or close to zero. This distribution
shows tails nearly similar to those of the normal distribution
Leptokurtic
The prefix "lepto-" means "skinny". Data with this distribution shows a positive excess
kurtosis, with heavy tails on either side, indicating large outliers which stretch
the horizontal axis of the graph, making the bulk of the data appear in a narrow
(skinny) vertical range. The kurtosis is greater than 3
Platykurtic
The prefix "platy-" means "broad". Data with this distribution shows a negative
excess kurtosis (kurtosis less than 3), with few or small outliers, giving flat, short tails
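Skewness and excess kurtosis can be computed directly from the central moments of a data set, as in the sketch below. It uses the simple population (moment) estimators; statistical packages often apply small-sample corrections, so their results may differ slightly. The data sets, the tolerance of 0.5 used to call a distribution mesokurtic, and all function names are assumptions made for illustration.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean)^k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    # Kurtosis minus 3, so that a normal distribution scores 0.
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


def kurtosis_type(data, tol=0.5):
    excess = excess_kurtosis(data)
    if excess > tol:
        return "leptokurtic"
    if excess < -tol:
        return "platykurtic"
    return "mesokurtic"


flat = [1, 2, 3, 4, 5]                            # evenly spread, no outliers
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, 10, -10]  # two extreme outliers
right_skewed = [1, 1, 1, 2, 10]                   # long right tail

print(kurtosis_type(flat), kurtosis_type(heavy_tailed))
print(skewness(right_skewed) > 0)
```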
PERMUTATIONS
When a letter occurs more than once in a word, divide the factorial of the total number of
letters in the word by the factorial of the number of occurrences of each repeated letter.
1 Find the number of words, with or without meaning, that can be formed
with the letters of the word ‘DRUGS’
‘DRUGS’ contains 5 letters.
Therefore, the number of words that can be formed with these 5 letters = 5!
= 5*4*3*2*1 = 120.
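The rule for repeated letters can be sketched in Python as follows. The function name `distinct_arrangements` is hypothetical, and the word 'TABLET' is an added illustration of a word with a repeated letter; only 'DRUGS' comes from the example above.

```python
from collections import Counter
from math import factorial


def distinct_arrangements(word: str) -> int:
    """n! divided by the factorial of the count of each repeated letter."""
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)
    return total


print(distinct_arrangements("DRUGS"))   # 5 different letters: 5!
print(distinct_arrangements("TABLET"))  # 6 letters with T twice: 6! / 2!
```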
COMBINATIONS
This is a way of selecting items from a collection such that (unlike permutations)
the order of selection does not matter
The different selections possible from the letters A, B, C, taken 2 at a time, are
AB, BC and CA.
It does not matter whether we select A after B or B after A. The order of selection is
not important in combinations
A combination is the choice of r things from a set of n things without replacement
and where order does not matter.
To find the number of combinations possible from a given group of n items, taken r
at a time, the formula, denoted by nCr, is
${}^{n}C_{r} = \frac{{}^{n}P_{r}}{r!} = \frac{n!}{r!\,(n-r)!}$
Special cases: ${}^{n}C_{n} = 1$, ${}^{n}C_{0} = 1$, ${}^{n}C_{1} = n$
${}^{n}C_{r} = {}^{n}C_{n-r}$
For example, verifying the above example, the different selections possible from the
letters A, B, C, taken two at a time, are
${}^{3}C_{2} = \frac{3!}{2!\,(3-2)!} = \frac{3 \times 2!}{2! \times 1!} = 3$ possible selections (AB, BC, CA)
${}^{12}C_{4} = \frac{12!}{4!\,(12-4)!} = \frac{12!}{4! \times 8!} = \frac{12 \times 11 \times 10 \times 9 \times 8!}{4 \times 3 \times 2 \times 1 \times 8!} = \frac{11880}{24} = 495$
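The combination formula can be checked with a few lines of Python. The helper `n_choose_r` is a hypothetical name that spells out the factorial formula; `math.comb` and `itertools.combinations` are standard-library functions used here as a cross-check.

```python
from itertools import combinations
from math import comb, factorial


def n_choose_r(n: int, r: int) -> int:
    """nCr = n! / (r! * (n - r)!)."""
    return factorial(n) // (factorial(r) * factorial(n - r))


# Selections of 2 letters from A, B, C (order does not matter).
pairs = ["".join(c) for c in combinations("ABC", 2)]
print(pairs, n_choose_r(3, 2))

# 12 items taken 4 at a time, checked against the built-in math.comb.
print(n_choose_r(12, 4), comb(12, 4))
```

Note that `combinations` lists the third pair as AC rather than CA; since order does not matter, these are the same selection.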
INTRODUCTION TO PROBABILITY
DEFINITIONS
Equally likely or symmetrical events: A set of events are said to be equally likely if
none has any preference of occurrence over the other thus all have the same chances
of occurrence or when there is no reason to expect the happening of one event in
preference to the other.
For example; when an unbiased coin is tossed the chances of getting a head or a tail
are the same.
Exhaustive events: All the possible outcomes of the experiments are known as
exhaustive events.
Favorable events: The outcomes which make necessary the happening of an event in
a trial are called favorable events. For example; if two dice are thrown, the number
of favorable events of getting a sum 5 is four, i.e., (1, 4), (2, 3), (3, 2) and (4, 1).
Simple events: This is when an event cannot be narrowed down into simpler events.
For example, during diagnosis, the presence of diabetes alone or of hypertension alone
is a simple event
Compound events: This is when an event can be further broken down
into simpler events. For example, during diagnosis, the presence of both diabetes and
hypertension is a compound event
Conditional probability: This is the probability that an event occurs given that
another has already occurred
Joint probability: This is probability of the intersection of two events or the
probability that a subject picked at random from a group of subjects possesses two
characteristics at the same time
Law of large numbers: If an experiment is repeated again and again, the probability
of an event obtained from the relative frequency approaches the actual or theoretical
probability
Marginal or simple probability: This is the probability of a single event without
consideration of any other event
Subjective probability: This is the probability assigned to an event based on
subjective judgement, experience, belief and information
Probability of occurrence of an event:
Notation for probability
Probability is expressed by the symbol ‘p’. A, B and C denote specific events
▪ P(A) denotes probability of event A occurring
▪ P(B) denotes probability of event B occurring
▪ P(C) denotes probability of event C occurring
Probability ranges from zero (0) to one (1), and the sum of the probabilities of all possible
outcomes of an experiment is equal to 1. Between these limits, a probability is always a positive proper fraction
When p = 0, it means there is no chance of an event happening or its occurrence is
impossible.
If p = 1, it means the chances of an event happening are 100%
If the probability of an event happening in a sample is p and that of not happening is
denoted by the symbol q, then p + q = 1 or q = 1 – p
2. From a pack of 52 cards, the probability of drawing any one of the 4 aces in one
attempt is
$p = \frac{4}{52} = \frac{1}{13}$
The probability of not drawing any one of the 4 aces in
one attempt from a pack of 52 cards is
$q = 1 - p = 1 - \frac{1}{13} = \frac{12}{13}$
3. The chance of getting male or female child in one pregnancy are fifty-fifty
or half and half; p = ½ or 0.5 and q = ½ or 0.5
4. In a certain village, twins were born twice among 126 different pregnant
women;
$p = \frac{\text{Number of twin births}}{\text{Total number of pregnant women}} = \frac{2}{126} = \frac{1}{63}$
The probability of a single birth is $q = 1 - \frac{1}{63} = \frac{62}{63}$
5. Out of 556 different drugs in a pharmacy, profits are made on only 440 drugs.
Calculate the probability of picking a profit-making drug during dispensing
6. At a cancer institute, only 4 cancer patients died out of 240 patients in the
month of March. Calculate the probability of survival and of death of cancer
patients in the month of March
7. The results from a research indicate; out of 1038 randomly selected adults,
52 believe that second-hand smoke is not at all harmful. Calculate the
probability of selecting a person who believes that second-hand smoke is
not at all harmful
8. If the probability of being rhesus negative is $\frac{1}{10}$, what is the probability of
being rhesus positive?
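The relationship q = 1 − p used in these examples can be sketched with exact fractions in Python. `fractions.Fraction` is part of the standard library; the helper name `probability` is hypothetical.

```python
from fractions import Fraction


def probability(favourable: int, total: int) -> Fraction:
    """Favourable outcomes over total outcomes, kept as an exact fraction."""
    return Fraction(favourable, total)


# Drawing an ace from a pack of 52 cards, and its complement.
p_ace = probability(4, 52)
q_ace = 1 - p_ace

# Twin births among 126 pregnancies.
p_twins = probability(2, 126)
q_single = 1 - p_twins

# Complement rule applied to the rhesus factor.
p_rh_negative = Fraction(1, 10)
p_rh_positive = 1 - p_rh_negative

print(p_ace, q_ace, p_twins, q_single, p_rh_positive)
```

`Fraction` reduces results to lowest terms automatically, which matches the hand simplification of 4/52 to 1/13.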
LAWS OF PROBABILITY
If there are n mutually exclusive and exhaustive events with individual probabilities
p1, p2, ..., pn, then the total probability P is calculated as P = p1 + p2 + ... + pn = 1
6. The probability of getting male or female child is ½. The total probability
is ½ + ½ = 1
7. In one cut, the chance of getting the king of hearts is $\frac{1}{52}$, and that of getting any of the
four kings will be
$\frac{1}{52} + \frac{1}{52} + \frac{1}{52} + \frac{1}{52} = \frac{4}{52} = \frac{1}{13}$
Addition law when events are not mutually exclusive; if events A and B are not
mutually exclusive, then the probability of at least one of the events A and B is
calculated as
P(A or B)=P(A+B)=P(A∪B) = P(A)+P(B)-P(AB)
This means
Probability of A plus B = Probability of A plus probability of B – Probability of
both A and B
8. You are provided with the following information about the disease conditions
of patients (diabetic and hypertensive patients)
                    Diabetic   Non-diabetic   Total
Hypertensive           48           74         122
Non-hypertensive      188           80         268
Total                 236          154         390
a) Probability of being diabetic
$P(\text{Diabetic}) = \frac{\text{Number of diabetic patients}}{\text{Total number of patients}} = \frac{236}{390} = 0.605$
If events A and B are mutually exclusive, P(A and B) = 0. Being a heavy
smoker and getting lung cancer, by contrast, are not mutually exclusive, because both
can occur in the same person
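The addition law for events that are not mutually exclusive can be illustrated with the diabetes and hypertension table. The sketch below is one way to lay out the calculation; the dictionary keys and variable names are hypothetical, and the counts are those given in the table.

```python
from fractions import Fraction

# Counts from the table: (hypertension status, diabetes status) -> patients.
counts = {
    ("hypertensive", "diabetic"): 48,
    ("hypertensive", "non-diabetic"): 74,
    ("non-hypertensive", "diabetic"): 188,
    ("non-hypertensive", "non-diabetic"): 80,
}
total = sum(counts.values())

p_diabetic = Fraction(48 + 188, total)
p_hypertensive = Fraction(48 + 74, total)
p_both = Fraction(48, total)

# Addition law: P(A or B) = P(A) + P(B) - P(A and B).
p_diabetic_or_hypertensive = p_diabetic + p_hypertensive - p_both

# Cross-check: everyone except the patients with neither condition.
p_neither = Fraction(80, total)

print(float(p_diabetic), float(p_diabetic_or_hypertensive))
```

Subtracting P(A and B) avoids counting the 48 patients who have both conditions twice.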
A probability tree diagram shows all the possible events. The first event is represented
by a dot. From the dot, branches are drawn to represent all possible outcomes of the event. The
probability of each outcome is written on its branch
There are two forms
▪ Probability with replacement
▪ Probability without replacement
Steps in tree diagrams
1. Draw the Probability Tree Diagram and write the probability of each branch
2. Look for all the available paths (branches) of a particular outcome
3. Multiply along the branches and add vertically to find the probability of the
outcome as shown below in the examples
$\text{Theoretical probability} = \frac{\text{Number of successful outcomes of the wanted item}}{\text{Total possible outcomes}}$
2. A box contains 8 blue and 4 red balls. A ball is drawn at random and then
replaced. A second ball is then also drawn at random. Calculate the
probability of getting
a) At least one blue: hint P(B, B) or P(B, R) or P(R, B)
b) One red and one blue: hint P(B,R ) or P(R, B)
c) Two of the same color: hint P(B, B) or P(R, R)
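The tree diagram for two draws with replacement can be sketched in Python by listing every path and multiplying along its branches. This is an illustrative sketch for the box of 8 blue and 4 red balls; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# With replacement, the branch probabilities are the same on every draw.
p = {"B": Fraction(8, 12), "R": Fraction(4, 12)}

# Each path through the tree is a pair of draws; multiply along the branches.
paths = {a + b: p[a] * p[b] for a, b in product("BR", repeat=2)}

# Add the probabilities of the paths that make up each outcome.
at_least_one_blue = paths["BB"] + paths["BR"] + paths["RB"]
one_red_one_blue = paths["BR"] + paths["RB"]
same_colour = paths["BB"] + paths["RR"]

print(at_least_one_blue, one_red_one_blue, same_colour)
```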
3. A tin in a pharmacy contains 16 capsules, 7 white and the rest
green. A capsule is picked at random and returned, then a second pick is
made.
Find the probability that
a) Both capsules are green
b) One capsule is white and another is green
4. There are 7 blue tablets and 5 yellow tablets in a small container, all used
in the treatment of the same condition but from different manufacturers. A tablet
is drawn three times from the container, with replacement each time. Draw a
probability tree and find the probability of picking
a) All blue b) Two blue and one yellow c) blue, yellow and blue
d) blue and two yellow e) All yellow
Probability without replacement
After one element is picked from the container and not replaced, the sample space
for the next pick will be smaller by 1.
For example, if a container has 13 elements and 1 is picked and not replaced, the
sample space will be 12 for the next pick
1. A container consists of 21 balls; 12 are green and 9 are blue. Two
balls are picked at random without replacement.
Note:
After 1 green is picked, 20 balls remain: 11 green and 9 blue
After 1 blue is picked, 20 balls remain: 12 green and 8 blue
Find the probability that
a) both balls are blue
P(Blue and Blue) = P(B on first draw) × P(B on second draw, given B on first)
$= \frac{9}{21} \times \frac{8}{20} = \frac{72}{420} = \frac{6}{35}$
b) One ball is blue and one ball is green.
c) If a third ball is drawn, find the probability it is
i) green
ii) At least 1 is blue (Hint elaborate the tree diagram to third draw)
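Drawing without replacement changes the branch probabilities after every pick, which the sketch below tracks by reducing the count of the colour just drawn. The function name `draw_without_replacement` is hypothetical; the counts are those of the container with 12 green and 9 blue balls.

```python
from fractions import Fraction


def draw_without_replacement(counts: dict, sequence: str) -> Fraction:
    """Probability of drawing the given colour sequence without replacement."""
    remaining = dict(counts)
    prob = Fraction(1)
    for colour in sequence:
        total = sum(remaining.values())
        prob *= Fraction(remaining[colour], total)
        remaining[colour] -= 1  # the drawn ball is not put back
    return prob


balls = {"G": 12, "B": 9}

both_blue = draw_without_replacement(balls, "BB")
# One blue and one green can happen in two orders: B then G, or G then B.
one_each = (draw_without_replacement(balls, "BG")
            + draw_without_replacement(balls, "GB"))

print(both_blue, one_each)
```

The same function extends to a third draw by passing a three-letter sequence such as "BBG".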
2. A tin in a pharmacy contains 16 capsules, 7 white and the rest
green. A student picks two capsules at random without replacement.
Find the probability that
a) Both capsules are green
b) One capsule is white and another is green
3. A patient has a bag containing 7 yellow lozenges and 5 red lozenges. He
eats one lozenge and then a second one after some time.
Find the probability that the patient eats
a) Yellow first and a red lozenge second
b) Two red lozenges
c) Two lozenges with the same color
4. There are 7 blue tablets and 5 yellow tablets in a small container, all used
in the treatment of the same condition but from different manufacturers. A tablet
is drawn three times from the container without replacement. Draw a
probability tree and find the probability of picking
a) All blue b) Two blue and one yellow c) blue, yellow and blue
d) blue and two yellow e) All yellow
b) Binomial law of probability distribution
The binomial distribution describes the number of successes in a fixed number of trials, or
observations, when each trial has the same probability of attaining one particular value. Most variables show
a particular pattern of frequency distribution and are depicted by standard deviation
and binomial distribution. The binomial distribution determines the probability of
observing a specified number of successful outcomes in a specified number of trials.
A binomial experiment is a statistical experiment that has the following properties:
▪ The experiment consists of n repeated trials.
▪ Each trial can result in just two possible outcomes. We call one of these
outcomes a success and the other, a failure.
▪ The probability of success, denoted by P, is the same on every trial.
▪ The trials are independent; that is, the outcome on one trial does not affect
the outcome on other trials.
The expected value, or mean, of a binomial distribution, is calculated by multiplying
the number of trials by the probability of successes.
For example, suppose you flip a coin 2 times and count the number of times the coin lands
on heads. This is a binomial experiment because the experiment consists of
repeated trials; each trial can result in just two possible outcomes, heads or tails; the
probability of success is constant, 0.5 on every trial; and the trials are independent,
that is, getting heads on one trial does not affect whether we get heads on other trials
The population under observation can be divided into two distinct groups like male
and female, live birth and still birth
When two children are born one after the other, the possible sequences will be any
of the following four:
The probability of getting a male or a female child is ½. The word “and”
between events implies multiplication
With three children, the chance of getting 2 of one sex and one of the opposite sex = 37.5 + 37.5 = 75%
The chance of getting at least one male child among three is
75 + 12.5 = 87.5%, because the probability of all three being females is 12.5% only,
i.e. 100 – 12.5 = 87.5%. Note that if the first 2 children are already female, the chance
that the third is male is still ½, since births are independent
Question 2
Find the probability that when a couple has 3 children, they will have exactly 2
boys. Assume that boys and girls are equally likely and that the gender of any child
is not influenced by the gender of any other child.
From the above,
$P(\text{2 boys in 3 births}) = \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{3}{8} = 0.375$
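The answer to Question 2 can be verified by listing all the possible birth sequences, as in the sketch below. It assumes, as the question does, that boys and girls are equally likely and that births are independent; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All sequences of 3 births: BBB, BBG, ..., GGG (8 equally likely sequences).
sequences = ["".join(s) for s in product("BG", repeat=3)]
p_each = Fraction(1, len(sequences))

# Sequences with exactly 2 boys.
exactly_two_boys = [s for s in sequences if s.count("B") == 2]
p_two_boys = p_each * len(exactly_two_boys)

# At least one boy is the complement of "all three girls".
p_at_least_one_boy = 1 - p_each

print(exactly_two_boys, p_two_boys, p_at_least_one_boy)
```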
BINOMIAL LAW OF PROBABILITY DISTRIBUTION
From the above, p = the probability of a ‘success’ and q = the probability of ‘failure or not
happening’. Thus p + q = 1. In childbirth, getting a boy has probability p and getting a
girl has probability q
The binomial law of probability distribution is formed by the terms of the expansion of
the binomial expression $(p + q)^n$, where n = sample size or number of events
Examples
When n = 2, the terms of the expansion of $(p + q)^2$ are $p^2$, $2pq$ and $q^2$
$(p + q)^2 = (p + q)(p + q) = p^2 + 2pq + q^2$
This means $p^2$ is the probability of getting 2 boys, $q^2$ of 2 girls and $2pq$ of one
boy and one girl
When n = 3, the terms of the expansion of $(p + q)^3$ are $p^3$, $3p^2q$, $3pq^2$ and $q^3$
$(p + q)^3 = p^3 + 3p^2q + 3pq^2 + q^3$. This means 3 boys ($p^3$), 3 girls ($q^3$), 1 boy and 2 girls
($3pq^2$), and 2 boys and 1 girl ($3p^2q$)
When n = 4, the terms of the expansion of $(p + q)^4$ are $p^4$, $4p^3q$, $6p^2q^2$, $4pq^3$ and $q^4$
$(p + q)^4 = p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + q^4$
Examples
For the next two questions: research in a certain village found that 65% of the children
were boys (p = 0.65). So girls (q) = 1 − 0.65 = 0.35
1. What are the chances of getting 2 boys, 2 girls, or one girl and one boy
after two pregnancies?
With 2 pregnancies there are 3 possible outcomes
$(p + q)^2 = p^2 + 2pq + q^2$; here $p^2$ is the probability of getting 2
boys, $q^2$ of 2 girls and $2pq$ of one boy and one girl
$(p + q)^2 = (0.65)^2 + (2 \times 0.65 \times 0.35) + (0.35)^2$
$= 0.4225 + 0.455 + 0.1225$
The probability of getting
▪ 2 boys (p2) = 0.4225 or 42.25%
▪ 2 girls(q2) = 0.1225 or 12.25%
▪ One boy and one girl (2pq) = 0.455 or 45.5%
2 What are the chances of getting 3 boys, 3 girls, or 1 boy and 2 girls, and 2
boys and 1 girl after three pregnancies
From (p + q)³ = p³ + 3p²q + 3pq² + q³; this means 3 boys (p³), 3 girls (q³), 1 boy
and 2 girls (3pq²), and 2 boys and 1 girl (3p²q)
(p + q)³ = (0.65)³ + (3 x 0.65² x 0.35) + (3 x 0.65 x 0.35²) + (0.35)³
= 0.2746 + 0.4436 + 0.2389 + 0.0429
The probability of getting
▪ 3 boys (p3) = 0.2746 or 27.46%
▪ 3 girls (q3) = 0.0429 or 4.29%
▪ 1 boy and 2 girls (3pq2) = 0.2389 or 23.89%
▪ 2 boys and 1 girl (3p2q) = 0.4436 or 44.36%
3 In four projects, 20% of children under 8 years of age were found to be
severely malnourished. If only 4 children were selected at random from
the four projects, what are the probabilities of selecting severely
malnourished and non-malnourished (healthy) children?
Let p be the probability of a severely malnourished child and q that of a
non-malnourished (healthy) child
Thus; p = 0.2 and q = 1 - 0.2 = 0.8, n = 4
(p + q)⁴ = p⁴ + 4p³q + 6p²q² + 4pq³ + q⁴
The probability of getting
▪ All 4 severely malnourished (p⁴) = (0.2)⁴ = 0.0016 or 0.16%
▪ Only 3 severely malnourished (4p³q) = 4 x 0.2³ x 0.8 = 0.0256 or 2.56%
▪ Only 2 severely malnourished (6p²q²) = 6 x 0.2² x 0.8² = 0.1536 or 15.36%
▪ Only 1 severely malnourished (4pq³) = 4 x 0.2 x 0.8³ = 0.4096 or 40.96%
▪ None malnourished (all 4 healthy) (q⁴) = (0.8)⁴ = 0.4096 or 40.96%
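The terms of the binomial expansion can also be checked numerically. The short Python sketch below is only an illustration (the function name binomial_terms is an assumption, not part of the handbook); it lists every term of (p + q)ⁿ for the malnutrition example with p = 0.2 and n = 4.

```python
from math import comb

def binomial_terms(n, p):
    """Return the n + 1 terms of (p + q)^n, indexed by the number of 'successes'."""
    q = 1 - p
    # Term k is (number of arrangements) x p^k x q^(n - k)
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

# Malnutrition example: p = 0.2 (severely malnourished), n = 4 children
terms = binomial_terms(4, 0.2)
for k, prob in enumerate(terms):
    print(f"{k} severely malnourished: {prob:.4f}")
print("Sum of all terms:", round(sum(terms), 4))
```

Because the terms cover every possible outcome, they should always add up to 1; this is a quick check on any hand calculation.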
c) Probability from shape of normal distribution or normal curve
A normal distribution, sometimes called the bell curve because the graph of its
probability density looks like a bell
Also known as the Gaussian distribution, after the German mathematician Carl
Gauss who first described it;
It is a distribution that occurs naturally in many situations especially in many
continuous data such as heights of people. The bell curve is symmetrical. Half of the
data will fall to the left of the mean; half will fall to the right.
The standard deviation controls the spread of the distribution. A smaller standard
deviation indicates that the data is tightly clustered around the mean; the normal
distribution will be taller. A larger standard deviation indicates that the data is
spread out around the mean; the normal distribution will be flatter and wider.
The area under the normal distribution curve represents probability and the total area
under the curve sums to one
Most of the continuous data values in a normal distribution tend to cluster around
the mean, and the further a value is from the mean, the less likely it is to occur. The
tails are asymptotic, which means that they approach but never quite meet the
horizontal axis (i.e. the x-axis).
For a perfectly normal distribution the mean, median and mode will be the same
value, visually represented by the peak of the curve.
What is the empirical rule formula?
The empirical rule in statistics allows researchers to determine the proportion of
values that fall within certain distances from the mean. The empirical rule is often
referred to as the three-sigma rule or the 68-95-99.7 rule.
If the data values in a normal distribution are converted to standard score (z-score)
in a standard normal distribution the empirical rule describes the percentage of the
data that fall within specific numbers of standard deviations (σ) from the mean (X)
for bell-shaped curves.
The empirical rule allows researchers to calculate the probability of randomly
obtaining a score from a normal distribution
▪ 68% of data falls within the first standard deviation from the mean. This
means there is a 68% probability of randomly selecting a score between -1
and +1 standard deviations from the mean.
▪ 95% of the values fall within two standard deviations from the mean. This
means there is a 95% probability of randomly selecting a score between -2
and +2 standard deviations from the mean.
▪ 99.7% of data will fall within three standard deviations from the mean. This
means there is a 99.7% probability of randomly selecting a score between -3
and +3 standard deviations from the mean.
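The three percentages of the empirical rule can be reproduced from the standard normal distribution. The sketch below is an illustration only; it uses the identity P(-k < Z < k) = erf(k/√2) for a standard normal variable Z, and the function name prob_within is an assumption.

```python
from math import erf, sqrt

def prob_within(k):
    """Probability that a standard normal value lies within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {prob_within(k) * 100:.2f}%")
```

The printed values round to the 68%, 95% and 99.7% quoted in the rule.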
d) Probability of calculated values from tables
Probability of calculated values occurring by chance in case of ‘t’ and χ² (Chi-square)
is determined by referring to the respective tables. The tables are found in
textbooks
Conditional probability
For example:
If A and B are two events, then the conditional probability A given B is written as P
(A | B); read as “the probability of A given that B has already occurred.”
The probability of A given B
P(A|B) = P(A and B) / P(B) = P(A ∩ B) / P(B); where P(B) > 0
Where:
▪ P(A|B) – the conditional probability; the probability of event A occurring
given that event B has already occurred
▪ P(A ∩ B) – the joint probability of events A and B; the probability that both
events A and B occur
▪ P(B) – the probability of event B
Conditions for the independence of two events A and B
P(A|B) = P(A) and P(B|A) = P(B)
P(A and B) = P(A) x P(B)
▪ False positive (FP) results when a test indicates a positive status when the
true status is negative
▪ True negative results: test indicates a negative status when the true status is
negative
▪ False negative results when a test indicates a negative status when the true
status is positive.
▪ Sensitivity of the test (true-positive rate, or TPR) - this represents the
likelihood of a positive test in a diseased person or the conditional
probability of a positive test result given the presence of disease P(T+|D+)
= number of diseased patients with a positive test / number of diseased patients
= a / (a + c)
Examples
2. Suppose that a population of N=120 men over 50 years of age who are
considered at high risk for prostate cancer have both the prostate-specific
antigen (PSA) screening test and a biopsy. The PSA results are reported as
low, slightly to moderately elevated or highly elevated based on the following
levels of measured protein, respectively: 0-2.5, 2.6-19.9 and 20 or more
nanograms per milliliter. The biopsy results of the study are shown below.
PSA Level (Screening Test)                    Prostate Cancer   No Prostate Cancer   Totals
Low (0-2.5 ng/ml)                             3                 61                   64
Slight/Moderate Elevation (2.6-19.9 ng/ml)    13                28                   41
Highly Elevated (≥20 ng/ml)                   12                3                    15
Totals                                        28                92                   120
Calculate the predictive value that a man has prostate cancer given that he has low,
slightly to moderately elevated and highly elevated levels of PSA (Ans: 3/64 = 0.047,
13/41 = 0.317 and 12/15 = 0.80 respectively)
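The three predictive values are conditional probabilities: the number of men with prostate cancer in a PSA row divided by that row's total. A minimal Python sketch of the calculation, using the counts from the table above (the names psa_counts and predictive_value are illustrative assumptions):

```python
# Counts from the PSA table: (prostate cancer, no prostate cancer)
psa_counts = {
    "low": (3, 61),
    "slight/moderate": (13, 28),
    "high": (12, 3),
}

def predictive_value(cancer, no_cancer):
    """P(prostate cancer | PSA level) = cancer count / row total."""
    return cancer / (cancer + no_cancer)

for level, (cancer, no_cancer) in psa_counts.items():
    print(f"{level}: {predictive_value(cancer, no_cancer):.3f}")
```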
Bayes theorem
From conditional probability;
P(A|B) = P(A ∩ B) / P(B)
:. P(A ∩ B) = P(A|B) x P(B) ------- i
P(B|A) = P(A ∩ B) / P(A)
:. P(A ∩ B) = P(B|A) x P(A) -------- ii
Equating i and ii
P(A|B) x P(B) = P(B|A) x P(A)
P(A|B) = P(B|A) x P(A) / P(B) ----------- iii
1. In order to manage credit history and risk at a wholesale pharmacy, the
company rates the borrowers as lowest risk, medium risk and highest risk.
Risk means the chance that a borrower might fail to pay back. Based on
historical data, on average 30% of customers are rated lowest risk, 60% medium
risk and 10% highest risk. After a survey it was found that 1% of lowest
risk, 10% of medium risk and 18% of highest risk customers failed to pay back
for products purchased on credit. If a customer was randomly picked from the
defaulters list
a) What is the probability that they had received a lowest risk rating?
Let A1 - lowest risk customers, A2 - medium risk customers and A3 - highest
risk customers.
Defaulter – D (in Bayes’ theorem)
Based on the scenario, the question asks for P(Rating A1|Defaulter)
Given;
P(Rating A1) = 30% = 0.3
P(Rating A2) = 60% = 0.6
P(Rating A3) = 10% = 0.1
P(Defaulter|Rating A1) = 1% = 0.01
P(Defaulter|Rating A2) = 10% = 0.1
P(Defaulter|Rating A3) = 18% = 0.18
Bayes’ Theorem
P(rating A1|Defaulter) = P(D|A1) x P(A1) / ∑ P(Ai) x P(D|Ai), summed over i = 1, 2, 3
P(D|A1) x P(A1) = 0.01 x 0.3 = 0.003
∑ P(Ai) x P(D|Ai) = P(A1) x P(D|A1) + P(A2) x P(D|A2) + P(A3) x P(D|A3)
= 0.3 x 0.01 + 0.6 x 0.1 + 0.1 x 0.18 = 0.003 + 0.06 + 0.018 = 0.081
P(rating A1|Defaulter) = 0.003 / 0.081 = 0.037 = 3.7%
The probability that the customer picked was given the lowest risk rating = 3.7%
b) What is the probability that they had received a medium risk rating?
P(rating A2|Defaulter) = P(D|A2) x P(A2) / ∑ P(Ai) x P(D|Ai), summed over i = 1, 2, 3
= (0.1 x 0.6) / 0.081 = 0.06 / 0.081 = 0.741 = 74.1%
c) What is the probability that they had received a highest risk rating?
P(rating A3|Defaulter) = P(D|A3) x P(A3) / ∑ P(Ai) x P(D|Ai), summed over i = 1, 2, 3
= (0.18 x 0.1) / 0.081 = 0.018 / 0.081 = 0.222 = 22.2%
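Bayes' theorem for several mutually exclusive ratings can be sketched in a few lines of Python. This is an illustration under the figures given in the question; the function name posteriors is an assumption.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' theorem to mutually exclusive, exhaustive events A1..An.

    priors      -- P(Ai) for each event
    likelihoods -- P(D|Ai) for each event
    Returns P(Ai|D) for each event.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(Ai) x P(D|Ai)
    total = sum(joint)                                    # P(D), the denominator
    return [j / total for j in joint]

priors = [0.3, 0.6, 0.1]            # lowest, medium, highest risk ratings
default_rates = [0.01, 0.1, 0.18]   # P(Defaulter | rating)
post = posteriors(priors, default_rates)
for name, value in zip(("lowest", "medium", "highest"), post):
    print(f"P({name} risk | defaulter) = {value:.3f}")
```

Since a defaulter must have received one of the three ratings, the three posterior probabilities should add up to 1. This is a useful check on the answers to parts a) to c).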
2. A certain virus infects one in every 400 people. A test used to detect the virus
in a person is positive 85% of the time if the person has the virus and 5% of
the time if the person does not have the virus. Consider a sample of 10,000
people
a) Find the probability that a person has a virus given that they have
tested positive
The question can be attempted by using conditional probability or Bayes’
theorem
Let A be event the person has the virus and B be the event the person tests
positive
Let A’ be the event that the person does not have the virus and B’ be the event that
the person tests negative
P(A) = 1/400 = 0.0025 (total people with virus = 0.0025 x 10 000 = 25)
P(A’) = (400 - 1)/400 = 0.9975 (total people with no virus = 0.9975 x 10 000 = 9975)
P(B|A) = 85% = 0.85 (test positive with the virus = 0.85 x 25 = 21.25)
P(B|A’) = 5% = 0.05 (test positive with no virus = 0.05 x 9975 = 498.75)
Note: The counts are expected values, so decimal values are retained
               Positive B    Negative B’    Total
Virus A        21.25         3.75           25
No virus A’    498.75        9476.25        9975
Total          520           9480           10 000
The probability that a person has the virus given that they have tested positive
P(A|B) = 21.25 / 520 = 0.0409 = 4.09%
Use of Bayes’ theorem;
P(A|B) = P(A) x P(B|A) / [P(A) x P(B|A) + P(A′) x P(B|A′)]
= (0.0025 x 0.85) / (0.0025 x 0.85 + 0.9975 x 0.05) = 0.0409 = 4.09%
b) Find the probability that a person does not have the virus given that they test
negative
P(A’|B’) = 9476.25 / 9480 = 0.9996 = 99.96%
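The 2 x 2 table for the virus example can be rebuilt from the three figures in the question: the prevalence, the chance of a positive test with the virus and the chance of a positive test without it. A minimal sketch follows (the function and variable names are illustrative assumptions):

```python
def screening_table(prevalence, sensitivity, false_positive_rate, population):
    """Return expected counts (true pos, false neg, false pos, true neg)."""
    with_virus = prevalence * population
    without_virus = population - with_virus
    true_pos = sensitivity * with_virus
    false_neg = with_virus - true_pos
    false_pos = false_positive_rate * without_virus
    true_neg = without_virus - false_pos
    return true_pos, false_neg, false_pos, true_neg

tp, fn, fp, tn = screening_table(1 / 400, 0.85, 0.05, 10_000)
ppv = tp / (tp + fp)   # P(virus | positive test)
npv = tn / (tn + fn)   # P(no virus | negative test)
print(f"P(virus | positive) = {ppv:.4f}")
print(f"P(no virus | negative) = {npv:.4f}")
```

Even with a fairly accurate test, a positive result here is much more likely to be a false positive than a true one, because the virus is rare.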
7. A particular study showed that 12% of men will likely develop prostate cancer
at some point. A man with prostate cancer has a 95% chance of a positive test
result after screening. A man without prostate cancer has a 6% chance of getting
a false positive test result. What is the probability that a man has cancer given
he has a positive result? (Ans: 0.683)
8. A factory has two machines I and II. Machine I produces 40% of items of the
output and Machine II produces 60% of the items. Further 4% of items
produced by Machine I are defective and 5% produced by Machine II are
defective. An item is drawn at random. If the drawn item is defective, find the
probability that it was produced by Machine II.
9. The chances of X, Y and Z becoming managers of a certain company are 4 : 2 :
3. The probabilities that bonus scheme will be introduced if X, Y and Z
become managers are 0.3, 0.5 and 0.4 respectively. If the bonus scheme has
been introduced, what is the probability that Z was appointed as the manager?
Let A1, A2 and A3 be the events of X, Y and Z becoming managers of the
company respectively. Let B be the event that the bonus scheme will be
introduced.
We have to find the conditional probability P (A3|B).
Total ratio = 4 + 2 + 3 = 9
P(A1) = 4/9, P(A2) = 2/9, P(A3) = 3/9
P(B|A1) = 0.3, P(B|A2) = 0.5, P(B|A3) = 0.4
P(A3|B) = P(A3) x P(B|A3) / [P(A1) x P(B|A1) + P(A2) x P(B|A2) + P(A3) x P(B|A3)]
= (3/9 x 0.4) / (4/9 x 0.3 + 2/9 x 0.5 + 3/9 x 0.4) = 1.2 / 3.4 = 0.353
This is applied when a random process or experiment, called a trial, can result in
only one of two mutually exclusive and collectively exhaustive outcomes, called
a success and a failure. Examples include dead or alive; sick or well; boy or girl;
pass or fail and others. Such a trial is referred to as a Bernoulli trial, named
in honour of the Swiss mathematician Jacob Bernoulli (1654–1705), and an experiment
made up of such trials is a binomial experiment
A binomial experiment or Bernoulli trial is a statistical experiment that has the
following properties or satisfies the following four conditions:
▪ The experiment consists of n repeated trials or there are n identical trials all
performed under identical conditions
▪ Each trial results in one of the two mutually exclusive and collectively
exhaustive outcomes; called a success and a failure.
▪ The probability of success, denoted by p, and the probability of failure,
denoted by q (with p + q = 1), are the same on every trial, i.e. they remain
constant for each trial
▪ The trials are independent; that is, the outcome on one trial does not affect
the outcome on other trials.
Notation
The following notation is helpful in binomial probability.
▪ X: The number of successes that result from the binomial experiment in n
trials is called binomial random variable; x = 0,1,2,3 …….n
▪ n: The number of trials in the binomial experiment.
▪ p: The probability of success on an individual trial. P(X = x) gives the
probability of x successes in n binomial trials
▪ q: The probability of failure on an individual trial. (This is equal to 1 - p.)
▪ n!: The factorial of n (also known as n factorial). The factorial of n is the
product of all whole numbers from n down to 1, e.g. 4! = 4 x 3 x 2 x 1; by definition 0! = 1
▪ b(x; n, p): Binomial probability - the probability that an n-trial binomial
experiment results in exactly x successes, when the probability of success on
an individual trial is p.
▪ nCr: The number of combinations of n things, taken r at a time; nCx is the
number of ways of choosing x successes from n trials
Let "n" denote the number of observations or the number of times the process is
repeated, and "x" denotes the number of "successes" or events of interest occurring
during "n" observations. The probability of "success" or occurrence of the outcome
of interest is indicated by "p".
The probability distribution of the random variable X is called a binomial
distribution, and is given by the formula:
P(X = x) = nCx × p^x × q^(n − x), where nCx = n! / (x! × (n − x)!) (from permutation and combination)
Or P(X = x “successes”) = [n! / (x! × (n − x)!)] × p^x × (1 − p)^(n − x)
Where x = 0, 1, 2, 3, ........ n, q = 1 – p
The binomial distribution has the following properties:
▪ The mean value of binomial distribution (μ) = np
▪ The variance of binomial distribution (σ²) = np(1 − p) or npq
▪ The standard deviation (σ) = √(np(1 − p))
1. Suppose that 80% of adults with allergies report symptomatic relief with a
specific medication. If the medication is given to 10 new patients with
allergies;
a) What is the probability that it is effective in exactly seven?
Observation; n = 10
Event of interest or success; x=7
p = 80% = 0.8
P(X=7) = [10! / (7! × (10 − 7)!)] × 0.8⁷ × (1 − 0.8)^(10 − 7)
P(X=7) = [(10 × 9 × 8 × 7!) / (7! × 3 × 2 × 1)] × 0.2097 × 0.008
P(X=7) = (720 / 6) × 0.2097 × 0.008 = 120 × 0.2097 × 0.008
P(X=7) = 0.2013
There is a 20.13% probability that exactly 7 of 10 patients will report relief
from symptoms when the probability that any one reports relief is 80%.
b) What is the probability that none will report effectiveness of the drug?
x = 0
P(X=0) = [10! / (0! × (10 − 0)!)] × 0.8⁰ × (1 − 0.8)^(10 − 0)
P(X=0) = (10! / 10!) × 1 × (0.2)¹⁰ = 0.0000001024
There is practically no chance that none of the 10 will report relief from
symptoms when the probability of reporting relief for any individual patient
is 80%.
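The binomial formula and the properties of the distribution can be checked with a short Python sketch. It is an illustration for the allergy example (n = 10, p = 0.8); the function name binomial_pmf is an assumption.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x): probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.8
p_seven = binomial_pmf(7, n, p)   # exactly seven patients report relief
p_none = binomial_pmf(0, n, p)    # no patient reports relief

mean = n * p                      # np
variance = n * p * (1 - p)        # np(1 - p)
sd = sqrt(variance)

print(f"P(X = 7) = {p_seven:.4f}")
print(f"P(X = 0) = {p_none:.10f}")
print(f"mean = {mean:.1f}, variance = {variance:.1f}, standard deviation = {sd:.3f}")
```

For this example the mean is 8, so on average 8 of every 10 patients are expected to report relief.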
HYPOTHESIS
Alternative hypothesis
This states that there is a significant difference, change, effect or relationship
between the parameters. It is the logical opposite of the null hypothesis: it
contradicts the null hypothesis by stating that the actual value of the population
parameter is less than, greater than or not equal to the value in the null hypothesis
The alternative hypothesis states that there is a relationship between the two variables
of the study and that the results are significant to the research topic, i.e. the
independent variable has an effect on the dependent variable. It is denoted by “Ha” or “H1”.
H1: µ1 ≠ µ2, H1: µ1 > µ2 or H1: µ1 < µ2
Mathematical symbols used
Null hypothesis: Ho                            Alternative hypothesis: Ha or H1
(=) equal to, is, same as, not changed from    (≠) not equal, different from, changed from, not same as
(≥) greater than or equal to, at least         (<) less than, below, lower than, shorter than, smaller than, decreased or reduced
(≤) less than or equal to, at most             (>) greater than, above, higher than, longer than, bigger than, increased
State the null and alternative hypotheses of the following statements
1. The average age of patients visiting hospital in Fort Portal city is 28 years
Ho: µ = 28 years
H1: µ ≠ 28 years
2. The school records claim that the mean score in Biostatistics is 53% over the
last 2 years. The new tutor in Biostatistics wishes to find out if the claim is true.
The tutor tests if there is a significant difference between the average marks in
the school records and those of students in his class
Null hypothesis: The mean score in Biostatistics of students is 53%; Ho: µ =
53%
Alternative hypothesis: The mean score in Biostatistics of students is not
53%; H1: µ ≠ 53%
3. The average number of drugs in a pharmacy is at most 850
Ho: µ ≤ 850 drugs
H1: µ > 850 drugs
4. The researcher wants to test whether the average height of boys in certain
group is different from 153 cm
Ho: µ = 153 cm
H1: µ ≠ 153 cm
5. A pharmacy student wishes to carry out a study on the volume of a cough syrup
packed in a bottle, to find out if it is less than the 500 ml indicated on the label
Ho: µ ≥ 500 ml
H1: µ < 500 ml
HYPOTHESIS TESTING
Formulate H1 and H0
Null hypothesis represents status quo. Alternative hypothesis represents the desired
result.
Choose level of significance
The significance level (α) states the probability of incorrectly rejecting H0 when it is true.
Types of error which can occur include: Type I and Type II errors (discussed later)
Usually, researchers select a significance level of 5% (0.05) or 1% (0.01), and it is set
prior to actual testing
:. This means there is a 5% (0.05) or 1% (0.01) probability of wrongly rejecting a true
null hypothesis, or that the researcher desires to be 95% or 99% confident before
rejecting the null hypothesis
Type I error
These are false positives, which happen in hypothesis testing when the null
hypothesis is true but is rejected. A Type I error occurs when the researcher reports a
statistically significant difference when there isn’t any. A test with a 95% confidence
level means that there is a 5% chance or probability of making a Type I error. The
significance level thus specifies the criterion for deciding whether the claim being
tested is rejected or not
This can happen through an inappropriate statistical analysis technique, poor sample
selection method, small sample size, chance or bad luck etc. This error is commonly
known as Type I error or false positive, and its probability is denoted by alpha (α)
Type II error
This occurs when the researcher accepts (fails to reject) a null hypothesis that is false.
The probability of committing a Type II error is denoted by beta (β). The power of the
test is the probability of correctly rejecting a false null hypothesis; it is determined
by 1 - β and is generally kept at 80%
This plays a role in sample size determination
                 Accept null           Reject null
Null is true     Correct – no error    Type I error
Null is false    Type II error         Correct – no error
Note: Both are serious, but traditionally Type I error has been considered more
serious; that is why the objective of hypothesis testing is to reject H0 only when there
is enough evidence against it.
Therefore, α is chosen to be as small as possible without compromising β.
Increasing the sample size for a given α will decrease β
Select appropriate test for the hypothesis
The test statistic generates the p-value used to determine whether the null hypothesis
should be rejected or retained (accepted). The selection of a proper test depends on:
▪ Scale of the data: categorical or interval
▪ Statistic to be measured or compared: means
▪ Sampling distribution of such statistic: Normal Distribution or T
Distribution
▪ Number of variables to be measured or compared: univariate, bivariate and
multivariate
Examples of test
▪ Means
▪ Chi-square for proportions
▪ T-test (One tailed test or Two tailed test)
▪ Analysis of Variance (ANOVA)
▪ Z-test
Degrees of freedom
They are used to look up the critical value in the reference table; the value from the
test calculation (using t-test, chi-square, ANOVA, z-test etc) is then compared with
the critical value at the given confidence level
Degrees of freedom of an estimate is the number of independent pieces of
information that went into calculating the estimate. To get the degrees of
freedom (df) for an estimate from a single sample, 1 is subtracted from the number of
items under consideration
Degrees of freedom for one sample t-test = n - 1
Degrees of freedom for two sample t-test = n1 + n2 - 2
Formula for t-test (explained later)
Df 5% 1% 0.1%
NOTE:
1. Most computer packages incorporate the tests especially SPSS
2. The degrees of freedom are calculated and used to look up the critical value in the
reference tables found in the official books
3. This is an introduction on statistical tests; refer to official books for more
information
Goodness-of-Fit test
Before performing a parametric test, check whether the data are normally distributed by
plotting a histogram; a goodness-of-fit test may also be performed
They are statistical methods often used to make inferences about observed values; to
see how well sample data fit a distribution from a population with a normal
distribution or determines if sample data represents the data expected to be found in
the actual population.
Goodness-of-Fit tests can help determine if a sample follows a normal distribution,
if categorical variables are related, or if random samples are from the same
distribution.
Examples include
▪ The chi-square.
▪ Kolmogorov-Smirnov test
▪ Anderson-Darling test
▪ Shapiro-Wilk test
Chi Square
Chi-square is a non-parametric test, developed by Karl Pearson
This is the most commonly used goodness-of-fit test; it is used for categorical data such
as gender, marital status, religion, color, race etc but not for numerical data. Chi-square
tests involve checking whether observed frequencies in one or more categories match
expected frequencies
The data used in calculating a chi-square statistic must be random, raw, mutually
exclusive, drawn from independent variables, and drawn from a large enough
sample
χ is the Greek letter chi; χ² is read as chi-square. Formula for
Chi-square
χ² = ∑ (O - E)² / E, where O = observed frequency and E = expected frequency
If the resulting p-value is lower than alpha, the null hypothesis is rejected, indicating
that a relationship exists between the variables
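A chi-square statistic compares observed frequencies with expected frequencies by summing (observed - expected)² / expected over all categories. The sketch below illustrates this standard calculation; the counts are hypothetical, made up only for illustration, and the function name chi_square_statistic is an assumption.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: 60 patients expected to be spread equally over 3 categories
observed = [18, 22, 20]
expected = [20, 20, 20]

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1   # goodness-of-fit: number of categories minus 1
print(f"chi-square = {chi2:.2f} with {df} degrees of freedom")
```

The calculated value is then compared with the critical value from the chi-square table at the chosen level of significance.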
Application of chi-square test
Used in
▪ Test for homogeneity
▪ Goodness of fit of distribution
▪ Test for independence of parameters
Conditions for chi-square test
▪ Data must be in the form of frequencies
▪ Observations must be recorded on a random basis
▪ Parameters must be independent
▪ Data must be organized into groups or categories with precise numerical
value
▪ Large sample size needed of about 50
Merits of chi-square test
▪ Can be applied for any distribution, either discrete or continuous, for which
the cumulative distribution function can be computed
▪ Can test association between variables
▪ Identifies difference between observed and expected values
▪ Easy and flexible calculation
▪ Provides detailed information
Limitations of chi-square
▪ Data must be in the form of frequencies (counts)
▪ It requires sufficient sample size
▪ Test does not indicate cause and effect
▪ Data must be from random sample
▪ Difficult to interpret
Z-test
The test is used for comparing the mean of sample to some hypothesized mean for
the population in case of large sample; thus z-test is used when sample size is large
(≥30 elements) and when the standard deviation of population is known
Z = (X̄ - µ) / SE; where
X̄ – mean of sample, µ - population mean
SE – standard error = σ / √n; σ - standard deviation of the population, n – sample size
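The z statistic is the difference between the sample mean and the population mean, divided by the standard error. The sketch below shows the arithmetic; the numbers are hypothetical values chosen only for illustration, and the function name z_statistic is an assumption.

```python
from math import sqrt

def z_statistic(sample_mean, population_mean, population_sd, n):
    """Z = (sample mean - population mean) / (population SD / sqrt(n))."""
    standard_error = population_sd / sqrt(n)
    return (sample_mean - population_mean) / standard_error

# Hypothetical values: sample of 100 with mean 52, population mean 50, population SD 10
z = z_statistic(sample_mean=52.0, population_mean=50.0, population_sd=10.0, n=100)
print(f"z = {z:.2f}")
```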
Student’s t-test
A t-test is any statistical hypothesis test in which the test statistic follows a
Student’s t-distribution if the null hypothesis is true
t-tests are applied where the sample is small and the population standard deviation is
unknown. The computation is exactly like that of the z-test, except that the standard
deviation of the sample is used instead of the standard deviation of the population
The t-test is used to compare the means of two samples and in the calculation of the
confidence interval for a sample mean
The t-statistic was introduced in 1908 by William Sealy Gosset, who published under
the pseudonym “Student”
Student’s t-distribution
A family of continuous probability distributions that arises when estimating the
mean of a normally distributed population in situations where the sample size is
small and population standard deviation is unknown.
It plays a role in a number of widely used statistical analyses, including the
Student’s t-test for assessing the statistical significance of the difference between
two sample means, the construction of confidence intervals for the difference
between two population means, and in linear regression analysis.
The t-distribution can be used to estimate how likely it is that the true mean lies in
any given range
The t-distribution table lists, for each number of degrees of freedom, the corresponding
critical t-value at different confidence levels (explained above)
Assumption of t-test
▪ Samples are randomly selected
▪ Data is normally distributed
▪ Data variables are interval
Types or uses of t-test
▪ One-sample location t-test to compare one data sample to a hypothetical
distribution; used in measuring whether a sample value significantly differs
from a hypothesized value or whether the mean of a normally distributed
population has a value specified in a null hypothesis. The test applies to
continuous or non-continuous data that have a distribution that is not
significantly different from normal.
▪ Paired t-test to compare paired data samples: this is used to compare two
population means where there are two samples in which observations in one
sample can be paired with observations in the other sample; such as a
comparison of two different treatments where the treatments are applied to
the same subjects such as measure the size of a cancer patient’s tumor before
and after a treatment. This test applies to two paired samples of continuous or
non-continuous data that have distribution non-significantly different from
normal and similar standard deviations (≤ 2 difference)
▪ Independent t-test or unpaired t-test to compare unpaired data samples; this
aims to compare two unpaired data samples and applies to continuous or
non-continuous data that have equal variances and a distribution that is normal
or not significantly different from normal. Sample sizes
should be similar (with ≤ 2 difference) for the two groups and, if n<30,
variances should also be similar (with ≤ 2 difference).
Used when two separate independent and identically distributed variables are
measured such as comparison of mean cholesterol levels in treatment group
with placebo group after administration of test drug or comparing
improvement in quality of life in patients who take drug A and those who
take drug B
This is the most widely used parametric test; a 2-sample t-test is used to establish
whether a difference occurs between the means of 2 similar data sets.
t = difference of group averages / standard error of difference
  = (x̄1 - x̄2) / (Sp x √(1/n1 + 1/n2))
t is the t-value, x̄1 and x̄2 are the means of the two groups being compared, Sp is the
pooled standard deviation of the two groups, and n1 and n2 are the number of
observations in each of the groups.
Pooled variance of the two groups
Sp² = [(n1 - 1) x S1² + (n2 - 1) x S2²] / (n1 + n2 - 2)
S1² – variance for sample 1; S2² - variance for sample 2
Pooled standard deviation of the two groups
Sp = √(pooled variance) = √(Sp²)
A larger t-value shows that the difference between group means is large relative to the
standard error of the difference, indicating a more significant difference between the groups.
Assumptions for two sample t-test
▪ Data values are independent
▪ Simple random sampling was used to generate the sample
▪ Data is normally distributed
▪ Measurements of data are continuous
▪ The variances for the two groups are equal
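Under the equal-variance assumption listed above, the two-sample t-value can be computed from each group's mean, standard deviation and size using the standard pooled-variance formula. The sketch below uses hypothetical summary figures chosen only for illustration; the function name pooled_t is an assumption.

```python
from math import sqrt

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance, plus its degrees of freedom."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se_difference = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)
    t = (mean1 - mean2) / se_difference
    return t, n1 + n2 - 2

# Hypothetical groups: 10 observations each, means 5.0 and 4.0, both with SD 1.0
t_value, df = pooled_t(5.0, 1.0, 10, 4.0, 1.0, 10)
print(f"t = {t_value:.3f} with {df} degrees of freedom")
```

The calculated t is then compared with the critical value from the t-distribution table for n1 + n2 - 2 degrees of freedom.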
F-test
This is based on the F-distribution and is used to compare the variances of two
independent samples. The test is also used in the context of analysis of variance for
judging the significance of differences among more than two sample means at one and
the same time
The test statistic is calculated and compared with its critical value (seen in the
F-ratio tables) for accepting or rejecting the null hypothesis
Analysis of Variance (ANOVA)
ANOVA is used to test hypothesis about the differences between two or more
means. There are two types
One-way ANOVA for unmatched samples
This is simplest type of ANOVA; where only one source of variation is investigated
in completely randomized experiment designs. It is an extension to three or more
samples of t-test procedure in 2 independent samples (unpaired student t-test is a
particular case of one-way ANOVA applied to two data samples) such as test of
effect of quality of several treatments of one cause of variation.
One way ANOVA is used to test the hypothesis that two or more samples are drawn
from the same distribution of values and have the same mean and variance.
Merits
▪ Very simple test to perform
▪ Reduce the experimental error to a great extent as in t-tests
▪ More variables can be investigated and suitable for laboratory experiment.
Two-way ANOVA for two-factor experiments
This type of test is used for experiments with two factors or two attributes or
variation
Non-parametric test
This is suitable for any continuous data and is based on ranks of the data values.
Non-parametric tests do not assume that the data or population follow any particular
distribution. They are used for data that are not normally distributed
Examples of non-parametric and corresponding parametric tests
Parametric                            Non-parametric
2 sample independent t-test           Mann-Whitney U test
Dependent t-test for two samples      Wilcoxon signed-rank test
One way ANOVA                         Kruskal-Wallis one-way
Two way ANOVA                         Friedman two-way
-                                     McNemar’s, Fisher exact test
Pearson’s correlation                 Spearman’s rank correlation
Activity
Distinguish between parametric and non-parametric test
Standard error = s / √n = 0.5 / √100 = 0.05
t = (1.05 - 1.2) / 0.05 = -3
Degrees of freedom: n - 1 = 100 - 1 = 99
From the t-distribution table, at the 5% (0.05) level of significance with 99 degrees
of freedom, the critical value ≈ 1.660
The critical value is 1.660 and the magnitude of the calculated t-value is 3. Since
the magnitude of the calculated test statistic is greater than the
critical value, there is sufficient evidence to reject the null hypothesis.
Standard error = s / √n = 140 / √50 = 19.799
t = (2470 - 2500) / 19.799 = -1.5152
The degrees of freedom associated with the above test statistic are: df = n - 1 =
50 - 1 = 49
The critical t-value at the 0.05 level of significance and with 49 degrees of
freedom is obtained from the t-distribution table as 1.676
At the level of significance α = 0.05, the critical values from the t-distribution
on 49 degrees of freedom are -1.676 and 1.676. The calculated t does not
fall outside these values, hence the null hypothesis cannot be rejected at the
5% level of significance.
Or
The critical value is 1.676 and the magnitude of the calculated t-value is 1.515.
Since the magnitude of the calculated test statistic is less than
the critical value, there is insufficient evidence to reject the null
hypothesis.
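The steps of this worked solution (standard error, t-value, degrees of freedom, comparison with the critical value) can be sketched in Python as follows. The figures are those used in the solution above and the critical value 1.676 is the one read from the t-table in the text; the function name one_sample_t is an assumption.

```python
from math import sqrt

def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """Return the one-sample t statistic and its degrees of freedom."""
    standard_error = sample_sd / sqrt(n)
    t = (sample_mean - hypothesized_mean) / standard_error
    return t, n - 1

t_value, df = one_sample_t(sample_mean=2470, hypothesized_mean=2500, sample_sd=140, n=50)
critical = 1.676  # table value quoted in the text for df = 49 at the 0.05 level

reject_null = abs(t_value) > critical
print(f"t = {t_value:.4f}, df = {df}, reject null hypothesis: {reject_null}")
```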
4. Test the hypothesis that a sample of size n = 25 with mean x = 79 and
standard deviation s = 10 was drawn at random from a population with mean
μ = 75 and unknown standard deviation.
Note: t-test is used since standard deviation of population is unknown
Ho: µ = 75
Note
In the following calculations, one first has to calculate the mean and standard deviation
10. A comparative study of male and female patients reported the following data
on inspiratory vital capacity
         Sample size (n)    Mean    Standard deviation
Men      16                 2.8     0.8
Women    25                 1.9     0.5
Do the patients have the same mean inspiratory vital capacity?
11. Engineers at a pharmaceutical industry wish to compare two new tableting machines
(A and B). 8 equal batches were run on each machine and the
following data obtained. Data in millions of tablets
A    5.1  6.5  3.6  5.5  5.7  4.3  3.8  6.4
B    4.8  6.4  3.1  5.5  5.5  4.4  3.6  5.9
Test, at the 1% level of significance, the hypothesis that the mean production with
machine B is less than the mean production with machine A
12. The following are the weights of patients visiting two pharmacies. Are the
mean weights of the two samples from the two pharmacies similar?
Pharmacy A    55  53  60  71  96  41  49  91
Pharmacy B    51  64  98  80  79  79  58  83
CORRELATION ANALYSIS
Moderately positive correlation: values of the coefficient (r) lie between 0 and +1, i.e. 0 < r < 1. This is the most common type of correlation, for example between temperature and pulse rate. When a scatter diagram is drawn, the points scatter around an imaginary mean line as the values rise
Negative correlation: when the value of one variable decreases as the value of the other increases. There are two subtypes
Perfect negative correlation: when the values of the two variables move in opposite directions in a fixed proportion, i.e. when one rises, the other falls in the same proportion, so the correlation coefficient (r) = –1. When a scatter diagram is drawn, all the observations fall on a single straight line (no scatter) running from one extreme end to the other, because one variable rises and the other falls in a fixed proportion
An example may be temperature and the number of colds in winter
Moderately negative correlation: in the scatter diagram, an imaginary mean line can be drawn through the falling values of the variables. The values of the coefficient (r) lie between –1 and 0, i.e. –1 < r < 0
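The ranges given for the positive and negative subtypes can be summarised in a small helper. This is a minimal sketch: the function name `describe_correlation` and the label used for r = 0 are illustrative choices, not taken from the text.

```python
def describe_correlation(r):
    """Name the type of correlation implied by a coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    if r == 1:
        return "perfect positive"
    if r == -1:
        return "perfect negative"
    if r > 0:
        return "moderately positive"      # 0 < r < 1
    if r < 0:
        return "moderately negative"      # -1 < r < 0
    return "no linear correlation"        # r == 0

for value in (1, 0.6, 0, -0.61, -1):
    print(value, "->", describe_correlation(value))
```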
Linear correlation
Linear correlation exists if the ratio of change in the two variables is constant, i.e. if the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable.
If the values are plotted on a graph, the result is a straight line
Non-linear correlation
This is also referred to as curvilinear correlation. It occurs when the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable
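The contrast between linear and non-linear correlation can be illustrated with made-up numbers. In the sketch below, `pearson_r` is an illustrative helper and both data sets are hypothetical. A relationship with a constant ratio of change gives r = 1, while a perfectly regular but curved relationship can give r = 0, because the coefficient measures only the straight-line component.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # constant ratio of change: a straight line
curved = [x ** 2 for x in xs]      # ratio of change varies: a curve

r_linear = pearson_r(xs, linear)
r_curved = pearson_r(xs, curved)
print(f"linear: r = {r_linear:.2f}, curved: r = {r_curved:.2f}")
```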
Simple correlation
This occurs when correlation is studied between two variables, such as advertising of commodities and sales made, or price and quantities sold
Multiple correlation
This occurs when correlation is considered among three or more variables
simultaneously
Partial correlation
This occurs when one or more variables are kept constant and the correlation or relationship is studied between the other variables
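A minimal sketch of partial correlation for three variables is given below. It uses the standard first-order formula, which is not stated in this handbook and is brought in here as an assumption. The three input coefficients are hypothetical values chosen only for illustration.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y with z held constant (first-order)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical simple correlations among three variables x, y and z
r_xy_given_z = partial_r(r_xy=0.8, r_xz=0.6, r_yz=0.7)
print(f"r(x, y | z) = {r_xy_given_z:.3f}")
```

When z is unrelated to both x and y (both of its correlations are 0), the partial correlation reduces to the simple correlation between x and y.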
DETERMINATION OF CORRELATION COEFFICIENT
Different methods are used to determine the correlation coefficient, such as:
▪ Pearson’s correlation coefficient
▪ Scatter diagrams
▪ Spearman’s Rank correlation
SCATTER DIAGRAMS
This is the simplest method of studying correlation between two variables (x and y)
Drawing a scatter diagram
1. With an appropriate scale, the two variables x and y are taken on the X and Y axes of a graph paper. For each pair of x and y values, mark a dot, giving as many points as there are pairs of observations.
2. Draw the straight line (line of best fit).
3. If all the plotted points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner, correlation is said to be perfectly positive. If only some of the plotted points fall on the rising line, there is medium to weak positive correlation between the variables.
4. If all the plotted points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner of the diagram, correlation is said to be perfectly negative. If only some of the plotted points fall on the falling line, there is medium to weak negative correlation between the variables.
5. If the plotted points lie scattered all over the diagram, there is no correlation between the two variables.
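Step 1 above can be tried without graph paper using a rough text plot. This is only a sketch: the function `ascii_scatter` and the eight data pairs are hypothetical, and a real analysis would normally use a plotting package.

```python
def ascii_scatter(xs, ys, width=30, height=10):
    """Return a text scatter diagram with one '*' per (x, y) pair."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # scale each value to a column or row index on the grid
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"   # row 0 is the top of the plot
    return "\n".join("".join(line) for line in grid)

xs = [1, 2, 3, 4, 5, 6, 7, 8]               # hypothetical x values
ys = [2, 3, 5, 4, 6, 8, 7, 9]               # hypothetical y values
plot = ascii_scatter(xs, ys)
print(plot)
```

The stars rise from the lower left to the upper right without lying exactly on one line, which is the pattern described in step 3 for a medium positive correlation.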
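PEARSON’S CORRELATION COEFFICIENT
$r = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \; \sum (Y - \bar{Y})^2}}$
Or
$r = \dfrac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2}\;\sqrt{n\sum Y^2 - (\sum Y)^2}}$
X – value of X series
Y – value of Y series
n – number of pairs of observations
The above formulas are used to find the correlation coefficient for the given data. Based on the value obtained through these formulas, one can determine how strong the association between the two variables is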
Properties of Correlation Coefficient
r is interpreted using the following properties
1. The value of r ranges from – 1 to 1
2. A positive value of r shows a positive correlation between the two variables
3. A negative value of r shows a negative correlation between the two
variables
4. A value of r = 1 indicates that there exists perfect positive correlation
between the two variables
1. The following data were obtained on the number of cigarettes smoked by nine subjects and the number of years each lived
Sno                    1    2    3    4    5    6    7    8    9
Cigarettes per week    25   35   10   40   85   75   60   45   50
No. of years lived     63   68   72   62   65   46   51   60   55
Calculate the correlation coefficient between the number of cigarettes smoked per week in the last 5 years and the longevity of a test subject
x – the number of cigarettes smoked, y – years lived, n = 9
x      x²      y      y²      xy
25     625     63     3969    1575
35     1225    68     4624    2380
10     100     72     5184    720
40     1600    62     3844    2480
85     7225    65     4225    5525
75     5625    46     2116    3450
60     3600    51     2601    3060
45     2025    60     3600    2700
50     2500    55     3025    2750
∑x = 425, ∑x² = 24525, ∑y = 542, ∑y² = 33188, ∑xy = 24640
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\;\sqrt{n\sum y^2 - (\sum y)^2}} = \dfrac{9 \times 24640 - 425 \times 542}{\sqrt{9 \times 24525 - 425^2}\;\sqrt{9 \times 33188 - 542^2}}$
r = −0.61
This implies a negative correlation between the variables considered, i.e. the higher the number of cigarettes smoked per week in the last 5 years, the fewer the number of years lived. Note that it DOES NOT mean that smoking cigarettes decreases the life span, because many other factors might be responsible for one’s death
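The column totals and the value of r in this example can be checked with a short script. This is a sketch of the raw-score calculation used above, and the function name `pearson_raw` is illustrative.

```python
import math

cigarettes = [25, 35, 10, 40, 85, 75, 60, 45, 50]   # x
years      = [63, 68, 72, 62, 65, 46, 51, 60, 55]   # y

def pearson_raw(xs, ys):
    """Pearson r from raw totals: n, sum x, sum y, sum x^2, sum y^2, sum xy."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy
    denominator = math.sqrt(n * sxx - sx**2) * math.sqrt(n * syy - sy**2)
    return numerator / denominator

r = pearson_raw(cigarettes, years)
print(f"r = {r:.2f}")
```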
2. The following data were obtained from research on the heights (in inches) of fathers and their eldest sons from a village in Fort Portal. Compute the correlation coefficient
Height of father   66   68   63   67   64   69   72   68   70
Height of son      68   72   65   65   65   72   71   67   69
3. Find Karl Pearson’s coefficient of correlation between the values of X and Y. Find the probable error (refer to the next subtopic)
X   46   68   74   72   80   70   90   100   103
Y   67   48   49   37   26   55   55   35    46
4. You are provided with the following data on students’ correct scores and their attitude. Calculate the Pearson correlation coefficient (use the actual mean method)
Correct score   17   13   12   15   16   14   16   16   18   19
Attitude        94   73   59   80   93   85   66   79   77   91
Let correct score be “x” and attitude be “y”
x      y      dx = x − x̄    dy = y − ȳ    dx·dy     dx²      dy²
17     94     1.4           14.3          20.02     1.96     204.49
13     73     -2.6          -6.7          17.42     6.76     44.89
12     59     -3.6          -20.7         74.52     12.96    428.49
15     80     -0.6          0.3           -0.18     0.36     0.09
16     93     0.4           13.3          5.32      0.16     176.89
14     85     -1.6          5.3           -8.48     2.56     28.09
16     66     0.4           -13.7         -5.48     0.16     187.69
16     79     0.4           -0.7          -0.28     0.16     0.49
18     77     2.4           -2.7          -6.48     5.76     7.29
19     91     3.4           11.3          38.42     11.56    127.69
∑x = 156, ∑y = 797, ∑dx·dy = 134.8, ∑dx² = 42.4, ∑dy² = 1206.1
Mean of x = 15.6, mean of y = 79.7
$r = \dfrac{\sum dx \cdot dy}{\sqrt{\sum dx^2 \times \sum dy^2}} = \dfrac{134.8}{\sqrt{42.4 \times 1206.1}} = 0.596$
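The actual mean method worked through in this table can be reproduced as follows. This is a sketch that mirrors the table columns, and the variable names are illustrative.

```python
import math

score    = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]   # x
attitude = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]   # y

n = len(score)
mean_x = sum(score) / n                      # 15.6
mean_y = sum(attitude) / n                   # 79.7

dx = [x - mean_x for x in score]             # deviations of x from its mean
dy = [y - mean_y for y in attitude]          # deviations of y from its mean

sum_dxdy = sum(a * b for a, b in zip(dx, dy))
sum_dx2 = sum(a * a for a in dx)
sum_dy2 = sum(b * b for b in dy)

r = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)
print(f"sum dx.dy = {sum_dxdy:.1f}, sum dx2 = {sum_dx2:.1f}, "
      f"sum dy2 = {sum_dy2:.1f}, r = {r:.3f}")
```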
Where 0.6745 is a constant, ‘r’ stands for the correlation coefficient and ‘n’ for the number of pairs of observations
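The symbols just defined belong to the probable error of r. The formula itself does not appear in the text above, so the sketch below assumes the standard definition, PE = 0.6745 × (1 − r²) / √n. It is applied to r = 0.596 with n = 10 from the score and attitude example, and the function name is illustrative.

```python
import math

def probable_error(r, n):
    """Probable error of a correlation coefficient (assumed standard form)."""
    return 0.6745 * (1 - r**2) / math.sqrt(n)

pe = probable_error(r=0.596, n=10)
print(f"Probable error = {pe:.4f}")
```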
Question 1
The scores of 9 students in biostatistics and research methodology are given in the table below.
Biostatistics 35 23 47 17 10 43 9 6 28
Research 30 33 45 23 8 49 12 4 31
1. Start by ranking the two data sets. Assign rank “1” to the biggest number in the column, “2” to the second biggest number and so forth, so that the smallest value gets the last rank. This is done for both sets of data
2. Create a table of the data with at least 6 columns
3. In a column “d”, enter the difference between the ranks (R1 – R2).
4. In the final column, “d²”, square the d values.
5. Add up all the d² values to obtain ∑d²
6. Insert the values in the formula
Biostatistics   Rank R1   Research   Rank R2   d = R1 – R2   d²
35              3         30         5         -2            4
23              5         33         3         2             4
47              1         45         2         -1            1
17              6         23         6         0             0
10              7         8          8         -1            1
43              2         49         1         1             1
9               8         12         7         1             1
6               9         4          9         0             0
28              4         31         4         0             0
∑d² = 12
$\rho = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 12}{9(81 - 1)} = 0.9$
The Spearman’s rank correlation for these data is +0.9. As mentioned above, a ρ value close to +1 indicates a strong positive association between the ranks, and a value of exactly +1 a perfect association.
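The ranking steps and the formula can be sketched in Python. The research scores below are taken from the worked table, including the value 4 tabulated against rank 9. The helper names are illustrative, and the ranking function assumes there are no tied values.

```python
def rank_descending(values):
    """Rank 1 = largest value. Assumes no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Return (rho, sum of squared rank differences)."""
    r1, r2 = rank_descending(xs), rank_descending(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1)), sum_d2

biostatistics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
research      = [30, 33, 45, 23, 8, 49, 12, 4, 31]   # as in the worked table

rho, sum_d2 = spearman_rho(biostatistics, research)
print(f"sum d2 = {sum_d2}, rho = {rho:.1f}")
```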
Note:
While assigning ranks, if two or more items have equal values (i.e. if a tie occurs), they may be given the mid rank. Thus, if two items share the fifth rank, each is ranked (5 + 6)/2 = 5.5 and the next item in order of size is ranked seventh. If three items share the fifth rank, each is ranked (5 + 6 + 7)/3 = 6 and the next item is ranked eighth
When two or more ranks are equal, the following formula is used for computing rank correlation
$\rho = 1 - \dfrac{6\left[\sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right]}{n(n^2 - 1)}$
Where m stands for the number of equal ranks. The term (m³ – m)/12 is added in the numerator for each group of equal ranks, in both the x and y series
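The mid-rank rule for ties can be sketched as below. The function name `mid_ranks` and both sets of values are hypothetical and chosen only to show two-way and three-way ties.

```python
def mid_ranks(values):
    """Rank 1 = largest value. Tied values share the mean of their positions."""
    order = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = order.index(v) + 1             # first position occupied by v
        count = order.count(v)                 # number of items tied at v
        ranks.append(first + (count - 1) / 2)  # mean of first..first+count-1
    return ranks

two_way = mid_ranks([90, 80, 70, 60, 50, 50, 40])
three_way = mid_ranks([90, 80, 70, 60, 50, 50, 50, 40])
print(two_way)
print(three_way)
```

In the first list the two tied items each receive rank 5.5 and the next item is ranked seventh. In the second, the three tied items each receive rank 6 and the next item is ranked eighth.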
REGRESSION ANALYSIS
If means are not to be calculated, a simple and direct method is adopted as indicated
below:
Question 1
You are provided with the following data on students’ correct scores and their attitude. Calculate the attitude if the student has a score of 11
Score      17   13   12   15   16   14   16   16   18   19
Attitude   94   73   59   80   93   85   66   79   77   91
Let correct score be “x” and attitude be “y”
Y = a + bX
Y – dependent variable, a – Y-intercept (from the graphical representation of the data), b – slope and X – independent variable
$b = r\,\dfrac{SD_Y}{SD_X}$; r – Pearson correlation coefficient
a = y’ – bx’; y’ – mean of the y sample and x’ – mean of the x sample
x      y      dx = x − x’    dy = y − y’    dx·dy     dx²      dy²
17     94     1.4            14.3           20.02     1.96     204.49
13     73     -2.6           -6.7           17.42     6.76     44.89
12     59     -3.6           -20.7          74.52     12.96    428.49
15     80     -0.6           0.3            -0.18     0.36     0.09
16     93     0.4            13.3           5.32      0.16     176.89
14     85     -1.6           5.3            -8.48     2.56     28.09
16     66     0.4            -13.7          -5.48     0.16     187.69
16     79     0.4            -0.7           -0.28     0.16     0.49
18     77     2.4            -2.7           -6.48     5.76     7.29
19     91     3.4            11.3           38.42     11.56    127.69
∑x = 156, ∑y = 797, ∑dx·dy = 134.8, ∑dx² = 42.4, ∑dy² = 1206.1
x’ = 15.6, y’ = 79.7
$r = \dfrac{\sum dx \cdot dy}{\sqrt{\sum dx^2 \times \sum dy^2}} = \dfrac{134.8}{\sqrt{42.4 \times 1206.1}} = 0.596$
$SD_X = \sqrt{\dfrac{\sum dx^2}{n - 1}} = \sqrt{\dfrac{42.4}{10 - 1}} = 2.17$
$SD_Y = \sqrt{\dfrac{\sum dy^2}{n - 1}} = \sqrt{\dfrac{1206.1}{10 - 1}} = 11.57$
$b = r\,\dfrac{SD_Y}{SD_X} = 0.596 \times \dfrac{11.57}{2.17} = 3.18$
a = y’ – bx’ = 79.7 – (3.18 x 15.6) = 30.09
Y = a + bX
Y = 30.09 + (3.18 x 11)
Y = 65.07
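The regression calculation can be checked with the sketch below. It uses the identity b = r × SD Y / SD X = ∑dx·dy / ∑dx², which follows from the definitions of r and the standard deviations given above. The function name `fit_line` is illustrative.

```python
score    = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]   # x
attitude = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]   # y

def fit_line(xs, ys):
    """Return intercept a and slope b of the line Y = a + bX."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_dxdy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_dx2 = sum((x - mean_x) ** 2 for x in xs)
    b = sum_dxdy / sum_dx2          # equal to r * SD_Y / SD_X
    a = mean_y - b * mean_x         # line passes through (mean_x, mean_y)
    return a, b

a, b = fit_line(score, attitude)
predicted = a + b * 11              # attitude for a score of 11
print(f"b = {b:.2f}, a = {a:.2f}, predicted attitude = {predicted:.2f}")
```

Without intermediate rounding the prediction is about 65.08. Small differences from a hand calculation arise only from rounding r, the standard deviations and b along the way.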
Differences between correlation and regression
▪ Correlation gives the degree and direction of the relationship between two variables, whereas regression analysis enables us to predict the values of one variable on the basis of the other variable. Regression is therefore used to study a presumed cause and effect relationship between the two variables
▪ Correlation quantifies the degree to which two variables are associated. Linear regression finds the best line that predicts y from x, but correlation does not fit a line.
▪ Correlation is used when both variables are measured, while linear regression is mostly applied when x, the independent variable, is manipulated.
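One practical consequence of these differences is sketched below using the score and attitude data from the chapter. The correlation coefficient is the same whichever variable is called x, whereas the regression slope depends on which variable is being predicted. The function and variable names are illustrative.

```python
score    = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]
attitude = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]

def deviation_sums(xs, ys):
    """Return sum dx.dy, sum dx^2 and sum dy^2 about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy, sxx, syy

sxy, sxx, syy = deviation_sums(score, attitude)
r = sxy / (sxx * syy) ** 0.5       # unchanged if score and attitude are swapped
slope_y_on_x = sxy / sxx           # line predicting attitude from score
slope_x_on_y = sxy / syy           # line predicting score from attitude

print(f"r = {r:.3f}")
print(f"slope (attitude on score) = {slope_y_on_x:.3f}")
print(f"slope (score on attitude) = {slope_x_on_y:.3f}")
```

The two slopes differ, and their product equals r squared.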