
Biostatistics A4

The document is a handbook on Biostatistics authored by Dr. Mwesigwa Wilson, providing comprehensive training materials and consultancy services for pharmacy professionals. It covers various topics including data types, statistical methods, sampling, study designs, and data collection techniques relevant to health care. The handbook emphasizes the importance of understanding statistics in evaluating medical research and improving health care delivery.


MPHARMA

TRAINING SOLUTIONS
All Pharmacy Training Materials and Consultancy Services
P. O. BOX 51, Fort Portal - Uganda
Copyright ©2021: By Mwesigwa Wilson
Title: Biostatistics
All rights reserved
No part of this book may be reproduced or transmitted in any
form by electronic, or mechanical including photocopying or
recording without written permission of the copyright from the
author except for the use as quotation

ISBN 978-9913-619-47-9

Published and Printed by


MPharma Training Solutions Ltd
P.O BOX 51, Fort Portal - Uganda
+256 784 222 648 (WhatsApp) / +256 703 784 918
[email protected]

Disclaimer
The information in the handbook has been researched to ensure
it’s accurate. Always refer to standard available textbooks.
However, the writer and editor may not be held responsible for
any errors and omissions in the handbook
BIOSTATISTICS
A simplified handbook
2021 Edition

AUTHOR
Dr. Mwesigwa Wilson (B Pharm)
Bachelors Degree of Pharmacy – KIU
Postgraduate Diploma in Monitoring & Evaluation – MMU
Certificate in Medical Informatics

Presented by

MPHARMA
TRAINING SOLUTIONS
All Pharmacy Training Materials and Consultancy Services
P. O. BOX 51, Fort Portal - Uganda

TABLE OF CONTENTS

Introduction to Biostatistics

Variables and data in Biostatistics
    Variables
    Data
    Classification of data sources
    Data collection methods
    Data presentation

Measure of central tendency
    The mean
    The median
    The Mode

Measures of location
    Percentiles
    Deciles
    Quartile

Measures of dispersion
    Range
    Interquartile range
    Mean Deviation
    Standard Deviation
    Coefficient of Variation

Measures of disease frequency

Sampling
    Sampling methods
    Sample size determination

Study designs
    Types of study designs
    Blinding
    Bias and Confounding

Normal Distribution Curve
    Skewness and Kurtosis

Permutation and Combinations
    Permutations
    Combinations

Introduction to Probability
    Definitions
    Laws of probability
    Probability tree diagrams
    Conditional probability: Bayes’ Theorem
    Binomial Probability Distribution

Hypothesis and hypothesis testing
    Hypothesis
    Hypothesis testing
    Tests for significance
    Examples in hypothesis testing

Correlation and regression
    Correlation Analysis
    Regression Analysis

References



INTRODUCTION TO BIOSTATISTICS

A statistic is a measured or counted fact or piece of information stated as a number
or figure, such as weight or height.
Statistics is the methodology for collecting, analyzing, interpreting and drawing
conclusions from information. Alternatively, statistics is the science of compiling,
classifying and tabulating numerical data and expressing the results in a
mathematical or graphical form.
Biostatistics is the branch of statistics concerned with data derived from the
biological sciences, or with mathematical facts and data related to biological events,
such as in medicine. It applies statistical methods to medical and biological problems.
In medicine, research, diagnosis and treatment all depend on measurement or counting.
Medical students should not depend entirely on a statistician for statistical analysis
of biological data; instead, they should apply the knowledge gained to interpret
results, or work alongside a statistician.
Health care emphasizes research-based practice, using treatments which empirical
evidence has demonstrated to be effective. This requires health care practitioners to
understand statistics in order to make sense of research studies.
Relevance of statistics to health care delivery
For many health care professionals, statistics may seem to have little connection to
patient care issues and concerns. In practice, statistics are relevant in several ways:
▪ Knowledge of statistics helps medical professionals evaluate studies that
assess the efficacy of treatments and interventions.
▪ Statistics in health care convey valuable information about the health /
welfare of a society.
▪ Statistics help to assess the underlying state of population health by
diagnosing the community
▪ Statistics help to know if health legislation and policies have been
implemented
▪ To find association between two attributes such as lung cancer and smoking
▪ Provide scientific evidence that supports new medical advances, medicines,
medical devices, treatments, procedures etc.
▪ Statistical techniques are used to make many decisions regarding health care
such as identifying signs and symptoms, prevalence and incidence
Branches/ types of Statistics
There are two major types of statistics.
Descriptive statistics
The branch of statistics involving the summarization and presentation of data.
Examples include
✓ Measures of frequency: Count, Percent, Frequency
✓ Measures of central tendency; Mean, Median, and Mode
✓ Measures of Dispersion or Variation: identify the spread of scores by
stating intervals, e.g. Range, Variance, Standard Deviation
✓ Measures of Position; Percentile Ranks, Quartile Ranks

Inferential statistics
The branch of statistics concerned with using sample data to make inferences about
the population from which the sample was drawn. It is used to generalize from
samples to populations, to test hypotheses and to determine any association between
variables.

VARIABLES AND DATA IN BIOSTATISTICS

VARIABLES

A variable is a measurable characteristic of a population that differs from one member
to another. A variable is any entity that can take on different values. In research,
this refers to measurable characteristics, qualities, traits, or attributes of a particular
individual, object or situation being studied. Variables can be:
Qualitative variable - characteristic which cannot be measured in quantitative form
but can only be identified by name or categories such as ethnic group, gender,
country of birth, degree of pain (minimal, moderate, severe), eye colour
Quantitative variable - one that can be measured and expressed numerically and it
can be either discrete or continuous.
The values of a discrete variable are usually whole numbers, such as the number of
episodes of diarrhea or number of children. A continuous variable is a measurement
on a continuous scale. Examples include weight, height, blood pressure, age,
business income and expenses, capital expenditure, class grades etc
Types of variables
▪ Dependent variable: sometimes called the outcome variable, this is the variable
used to describe or measure the problem or observation under study. It
represents the outcome or output that is checked for an effect.
▪ Independent variable: sometimes called the experimental or predictor variable,
this is the variable that is manipulated in an experiment to observe its effect on
the dependent variable.
Measuring variables: To establish relationships between variables, researchers must
observe the variables and record their observations. This requires that the variables
be measured.
The process of measuring a variable requires a set of categories called a scale of
measurement. The four scales are:
1. A nominal scale is an unordered set of categories identified only by name.
Nominal measurements only permit determination whether two individuals
are the same or different.
2. An ordinal scale is an ordered set of categories; describes the direction of
difference between two individuals.
3. An interval scale is an ordered series of equal-sized categories. Interval
measurements identify the direction and magnitude of a difference.
4. A ratio scale is an interval scale where a value of zero indicates none of the
variable. Ratio measurements identify the direction and magnitude of
differences and allow ratio comparisons of measurements.

DATA

Data are the values of observations recorded for variables on one or more
observational units. Data can be quantitative or qualitative.
Types of data
▪ Categorical (Qualitative) data
▪ Numerical (Quantitative) data - Discrete or continuous
▪ Based on their mathematical properties, data are divided into four groups:
Nominal, Ordinal, Interval, Ratio
Categorical (Qualitative) data
Qualitative data comprise characteristics which cannot be expressed
numerically, such as gender, religion etc. They are subdivided into 3 types:
Binary data: The variables or characteristics are divided into mutually exclusive
categories such as gender (male or female), diseased/not diseased, alive /dead
Nominal data: nominal means name and count. These are variables with more than two
categories where order does not matter. The data are alphabetic, or numerical in name
only, and are divided into more than two mutually exclusive categories that have no
order or direction. Their use is restricted to keeping track of people, objects and
events. They are the least powerful form of measurement, with no arithmetic
origin, order, direction or distance relationship.
Examples include blood groups (O, A, B and AB), marital status (single, married,
divorced, widowed), employment status (self-employed, public employee, unemployed) etc.
Ordinal data: the variables are ordered or ranked, as in rankings or scaling. Examples
include a Likert scale (agree, neither agree nor disagree, disagree), a level of
knowledge (excellent, good, average, poor), a quality rating of a
service or product, etc.
Quantitative or numerical data
This is data that can be expressed numerically, like age, temperature, height etc. It is
classified into two types:
Discrete data can take only certain values, usually whole numbers, such as the
number of students in a class. Discrete data ‘jumps’ from one value to another
and does not take any intermediate value between them.
Continuous data can take any numerical value within a range, for example weight or
height, with an infinite number of possible values. In continuous data, there are
no gaps in the values of the variable.
Note:
▪ Interval data: the intervals between values are the same, for example in the
Fahrenheit temperature scale
▪ Ratio data: the data values have meaningful ratios, for example a value of
30 is one and a half times a value of 20

CLASSIFICATION OF DATA SOURCES

Data sources are broadly classified into: Primary and secondary data sources
Primary data
Primary data means original data that are collected by the investigator for the purpose
of the study. The data are original in character, are mostly generated by surveys,
laboratory and experimental methods, and have not yet been published. The data are more
reliable and accurate since they are first-hand information from the research investigator.
Merits
▪ The investigator collects data specific to the objective under study
▪ Data interpretation is better since the targeted characteristics of the data are
collected
▪ The quality of the data collected is high
Demerits
▪ High cost of obtaining the data
▪ Time consuming, as it involves collecting the data, compiling it and then
analyzing it
▪ There may be bias in data collection
Secondary data
This is when the investigator uses data that have already been collected and are readily
available from other sources. These can be divided into internal sources (within the
organization, such as reports) and external sources (outside the organization),
such as data obtained from journals, reports, publications, compilations from
computerized databases and information systems etc.
Merits: Less expensive and saves time
Demerits
▪ Missing data can affect the results
▪ There may be errors
▪ Sample size may be inadequate

DATA COLLECTION METHODS

Data collection allows us to systematically collect data from the population about a
characteristic under study.
Importance of data and data collection
▪ Data is one of the most important and vital aspects of any research study.
Research conducted in different fields of study can differ in
methodology, but all research is based on data which is analyzed and
interpreted to get information.
▪ Data is the basic unit in statistical studies.
▪ Statistical information like census, population variables, health statistics, and
road accidents records are all developed from data

Activity
✓ Explain the factors considered when selecting a data collection method (cost, human
resource, analysis, type of data collected etc.)

Types of data collection methods


▪ Observation
▪ Questionnaires
▪ Face-to-face interviews
▪ Postal or mail method and telephone interviews
▪ Available information
▪ Experiments
▪ Focus group discussions (FGD)
Questionnaires
This is the most commonly used method of data collection. A questionnaire is a list
of questions, either open-ended or closed-ended, to which the respondent gives
answers. Questionnaires may be structured or unstructured depending on the
type of data to be collected.
▪ Open-ended questions permit free responses that should be recorded in the
respondent’s own words
▪ Closed questions offer a list of possible options or answers from which the
respondents must choose.
Questionnaires can be administered face to face, by telephone, by mail, live in a public
area, in an institution, or by other methods.
Steps in designing questionnaires
1. Determine the content of questionnaires from objectives
2. Formulate the questions
3. Sequence the questions
4. Format the questions
5. Test the questions
6. Re-edit the questions after testing
7. Translate the questions
8. Roll out the questions to the respondents
In interviewing using a questionnaire, the investigator appoints agents or research
assistants who go to the respondents personally with the questionnaire, ask them the
questions given therein, and record their replies. Interviews can be either face-to-face
or by telephone. The questionnaires can be physical (on paper) or electronic, on
electronic gadgets such as PCs and tablets or in specific software.
Merits
▪ Questions which are not understood can easily be explained
▪ Creates rapport and interest during data collection
▪ Observations can be made as well
Demerits: Expensive to hire assistants and print the questionnaires

Mailed questionnaires
The questionnaires are prepared by the investigator and sent by post or electronically
to the respondents, who provide their replies and return or send back the fully filled
questionnaire.

Observations
This is a technique that involves systematically selecting, watching and recording
the behaviour of subjects or any other characteristic under study for the purpose of
gaining specified information. Different methods are used, such as simple visual
observation or the use of a camera or other equipment (radiographic, x-ray, microscope).
Observation may be done while letting the observed person know that they are being
observed, or without letting them know they are under observation.
Merits: Provides more accurate information on behavior
Demerits
▪ The observer's own bias or desires affect the quality of data
▪ Needs more resources such as skilled labor
▪ Expensive if sophisticated equipment is used
Experiments
They are performed in controlled areas with controlled conditions, such as
laboratories. The data are collected according to the specific objectives and later analyzed.
Use of available data sources
▪ Records from hospitals
▪ International Publications like Publications by WHO, World Bank, UNICEF
▪ Publication of Ministry of Health and Other Ministries
▪ Published printed sources: There are varieties of published printed sources.
Their credibility depends on many factors. For example, on the writer,
publishing company and time and date when published.
▪ Books: Books are available today on any topic that you want to research.
After selection of a topic, books provide insight into how much work has
already been done on the same topic, so that you can prepare your literature
review.
▪ Journals/periodicals: Journals provide up-to-date
information, which at times books cannot, and they can give
information on the very specific topic you are researching rather
than on more general topics.
▪ Magazines/Newspapers: Magazines are also effective but not very reliable.
▪ General websites (Google): websites may or may not contain
reliable information.
Common barriers in data collection

▪ Language barriers
▪ Expense
▪ Lack of adequate time
▪ Cultural norms
▪ Bias
▪ Inadequately trained and experienced staff

DATA PRESENTATION

Data collected should be presented in such a way as to be easily understood by the
reader, to enable conclusions to be drawn simply by looking at the summarized data.
This helps in further statistical analysis where necessary.
The methods used in data presentation depend on the type of data from the sample.
They include tabulation and graphs or drawings.
Tabulation
The most common method of data presentation is in the form of tables or frequency
tables. Tables enable a simple and concise presentation of data. Basic
principles in tabulation:
▪ Each table must have a concise and self-explanatory comprehensive title
▪ Tables should be numbered
▪ Tables must be formatted with an appropriate number of rows and columns.
Table column headings at the top and row classifiers on the left must be clear
and concise
▪ The frequency or number drawn from the sample must be clearly written
▪ Tabulation of frequency may be used for quantitative and qualitative data
Table 1: Table showing home district of patients
District Number
Hoima 20
Kyenjojo 12
Kagadi 5
Mubende 3
Kabarole 28
Table 2: Table showing conditions handled at a pharmacy
Condition Frequency
Hypertension 23
Malaria 52
Cough 43
Flu 22
Diabetes 20
Urinary tract infections 84
Skin disease 10
Table 3: Data of weight of patients at a clinic over a month
Class Frequency
5-14 10
15-24 15
25-34 19
35-44 5
45-54 30

Graphical or drawings
Data can be presented in the form of graphs and diagrams or pictorial representations.
Graphs are used for quantitative data to provide an at-a-glance view of the data
for easier interpretation.
Types of graphs include
▪ Histogram
▪ Line chart or graph
▪ Scatter or dot diagram.
Presentation of qualitative, discrete or counted data is through diagrams such as
▪ Bar diagram
▪ Pie chart or sector diagram
▪ Map diagram or spot map; show geographical distribution of data on the map
of the location where data was collected
▪ Pictogram or picture diagram
Histogram
The histogram represents the frequency distribution for quantitative data. The
different groups (variable characters) are indicated on the horizontal line (x-axis)
while the frequency (number of observations) is marked on the vertical line (y-axis). It
comprises adjacent bars, in the form of columns or rectangles, representing the data or
number of observations. The height of each rectangle indicates the frequency.

[Figure: Common conditions handled at the pharmacy (frequency plotted against conditions)]

For grouped data

Interval   Frequency   Cumulative frequency   Mid-point
5-9        9           9                      7
10-14      7           16                     12
15-19      10          26                     17
20-24      13          39                     22
25-29      7           46                     27
30-34      14          60                     32
35-39      8           68                     37
40-44      5           73                     42
45-49      6           79                     47
           ∑=79
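As an illustration only, the cumulative frequency and mid-point columns of the grouped-data table can be derived from the class limits and frequencies. The sketch below is a minimal Python version; the variable names are chosen here for illustration and are not part of the handbook.

```python
from itertools import accumulate

# Class limits and frequencies copied from the grouped-data table.
intervals = [(5, 9), (10, 14), (15, 19), (20, 24), (25, 29),
             (30, 34), (35, 39), (40, 44), (45, 49)]
frequencies = [9, 7, 10, 13, 7, 14, 8, 5, 6]

# A running total of the frequencies gives the cumulative frequency column.
cumulative = list(accumulate(frequencies))

# The mid-point of a class is the average of its lower and upper limits.
midpoints = [(low + high) / 2 for low, high in intervals]

for (low, high), f, cf, m in zip(intervals, frequencies, cumulative, midpoints):
    print(f"{low}-{high}\t{f}\t{cf}\t{m:g}")
```

The last cumulative frequency should always equal the total number of observations, which is a quick check that the table has been built correctly.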
Frequency polygon
This is developed over histogram by joining the mid-points of class intervals at
height of frequencies by straight lines forming a polygon (figure of many angles)

Cumulative frequency diagram or ‘Ogive’


This is a graph of the cumulative relative frequency distribution, for which the
ordinary frequency distribution table of quantitative data has to be converted into a
cumulative frequency table as shown above

Line chart or graph


A line graph is used to show the trend in a variable over time, or to represent
continuous data. Time is placed on the horizontal axis and the variable measured on
the vertical axis, with points connected using line segments. The table below
shows population statistics of Uganda

Bar graph
This is used to represent categorical data and comprises nonadjacent bars which
can be vertical or horizontal. The length of the bars
indicates the frequency of a character. For example, the marital status of respondents
was as follows
Status Number Percentage /%
Single 50 31.1
Married 86 53.4
Divorced 10 6.2
Widowed 15 9.3

Note: there can be multiple bars on the chart depending on the data being plotted
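The percentage column in the marital status table is each count divided by the total number of respondents. A minimal Python sketch of that arithmetic follows; the dictionary name is an assumption made for illustration.

```python
# Counts copied from the marital status table.
counts = {"Single": 50, "Married": 86, "Divorced": 10, "Widowed": 15}

total = sum(counts.values())  # total number of respondents

# Percentage = (count / total) x 100, rounded to one decimal place.
percentages = {status: round(100 * n / total, 1) for status, n in counts.items()}

for status, pct in percentages.items():
    print(f"{status}: {counts[status]} ({pct}%)")
```

The same percentages give the relative bar lengths in a bar graph or the sector sizes in a pie chart.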

Pie charts
Pie chart depicts frequency distribution of categorical data in a circle (the “pie”),
with the sectors of the circle proportional in size to the frequencies in the respective
categories. Pie charts can be made highly attractive, by using color and three-
dimensional design enhancements, but become cumbersome if there are too many
categories

Scatter plot
This is a graphic presentation, made to show the nature of correlation between two
variable characters in the same sample or person such as hours and score from an
exam, weight and height etc. The data below shows hours spent studying and score
in biostatistics

Pictogram
This represents quantity by presenting stylized pictures or icons of the variable
being depicted – the number or size of the icon being proportional to the frequency.
When comparing between groups using a pictogram, it is preferable that same-sized
icons be used across groups (with their numbers varying) – otherwise the picture
may be misleading. Pictograms are more often used in mass media presentations
than in serious biomedical literature.
Stem-and-leaf plot or stem plot
This is a sort of mixture of a diagram and a table. It has been devised to depict
frequency distribution, as well as individual values for numerical data. This is a
simple way to order and also display the data.
The data values are examined to determine the first significant digits (the “stem”
item) on the left and their last significant digit (the “leaf” item) to the right.
The stem items are usually arranged in ascending or descending order vertically, and
a vertical line is usually drawn to separate the stem from the leaf. The number of
leaf items should total up to the number of observations. However, it becomes
cumbersome with large data sets.
Question 1
The following shows a set of data on scores in biostatistics. 56, 55, 48, 78, 82, 90,
93, 66, 67, 69, 74, 79, 64, 92, 88, 66, 45, 74, 64, 58, 73, 40, 83, 84, 77, 88.
Construct a stem and leaf
Step 1: In order to create a stem and leaf plot, organize the data into groups
40, 45, 48
55, 56, 58
64, 64, 66, 66, 67, 69
73, 74, 74, 77, 78, 79
82, 83, 84, 88, 88
90, 92, 93
Step 2: Create the plot with the stems as the tens and the leaves as the ones.
Stem Leaves
4 0, 5, 8
5 5, 6, 8
6 4, 4, 6, 6, 7, 9
7 3, 4, 4, 7, 8, 9
8 2, 3, 4, 8, 8
9 0, 2, 3
KEY: 4│0 = 40%
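The two steps for Question 1 can also be carried out programmatically. The sketch below is one possible Python version, assuming two-digit whole-number scores; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Scores copied from Question 1.
scores = [56, 55, 48, 78, 82, 90, 93, 66, 67, 69, 74, 79, 64,
          92, 88, 66, 45, 74, 64, 58, 73, 40, 83, 84, 77, 88]

def stem_and_leaf(values):
    """Group values by tens digit (stem) and collect the ones digits (leaves)."""
    plot = defaultdict(list)
    for value in sorted(values):          # Step 1: order the data
        stem, leaf = divmod(value, 10)    # Step 2: split into tens and ones
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(stem, "|", ", ".join(str(leaf) for leaf in plot[stem]))
```

Counting the leaves is a useful check: their total should equal the number of observations, which is 26 for this data set.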
Question two: You are provided with the following data; draw the stem and leaf
diagram:
3.7, 5.4, 4.2, 0.5, 3.2, 0.7, 3.6, 1.3, 5.3, 3.9, 3.1, 2.5, 3.6, 1.6, 1.9, 3.6, 0.6, 0.5, 4.6,
4.9
Box plot
A box plot displays the five-number summary of a set of data. The five-number
summary is the minimum, first quartile, median, third quartile, and maximum. It is
also referred to as a whisker plot. The box is drawn from the first quartile to the
third quartile. A vertical line goes through the box at the median. The whiskers go
from each quartile to the minimum or maximum.
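As a rough illustration, the five-number summary behind a box plot can be computed with Python's standard `statistics` module. The data reused here are the biostatistics scores from the stem-and-leaf question. Note the assumption: several conventions exist for locating quartiles, and the `method="inclusive"` option used below is only one of them, so the quartiles may differ slightly from values obtained by hand with another rule.

```python
import statistics

# Scores reused from the stem-and-leaf example (Question 1).
scores = [56, 55, 48, 78, 82, 90, 93, 66, 67, 69, 74, 79, 64,
          92, 88, 66, 45, 74, 64, 58, 73, 40, 83, 84, 77, 88]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

summary = five_number_summary(scores)
print("min, Q1, median, Q3, max =", summary)
```

The box would span Q1 to Q3 with a line at the median, and the whiskers would extend to the minimum and maximum.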

MEASURE OF CENTRAL TENDENCY

A measure of central tendency (also referred to as a measure of centre or central
position) is a summary measure that attempts to describe a whole set of data with a
single value that represents the average or central value of its distribution. It is a
single estimate of a series of data that summarizes the data and facilitates comparisons
between data.
Measures of central tendency locate the center or midpoint of a distribution.
The average value of a characteristic is the one central value around which all other
observations are dispersed. It gives a measure of the concentration of all
other observations around the central value.
Types of measure of central tendency
▪ The mode - based on frequency
▪ The median - positional estimate
▪ The mean (arithmetic mean) - mathematical estimate

THE MEAN

The mean is the arithmetic average, which is the sum of the value of each
observation in a dataset divided by the number of observations. The mean is the most
commonly used measure in statistical methods.
Properties of mean
▪ It is very easy to compute
▪ It may easily be affected by extreme scores
▪ It may not be an actual observation in the data set.
▪ Sum of each score’s distance from the mean is zero.
▪ It can be applied to interval level of measurement.
▪ It measures stability. Mean is the most stable among other measures of
central tendency because every score contributes to the value of the mean.
Limitations of the mean
▪ As the mean includes every value in the distribution, the mean is influenced
by outliers / extreme values
▪ The mean cannot be calculated for categorical data, as the values cannot be
summed.
Calculations of mean for ungrouped data
A series of observations is indicated by the letter X and the individual
observations by X1, X2, …, Xn, where n is the number of observations.
$$\text{Mean} = \frac{\text{Total sum of the observations}}{\text{Number of observations}} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum X}{n}$$
1 Here are the scores (x): 25, 20, 18, 17, 15, 14, 13, 15, 16, 12, 10. Find the mean of
these scores.
$$\text{Mean} = \frac{\sum X}{n} = \frac{25+20+18+17+15+14+13+15+16+12+10}{11} = \frac{175}{11} = 15.9$$
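The worked example can be checked with a few lines of Python. This is only a sketch of the arithmetic; it also illustrates the property listed earlier that the deviations of the scores from the mean sum to zero (up to floating-point rounding).

```python
# Scores copied from the worked example.
scores = [25, 20, 18, 17, 15, 14, 13, 15, 16, 12, 10]

total = sum(scores)   # sum of the observations
n = len(scores)       # number of observations
mean = total / n

print(f"sum = {total}, n = {n}, mean = {mean:.1f}")

# Property check: the deviations from the mean add up to (approximately) zero.
deviation_sum = sum(x - mean for x in scores)
print(f"sum of deviations = {deviation_sum:.10f}")
```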

Mean for grouped data


▪ Grouped data are data or scores that are arranged in a frequency
distribution.
▪ Frequency is the number of observations falling in a category. A frequency
distribution is the arrangement of scores according to category or class,
including the frequency.
▪ The formula used to find the mean of grouped data is called the midpoint
method.
$$\text{Mean} = \frac{\sum fm}{n}$$
m = midpoint of each class or category
f = frequency in each class or category
Σfm = summation of the products of f and m
n = total number of observations (Σf)

THE MEDIAN

The median is the middle value in a distribution when the values are arranged in
ascending or descending order.
The median is therefore a better indicator of central value when one or more of the
lowest or the highest observations are wide apart or not evenly distributed.
The median is used when the exact midpoint of the score distribution is desired, and
also when there are extreme scores in the distribution.
Properties of the Median
▪ It may not be an actual observation in the data set; especially if two figures
lie in the middle
▪ It is not affected by extreme values because median is a positional measure.
▪ It can be applied in ordinal level.
Median of ungrouped data
▪ Arrange the scores (from lowest to highest or highest to lowest)
▪ Determine the middle most score in a distribution if n is an odd number and
get the average of the two middle most scores if n is an even number.
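The two steps above can be sketched in Python as follows. The odd-sized list reuses the eleven scores from the mean example; the even-sized list is made up for illustration by dropping the lowest score, and the function name `median` is chosen here, not taken from the handbook.

```python
def median(values):
    """Median of ungrouped data: sort, then take the middle score(s)."""
    ordered = sorted(values)            # arrange from lowest to highest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                      # odd n: the single middle score
        return ordered[middle]
    # even n: average of the two middle scores
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_scores = [25, 20, 18, 17, 15, 14, 13, 15, 16, 12, 10]   # n = 11
even_scores = [25, 20, 18, 17, 15, 14, 13, 15, 16, 12]      # n = 10 (illustrative)

print(median(odd_scores))   # middle (6th) value of the sorted scores
print(median(even_scores))  # average of the 5th and 6th sorted values
```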
Median of grouped data
$$\text{Median} = LB + \frac{\frac{n}{2} - cfp}{fm} \times c.i$$

Where,
LB = lower boundary of the median class (MC)
cfp = cumulative frequency before the median class if the scores are arranged from
lowest to highest value
fm = frequency of the median class
c.i = size of the class interval

THE MODE

The mode is the most commonly occurring value in a data set distribution. A
distribution is classified as unimodal, bimodal, trimodal or multimodal.
▪ Unimodal is a distribution of scores that consists of only one mode.
▪ Bimodal is a distribution of scores that consists of two modes
▪ Trimodal is a distribution of scores that consists of three modes
▪ Multimodal is a distribution of scores that consists of more than two modes.
Limitations of the mode
▪ In some distributions, the mode may not reflect the centre of the distribution
very well
▪ There may be more than one mode for the same distribution of data
▪ For continuous data, the distribution may have no mode at all (i.e. if all
values are different).
For ungrouped data
Consider this dataset showing the retirement age of 12 people, in whole years: 54,
55, 54, 56, 56, 57, 57, 58, 57, 58, 60, 60. The most commonly occurring value is 57
(it occurs three times), therefore the mode of this distribution is 57 years
For grouped data
\[ \text{Mode} = LB + \left( \frac{d_1}{d_1 + d_2} \right) \times c.i \]
Where,
LB = lower boundary of the modal class
Modal class (MC) = the category containing the highest frequency
d1 = difference between the frequency of the modal class and the frequency of the
class above it (the preceding class), when the scores are arranged from lowest to highest.
d2 = difference between the frequency of the modal class and the frequency of the
class below it (the following class), when the scores are arranged from lowest to highest.
c.i = size of the class interval

1 Calculate the mean, median and mode of the following table

Class interval   Boundary     f    cf   Midpoint   fm
5-9              4.5-9.5      9    9    7          63
10-14            9.5-14.5     7    16   12         84
15-19            14.5-19.5    10   26   17         170
20-24            19.5-24.5    13   39   22         286
25-29            24.5-29.5    7    46   27         189
30-34            29.5-34.5    14   60   32         448
35-39            34.5-39.5    8    68   37         296
40-44            39.5-44.5    5    73   42         210
45-49            44.5-49.5    6    79   47         282
                              ∑=79                 ∑=2028

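\[ \text{Mean} = \frac{\sum f m}{n} \]
\[ \text{Median} = LB + \left( \frac{\frac{n}{2} - cfp}{fm} \right) \times c.i \]
\[ \text{Mode} = LB + \left( \frac{d_1}{d_1 + d_2} \right) \times c.i \]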
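
The three grouped-data formulas can be checked with a short script. The following is a minimal sketch in Python, assuming the class boundaries and frequencies of the table in question 1; the function names are illustrative and not part of the handbook.

```python
# Grouped-data mean, median and mode for the table in question 1.
# Each class is (lower boundary, upper boundary, frequency).
classes = [
    (4.5, 9.5, 9), (9.5, 14.5, 7), (14.5, 19.5, 10),
    (19.5, 24.5, 13), (24.5, 29.5, 7), (29.5, 34.5, 14),
    (34.5, 39.5, 8), (39.5, 44.5, 5), (44.5, 49.5, 6),
]

def grouped_mean(classes):
    # Mean = sum(f * midpoint) / n
    n = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

def grouped_median(classes):
    # Median = LB + ((n/2 - cfp) / fm) * c.i
    n = sum(f for _, _, f in classes)
    cf = 0  # cumulative frequency before the current class
    for lo, hi, f in classes:
        if cf + f >= n / 2:
            return lo + (n / 2 - cf) / f * (hi - lo)
        cf += f

def grouped_mode(classes):
    # Mode = LB + (d1 / (d1 + d2)) * c.i, using the first class with the highest frequency
    freqs = [f for _, _, f in classes]
    k = freqs.index(max(freqs))
    lo, hi, f = classes[k]
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k < len(freqs) - 1 else 0
    d1, d2 = f - before, f - after
    return lo + d1 / (d1 + d2) * (hi - lo)

print(round(grouped_mean(classes), 2))    # 25.67
print(round(grouped_median(classes), 2))  # 24.86
print(round(grouped_mode(classes), 2))    # 32.19
```

Here the median class is 25-29 (the first class whose cumulative frequency reaches n/2 = 39.5) and the modal class is 30-34 (frequency 14).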

2 In a study by MPharma Training Solutions on the age bracket of pharmacy
owners in Western Uganda, the following data was obtained
Age Frequency
20 - 29 17
30 – 39 45
40 – 49 56
50 – 59 24
60 – 69 35
70 - 79 16
Use the data to calculate
a) Mean age
b) Median age
c) Modal age

3 The results show the age of patients in years attended to at the pharmacy.
Age Frequency
0-9 15
10 - 19 26
20 - 29 23
30 - 39 13
40 - 49 17
50 - 59 34
60 - 69 19
70 - 79 8
80 - 89 11
a) Calculate the mean, median and mode
b) Draw a frequency distribution table

4 The grouped results below show the height in centimeters of students at a certain
pharmacy school in Uganda. Calculate the mean, median and mode
Length (cm) Frequency
130 - 134 28
135 - 139 33
140 - 144 20
145 - 149 45
150 - 154 16
155 - 159 32
160 - 164 13
165 - 169 2
170 - 174 4

MEASURES OF LOCATION

Measures of central value locate the center or midpoint of a distribution of data in a
series. Measures of location are used to locate other points of interest in the data set
Examples include
▪ Percentiles
▪ Deciles
▪ Quartiles
▪ Quintiles

PERCENTILES

Percentiles indicate the value below which a certain percentage of observations lie.
Percentiles split the data into 100 equal parts, i.e., hundredths
For instance, the 40th percentile splits the data into the lower 40% of the values and
the upper 60% of the values. Percentiles are one way of describing the variability
within a data set of continuous data such as height, age etc. The 50th percentile
corresponds to the median
Percentiles are the values in a series of observations, arranged in ascending order of
magnitude, which divide the distribution into 100 equal parts
To determine the percentile rank of x in a data set
\[ \text{Percentile rank of } x = \frac{\text{number of values below } x}{n} \times 100 \]
where n is the number of values in the data set
To determine the position of the value existing at a certain percentile
\[ \text{Position} = \left( \frac{\text{percentile} \times (n+1)}{100} \right)^{\text{th}} \text{ value} \]

1. You are provided with the following data set: 2, 3, 4, 6, 7, 8, 9, 10, 11 and 12.
What is the percentile ranking of 7?
Note: If the data set is not arranged in ascending order, first arrange the values.
Number below 7 = 4, n = 10
Percentile ranking of 7 = (4 / 10) x 100 = 40%
2. From the same data set, what value exists at percentile ranking 60%?
Position of value at 60% = 60 (10 + 1) / 100 = 6.6th number
2, 3, 4, 6, 7, 8, 9, 10, 11, 12
Value at 60% = 6th + 0.6 (7th – 6th); 7th number = 9, 6th = 8
Value at 60% = 8 + 0.6 (9 - 8) = 8 + 0.6 = 8.6
3. You are provided with the following data set: 20, 5, 8, 5, 9, 17, 15, 11, 9, 10, 3
and 6. Find the value at percentile 40, 65 and 75
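
The two percentile calculations in examples 1 and 2 can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and clamping positions that fall outside the data is an added assumption that the handbook does not discuss.

```python
def percentile_rank(data, x):
    """Percentage of the values in data that lie strictly below x."""
    below = sum(1 for v in data if v < x)
    return below / len(data) * 100

def value_at_percentile(data, p):
    """Value at the p-th percentile using the p(n + 1)/100 position rule."""
    ordered = sorted(data)              # arrange in ascending order first
    n = len(ordered)
    pos = p * (n + 1) / 100             # 1-based position, possibly fractional
    pos = min(max(pos, 1), n)           # assumption: clamp to the range 1..n
    whole = int(pos)
    frac = pos - whole
    if whole == n:
        return ordered[-1]
    # interpolate between the whole-th and (whole + 1)-th values
    return ordered[whole - 1] + frac * (ordered[whole] - ordered[whole - 1])

data = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]
print(percentile_rank(data, 7))                   # 40.0
print(round(value_at_percentile(data, 60), 1))    # 8.6
```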

DECILES

Deciles are a form of percentiles that split the data up into groups of 10%, meaning
every decile group contains 10% of the data.
The first, second, ..., ninth deciles are denoted by D1, D2, ..., D9 respectively. The
fifth decile corresponds to the median
The second, fourth, sixth and eighth deciles, which collectively divide the data into
five equal parts, are called quintiles
Decile for ungrouped data
\[
\begin{aligned}
\text{Decile 1 } (D_1) &= \text{value of the } \frac{n+1}{10}\text{th item} \\
\text{Decile 2 } (D_2) &= \text{value of the } \frac{2(n+1)}{10}\text{th item} \\
\text{Decile 4 } (D_4) &= \text{value of the } \frac{4(n+1)}{10}\text{th item} \\
\text{Decile 9 } (D_9) &= \text{value of the } \frac{9(n+1)}{10}\text{th item}
\end{aligned}
\]

1 For the data set, calculate the third and sixth deciles: 54 39 53 42 55 82 58
81 61 67 74 93 68 70
Order: 39 42 53 54 55 58 61 67 68 70 74 81 82 93
Decile 3 (D3) = 3(n + 1)/10 th = 3 (14 + 1)/10 = 45/10 = 4.5th item
4.5th = 4th + 0.5 (5th – 4th) = 54 + 0.5 (55 - 54) = 54.5
Decile 6 (D6) = 6(n + 1)/10 th = 6 (14 + 1)/10 = 90/10 = 9th item = 68

QUARTILE

Similar to deciles, quartiles are a form of percentiles which split the data into
quarters, at the 25th, 50th and 75th percentiles
They divide data into four parts. The first quartile, Q1, is referred to as the lower
quartile and the third quartile, Q3, is known as the upper quartile. Q1 splits the data
into the lower 25% of the values and the upper 75% of the values. The upper
quartile subdivides the data into the lower 75% of the values and the upper 25%
The difference between the upper quartile and the lower quartile is known as
the inter-quartile range, which indicates the spread of the middle 50% of the data.
Inter-quartile range = Q3 – Q1
\[
\begin{aligned}
\text{Quartile 1 } (Q_1) &= \text{value of the } \frac{n+1}{4}\text{th item} \\
\text{Quartile 2 } (Q_2) &= \text{value of the } \frac{2(n+1)}{4}\text{th item} \\
\text{Quartile 3 } (Q_3) &= \text{value of the } \frac{3(n+1)}{4}\text{th item}
\end{aligned}
\]
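
Deciles and quartiles for ungrouped data use the same position rule with a different divisor, so one helper can cover both. The sketch below is illustrative: `quantile_by_position` is a hypothetical name, and it assumes linear interpolation between neighbouring items, as in the decile example above. It reuses the data set from that example.

```python
def quantile_by_position(data, i, parts):
    """Value of the i(n + 1)/parts-th item, interpolating between neighbours.

    parts=10 gives deciles, parts=4 gives quartiles, parts=100 gives percentiles.
    """
    ordered = sorted(data)
    n = len(ordered)
    pos = i * (n + 1) / parts
    pos = min(max(pos, 1), n)       # assumption: clamp positions outside 1..n
    whole = int(pos)
    frac = pos - whole
    if whole == n:
        return ordered[-1]
    return ordered[whole - 1] + frac * (ordered[whole] - ordered[whole - 1])

scores = [54, 39, 53, 42, 55, 82, 58, 81, 61, 67, 74, 93, 68, 70]
d3 = quantile_by_position(scores, 3, 10)   # 4.5th item
d6 = quantile_by_position(scores, 6, 10)   # 9th item
q1 = quantile_by_position(scores, 1, 4)    # 3.75th item
q3 = quantile_by_position(scores, 3, 4)    # 11.25th item
print(d3, d6)          # 54.5 68.0
print(q1, q3, q3 - q1) # 53.75 75.75 22.0
```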

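For grouped data
The following formulas are used to calculate quartiles, percentiles and deciles
Quartiles: the quartile class (Qi) is the class containing the (i x n)/4 th value
\[ Q_i = L + \left( \frac{\frac{i \times n}{4} - cf}{f} \right) \times \text{class interval}, \quad i = 1, 2, 3 \]
Where L – lower class boundary of the class, cf – cumulative frequency before the
class, f – frequency of the quartile class

Percentiles: the percentile class (Pi) is the class containing the (i x n)/100 th value
\[ P_i = L + \left( \frac{\frac{i \times n}{100} - cf}{f} \right) \times \text{class interval}, \quad i = 1, 2, \ldots, 99 \]

Deciles: the decile class (Di) is the class containing the (i x n)/10 th value
\[ D_i = L + \left( \frac{\frac{i \times n}{10} - cf}{f} \right) \times \text{class interval}, \quad i = 1, 2, \ldots, 9 \]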

Question

1 Calculate the quartile 3, percentile 40 and decile 6 of the following data
Class Frequency
4-9 3
10-14 6
15-19 8
20-24 2
25-29 4
2 The results show the age, in years, of patients attended to at the
pharmacy. Calculate the percentile 55
Age Frequency
0-9 15
10 - 19 26
20 - 29 23
30 - 39 13
40 - 49 17
50 - 59 34
60 - 69 19
70 - 79 8
80 - 89 11
20 Biostatistics: By Mwesigwa Wilson

MEASURES OF DISPERSION

The measures of central tendency are not adequate to describe data. Two data sets
can have the same mean but be entirely different; thus, to describe data, one
needs to know the extent of variability, given by measures of variation
Biological data, quantitative or qualitative, collected by measurement or counting
are very variable (they vary from person to person and group to group), such as
intelligence quotient, behavior, physical characters etc.
The variability in a sample could be due to biological factors and experimental methods.
Height, weight, blood pressure, WBC count, blood sugar, urea, cholesterol etc. vary
depending on age, sex, social status or nature of work
Experimental variability may be due to errors by the observer, the instruments or
equipment used, and sampling errors or bias. Measures of variability include
▪ Range
▪ Interquartile range
▪ Mean deviation
▪ Standard deviation
▪ Coefficient of variation
Other measures of variability of samples include: standard error of mean, standard
error of difference between two means, standard error of proportion, standard error
of difference between two proportions, standard error of correlation coefficient and
standard deviation of regression coefficient.

RANGE

Range is the simplest measure of dispersion and indicates the distance between the
lowest and the highest values. The range is the difference between the largest and the
smallest observation in the data. It is commonly used for biological characteristics
to define normal limits, such as blood pressure, blood sugar levels, menstrual cycle
days, bilirubin levels
The prime advantage of this measure of dispersion is that it is easy to calculate. On
the other hand, it has a lot of disadvantages: it is very sensitive to outliers and does
not use all the observations in a data set.

INTERQUARTILE RANGE

Interquartile range is defined as the difference between the 25th and 75th percentile
(also called the first and third quartile). Hence the interquartile range describes the
middle 50% of observations. If the interquartile range is large it means that the
middle 50% of observations are spaced wide apart.
The important advantage of interquartile range is that it can be used as a measure of
variability if the extreme values are not being recorded exactly and is also not
affected by extreme values.

[Figure: a rank-ordered data set divided into four parts of 25% each by Q1, Q2 (the median) and Q3]
Q1 – centermost value in the 1st half of the rank-arranged set
Q2 – the median of the data set
Q3 – centermost value of the 2nd half of the rank-arranged set
Interquartile range (IQR) = Q3 – Q1
Determining the IQR for an even or odd data set involves the following steps
1. Order the values from low to high
2. Locate the median and separate data set into separate halves
3. Find Q1 and Q3: the median value of the two halves
4. Find IQR

1. Given the following data of score on mid-term test. Find the interquartile
range: 7, 7, 5, 11, 2, 8, 6, 3, 10, 11, 4
Step 1: 2, 3, 4, 5, 6, 7, 7, 8, 10, 11, 11
Step 2: Median – 7
Step 3: 2, 3, 4, 5, 6 │7│7, 8, 10, 11, 11
Median for Q1 = 2, 3, 4, 5, 6 = 4
Median for Q3 = 7, 8, 10, 11, 11 = 10
Step 4 = IQR = Q3 – Q1 = 10 – 4 = 6

2. For the following data set on the weight in mg of tablets in
manufacture: 50, 54, 54, 50, 48, 45, 51, 47, 53. Find IQR
Step 1: 45, 47, 48, 50, 50, 51, 53, 54, 54
Step 2: Median = 50mg
Step 3: 45, 47, 48, 50, │50│ 51, 53, 54, 54
Median for Q1 = 45, 47, 48, 50 = (47 + 48) / 2 = 47.5
Median for Q3 = 51, 53, 54, 54 = (53 + 54) / 2 = 53.5
Step 4: IQR = Q3 – Q1 = 53.5 – 47.5 = 6
3. You are provided with the following data. Calculate the IQR
49, 21, 45, 27, 43, 25, 33, 32
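
The four steps (order, locate the median, take the median of each half, subtract) can be sketched in Python. This is a minimal illustration assuming, as in examples 1 and 2, that the median itself is left out of both halves when the number of values is odd; `iqr_by_halves` is a hypothetical name.

```python
from statistics import median

def iqr_by_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method."""
    ordered = sorted(values)          # Step 1: order from low to high
    half = len(ordered) // 2          # Step 2: size of each half
    lower = ordered[:half]            # values before the median position
    upper = ordered[-half:]           # values after the median position
    q1, q3 = median(lower), median(upper)   # Step 3
    return q1, q3, q3 - q1                  # Step 4

print(iqr_by_halves([7, 7, 5, 11, 2, 8, 6, 3, 10, 11, 4]))     # (4, 10, 6)
print(iqr_by_halves([50, 54, 54, 50, 48, 45, 51, 47, 53]))     # (47.5, 53.5, 6.0)
```

Other quartile conventions exist and can give slightly different values, so results from statistical software may not match this method exactly.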

IQR for grouped data

For grouped data with class intervals
Determine the class with the lower quartile: Q1 position = (1/4) x N = ....th value, located
from the cumulative frequency
Determine the class with the upper quartile: Q3 position = (3/4) x N = ....th value, located
from the cumulative frequency
Use the formulas to determine Q1 and Q3
\[ Q_1 = L + \left( \frac{\frac{N}{4} - F_1}{F_Q} \right) \times C \qquad Q_3 = L + \left( \frac{\frac{3N}{4} - F_3}{F_Q} \right) \times C \]
L – lower class boundary of quartile class
N – total frequency, C – class interval, FQ - frequency of quartile class
F1 = cumulative frequency of class before Q1 class
F3 = cumulative frequency of class before Q3 class

1. Find IQR
Score    F     Cf    Cb
20-29    4     4     19.5-29.5
30-39    8     12    29.5-39.5
40-49    20    32    39.5-49.5   (Q1 class)
50-59    16    48    49.5-59.5   (Q3 class)
60-69    9     57    59.5-69.5
70-79    3     60    69.5-79.5
Q1 position = (1/4) x 60 = 15th score (in class of 39.5-49.5)
Q3 position = (3/4) x 60 = 45th score (in class of 49.5-59.5)
Q1 = 39.5 + ((15 − 12) / 20) x 10 = 41
Q3 = 49.5 + ((45 − 32) / 16) x 10 = 57.6
IQR = Q3 – Q1 = 57.6 – 41 = 16.6
2 Find the IQR
Interval frequency
10-19 5
20-29 4
30-39 15
40-49 13
50-59 12
60-69 9
70-79 11
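
The grouped-data quartile calculation can be sketched in Python using the score table from example 1. The helper name `grouped_quantile` and the tuple layout are illustrative assumptions; the same function gives grouped deciles and percentiles by passing a different fraction.

```python
def grouped_quantile(classes, fraction):
    """Quantile of grouped data.

    classes: list of (lower boundary, upper boundary, frequency).
    fraction: 0.25 for Q1, 0.75 for Q3, 0.6 for D6, 0.4 for P40, and so on.
    """
    n = sum(f for _, _, f in classes)
    target = fraction * n            # position of the required value
    cf = 0                           # cumulative frequency before the class
    for lo, hi, f in classes:
        if f > 0 and cf + f >= target:
            return lo + (target - cf) / f * (hi - lo)
        cf += f
    raise ValueError("fraction must be between 0 and 1")

scores = [(19.5, 29.5, 4), (29.5, 39.5, 8), (39.5, 49.5, 20),
          (49.5, 59.5, 16), (59.5, 69.5, 9), (69.5, 79.5, 3)]
q1 = grouped_quantile(scores, 0.25)
q3 = grouped_quantile(scores, 0.75)
print(q1, q3, q3 - q1)   # 41.0 57.625 16.625
```

Rounded to one decimal place these agree with the worked values Q1 = 41, Q3 = 57.6 and IQR = 16.6.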

MEAN DEVIATION

This defines how far, on average, all values are from the mean. Procedure for
calculation
1. Find the mean (x̄) of the values (X) provided
2. Find the distance of each value from the mean by subtracting the mean from each
value (X - x̄). Ignore all the negative signs and take them as positive, i.e. |X - x̄|
3. Add all the values |X - x̄| to get ∑|X - x̄|
4. Mean deviation = ∑|X − x̄| / n; where n is the number of values or data provided
Note: This is not commonly used in statistical analysis

STANDARD DEVIATION

This is the most commonly used measure of dispersion; a measure of spread of data
about the mean. SD is the square root of the sum of squared deviations from the mean
divided by the number of observations (for a sample, by the number of observations
minus one). Procedure for calculation
1. Calculate the mean; x̄ = ∑X / n
2. Find the difference of each observation from the mean; (X - x̄)
3. Square the differences of observations from the mean; (X - x̄)²
4. Add the squared values to get the sum of squares of the deviations; ∑(X - x̄)²
5. Divide this sum by the number of observations minus one to get the mean-
squared deviation; Variance (s²) = ∑(X − x̄)² / (n − 1)
6. Find the square root of this variance to get the root-mean-squared deviation,
called the standard deviation.
Note: Because the deviations were squared, taking the square root reverses that step
and returns the result to the original units.
SD = √Variance

In formulas (n–1) is used instead of n in the denominator, as this produces a more
accurate estimate of the SD from sample data.
The direct method formula
\[ SD = \sqrt{ \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1} } \]
Where ∑X – sum of the data provided, ∑X² – sum of squares of the data and n –
number of data provided
A large SD shows that the measurements of the frequency distribution are
widely spread out from the mean. A small SD means the observations are closely
spread around the mean

SD summarizes the deviations of a large distribution from the mean in one figure,
which is used as a unit of variation and in determining a suitable sample size

1 The following are respiratory rates per minute at a certain hospital: 18, 15,
20, 21, 19, 17, 23, 16, 25, 24, 26 and 14. Find the mean respiratory rate per
minute and the standard deviation
X        (X - x̄)    (X - x̄)²
18       -1.8       3.24
15       -4.8       23.04
20       0.2        0.04
21       1.2        1.44
19       -0.8       0.64
17       -2.8       7.84
23       3.2        10.24
16       -3.8       14.44
25       5.2        27.04
24       4.2        17.64
26       6.2        38.44
14       -5.8       33.64
∑=238               ∑=177.68
Mean = ∑X / n = 238 / 12 = 19.8 respiratory rates per minute
SD = √Variance
SD = √(177.68 / (12 − 1)) = √(177.68 / 11) = √16.15 = 4.02

2 Using Direct method

X        X²
18       324
15       225
20       400
21       441
19       361
17       289
23       529
16       256
25       625
24       576
26       676
14       196
∑=238    ∑=4898
SD = √((4898 − (238 x 238) / 12) / (12 − 1))
SD = √((4898 − 4720) / (12 − 1)) = √(178 / 11)
SD = √16.18 = 4.02
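
Both routes to the standard deviation can be checked in Python. This is a minimal sketch using the respiratory-rate data above and the (n − 1) denominator; the function names are illustrative.

```python
import math

rates = [18, 15, 20, 21, 19, 17, 23, 16, 25, 24, 26, 14]

def sd_from_deviations(xs):
    # Steps 1-6: mean, deviations, squares, sum, divide by n - 1, square root
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sd_direct(xs):
    # Direct method: sqrt((sum of X^2 - (sum of X)^2 / n) / (n - 1))
    n = len(xs)
    return math.sqrt((sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1))

print(round(sum(rates) / len(rates), 1))     # 19.8
print(round(sd_from_deviations(rates), 2))   # 4.02
print(round(sd_direct(rates), 2))            # 4.02
```

The two methods are algebraically identical; small differences in hand calculations (16.15 versus 16.18 under the square root) come from rounding the mean and the intermediate values.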
3 You are provided with the data on menstrual cycle for a certain group of
girls in a certain secondary school 27, 24, 28, 29, 30, 25, 29, 31, 27, 31, 33,
28, 28, 29 and 25 in days. Calculate the mean, variance and standard
deviation

4 Calculate the variance and standard deviation of the following
Height   159   155   163   172   179   159
Weight   54    165   56    70    76    53
5 The following are readings for blood pressure (both systolic and diastolic)
and fasting blood sugar (FBS) during a diabetic and hypertensive clinic at
Kabarole hospital. Find mean blood pressure and FBS for the patients as
required during report writing
SNo Systolic mmHg Diastolic mmHg FBS mmol/L
1 160 90 8
2 140 85 7.3
3 180 100 8.6
4 220 105 9.0
5 165 95 5.6
6 150 85 6.9
7 170 110 7.8
8 200 115 8.6
9 120 80 5.9
10 110 75 6.5
6 Calculate the variance and standard deviation
X 180 155 170 174 160 172 166
Y 170 165 180 180 164 169 172
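SD for grouped data
\[ SD = \sqrt{ \frac{\sum f (m - \bar{x})^2}{n - 1} } = \sqrt{ \frac{\sum f m^2 - \frac{(\sum f m)^2}{n}}{n - 1} } \]
Where m or mid – midpoint of the class, f – frequency of the class, n = ∑f – total
frequency and x̄ – mean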

1. Find SD
Interval   f     m     fm      fm²
5-9        9     7     63      441
10-14      7     12    84      1008
15-19      10    17    170     2890
20-24      13    22    286     6292
25-29      7     27    189     5103
30-34      14    32    448     14336
35-39      8     37    296     10952
40-44      5     42    210     8820
45-49      6     47    282     13254
           ∑=79        ∑=2028  ∑=63096
SD = √((63096 − (2028 x 2028) / 79) / (79 − 1)) = 11.89
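
The grouped-data calculation can be reproduced with a few lines of Python. This sketch assumes the midpoints and frequencies from the table above and uses the computational form with the (n − 1) denominator; `grouped_sd` is an illustrative name.

```python
import math

# (midpoint m, frequency f) for each class interval in the table
table = [(7, 9), (12, 7), (17, 10), (22, 13), (27, 7),
         (32, 14), (37, 8), (42, 5), (47, 6)]

def grouped_sd(table):
    n = sum(f for _, f in table)                 # total frequency
    sum_fm = sum(f * m for m, f in table)        # sum of f * m
    sum_fm2 = sum(f * m * m for m, f in table)   # sum of f * m^2
    return math.sqrt((sum_fm2 - sum_fm ** 2 / n) / (n - 1))

print(round(grouped_sd(table), 2))   # 11.89
```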

2. Calculate the SD of the following data students score

Interval f
10-19 5
20-29 4
30-39 15
40-49 13
50-59 12
3. The table below shows the daily commuting times (in minutes). Calculate the
variance and standard deviation
Times Number of workers
5-14 25
15-24 14
25-34 32
35-44 17
45-54 10

COEFFICIENT OF VARIATION

This is the ratio of the standard deviation to the mean. The value is expressed as a
percentage and is used to compare the variability of one character in two different
groups having different magnitudes of values, such as height in adults and children
\[ \text{Coefficient of Variation} = \frac{SD}{\text{Mean}} \times 100 \]
1. In a research study, the following data was obtained about mean weight and SD in
children and adults. Calculate the coefficient of variation and state which series shows
greater variation
            Mean weight   SD
Children    23kg          5
Adult       55kg          8
CV children = (5 / 23) x 100 = 21.7%
CV adult = (8 / 55) x 100 = 14.5%
The weight in children shows greater variation
2. Provided with following data
Range (grams) Frequency
90-99 5
100-109 4
110-119 8
120-129 17
130-139 12
140-149 5
150-159 8
160-169 6
Calculate the Coefficient of Variation and Draw a frequency polygon
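
The coefficient of variation comparison in example 1 can be verified with a short Python sketch; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """SD expressed as a percentage of the mean."""
    return sd / mean * 100

cv_children = coefficient_of_variation(5, 23)
cv_adults = coefficient_of_variation(8, 55)
print(round(cv_children, 1))    # 21.7
print(round(cv_adults, 1))      # 14.5
print("children" if cv_children > cv_adults else "adults")   # children
```

Although adults have the larger SD in kilograms, the children's weights vary more relative to their mean.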

MEASURES OF DISEASE FREQUENCY

These are measures obtained from studies done in a population (a group of people with
some common characteristic, such as age, race, gender, or place of residence).
There are two classifications
Measures of disease frequency as mathematical quantities
▪ Counts
▪ Proportion (percentage)
▪ Rate
▪ Ratio
Measures of disease frequency in epidemiology
▪ Prevalence
▪ Incidence
Counts
This is the simplest and most basic measure; defined as the absolute number of
persons who have the characteristic of interest or disease
It is an important basic measure of disease frequency that is essential for detecting
trends or the sudden occurrence of a problem, such as an epidemic. Examples include
▪ 530 cases of Covid-19 in Kampala
▪ 60 confirmed Ebola cases reported in Kivu – DRC
▪ 136 cases of acute diarrhea reported in a camp
▪ 120 cases of cholera in IDP camps
Simple counts of the number of diseased people are also
▪ Important to public health planners and policy makers for assessing the need
for and allocation of resources in a population
▪ Used in surveillance of infectious disease for early detection of outbreaks
However, counts usually depend on the size of the population or risk area (the bigger
the size, the higher the number of cases, although this is not always true). The duration
of observation also affects the frequency of cases; the longer the observation period,
the more cases can occur.
Ratio
This is a value obtained by dividing one number by another
(either related or unrelated). A ratio doesn't necessarily imply any particular
relationship between the numerator and the denominator. A ratio is obtained by
dividing the number of one event by the number of another event.
In a study, there were 125 men and 175 women. The ratio of men : women = 125 : 175
= 5 : 7
Proportion
A type of ratio that relates a part to a whole; often expressed as a percentage (%)
Of 1216 females surveyed, only 243 reported using contraceptives. The
proportion of women who use contraceptives = (243 / 1216) x 100 = 20%

Rate
This is the frequency of events that occur in a defined time period, divided by the
average population at risk. In general usage, many measures referred to as rates (birth
malformation rate, crude death rate, smoking rate) are in reality proportions.
For example, the smoking rate among adults is actually the number of adults in a
population who smoke divided by the total number of adults in the population
\[ \text{Smoking rate} = \frac{\text{number of adults in a population who smoke}}{\text{total number of adults in the population}} \]

This is a proportion because the numerator is a subset of the whole.

Prevalence
Prevalence measures the proportion of individuals in a defined population that have
a disease or other health outcomes of interest at a specified point in time (point
prevalence) or during a specified period of time (period prevalence).
\[ \text{Prevalence} = \frac{\text{No. of cases observed at a point in time in the defined population}}{\text{Total number of persons in the defined population at the same point in time}} \]

Uses
▪ Prevalence is a useful measure for quantifying the burden of disease in a
population at a given point in time
▪ Useful in planning health services
Of 20,563 residents in Kampala town on 1st June 2021, 657 tested positive for
corona virus.
The prevalence of corona virus = (657 / 20563) x 100 = 3.2%
Incidence
This is a measure of the number of new cases of a disease or any other health
outcome of interest that develops in a population at risk during a specified time
period. There are two main measures of incidence: cumulative incidence and
incidence rate
Cumulative incidence (CI)
This is related to the population at risk at the beginning of the study period. It is
the proportion of individuals in a population (initially free of disease) who develop
the disease within a specified time interval. Cumulative incidence (incidence risk) is
expressed as a percentage (or, if small, per 1000 persons).
▪ This is also referred to as risk
\[ CI = \frac{\text{No. of new cases of a disease in a specified period of time}}{\text{No. of disease-free persons at the beginning of that period}} \]
The denominator may also be stated as the number of individuals in the population at
the beginning of the period.

1. At the beginning of May 2021, there were 237 patients in a certain hospital.
By 15th May 2021, 39 patients had tested positive for corona virus. What is
the cumulative incidence of corona virus?
CI = (39 / 237) x 100 = 16.5%

2. The population of Kyenjojo Village was 18,922 in Feb 2021. Among them,
1253 had contracted malaria by Apr 2021. What is the cumulative
incidence of malaria?

Incidence rate
This is the measure of the frequency of new cases of disease in a population that
takes into account the sum of the time that each person remained under observation
and at risk of developing the outcome under investigation. It has dimensions: the
denominator is measured in person-time, usually person-years
\[ \text{Incidence rate} = \frac{\text{Number of new cases of disease in a given time period}}{\text{Total person-time at risk during the follow-up period}} \]

The incidence rate is the rate of contracting the disease among those still at risk.
When a study subject develops the disease, dies or leaves the study they are no
longer at risk and will no longer contribute person-time units at risk.
Person-time at risk combines the number of persons at risk with the time each was at
risk during the given period; it is determined by summing the time that each person
remained under observation and at risk

1. In 2020, there were 35 new cases of HIV among youth aged 18-25 years in
Kyegegwa. The person-years at risk in that group was 3,525. Calculate the
incidence rate
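
Prevalence, cumulative incidence and incidence rate differ only in what goes into the numerator and denominator, which a short Python sketch can make explicit. The function names and the `per` multiplier are illustrative assumptions, and the person-time figures in the last example are hypothetical, not from the handbook.

```python
def prevalence(cases, population, per=100):
    """Existing cases divided by the population at the same point in time."""
    return cases / population * per

def cumulative_incidence(new_cases, at_risk_at_start, per=100):
    """New cases divided by the disease-free population at the start of the period."""
    return new_cases / at_risk_at_start * per

def incidence_rate(new_cases, person_time, per=1000):
    """New cases divided by the total person-time at risk."""
    return new_cases / person_time * per

print(round(prevalence(657, 20563), 1))          # 3.2  (% in the Kampala example)
print(round(cumulative_incidence(39, 237), 1))   # 16.5 (% in the hospital example)

# Hypothetical follow-up: years at risk contributed by five people,
# two of whom became new cases during the study.
years_at_risk = [2.0, 3.5, 1.0, 4.0, 2.5]
print(round(incidence_rate(2, sum(years_at_risk)), 1))   # 153.8 per 1000 person-years
```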

Prevalence vs incidence
The proportion of the population that has a disease at a point in time (prevalence)
and the rate of occurrence of new disease during a period of time (incidence) are
closely related
Prevalence depends on:
▪ The incidence rate
▪ The duration of disease
For example, if the incidence of a disease is low but the duration of disease (i.e. until
recovery or death) is long, the prevalence will be high relative to the incidence.
For example, diseases like tuberculosis tend to persist for a longer duration, from
months to years, hence the prevalence (old and new cases) would be higher than the
incidence.
Conversely, if the incidence of a disease is high and the duration of the disease is
short, such as diarrhea, the prevalence will be low relative to the incidence
∴ Prevalence ≈ Incidence x Duration (when incidence and duration are stable over time)

Others include
Morbidity: This is the state of having a specific illness or disease condition.
Morbidity is used to describe the health of a large population
\[ \text{Morbidity} = \frac{\text{Number of persons with disease condition}}{\text{No. of persons in population at risk}} \times 1000 \]

Mortality rate: This is the frequency of death from a particular disease relative to
the number of persons in the population
Fatality rate: This is mortality among cases of a particular disease
\[ \text{Fatality rate} = \frac{\text{Number of deaths from specific disease}}{\text{Number of persons with the same disease in the same period of time}} \times 1000 \]

Maternal mortality rate
\[ = \frac{\text{Number of pregnancy-related deaths}}{\text{Number of live births in the same period}} \times 1000 \]

Infant mortality rate
\[ = \frac{\text{No. of deaths of infants} < 1 \text{ year of age}}{\text{No. of live births in same period}} \times 1000 \]

Birth rate
\[ = \frac{\text{Number of births in a year}}{\text{Mid-year population}} \times 1000 \]

Neonatal mortality rate
\[ = \frac{\text{Number of deaths in a year} < 28 \text{ days of age}}{\text{Number of live births}} \]

Case fatality rate
\[ = \frac{\text{No. of deaths in a year from a particular disease}}{\text{Number of cases of that disease in the specific period}} \times 1000 \]

Age specific death rate
\[ = \frac{\text{No. of deaths in specific age group}}{\text{Mid-year population in that age group}} \times 1000 \]

1. In a study, the number of deaths in the age group of 30-35 in 2020 was 1256. The
estimated mid-year population in the same group was 154,762. Calculate the
age-specific death rate

2. Out of 2863 births in western Uganda, 134 mothers died during
childbirth. Calculate the maternal mortality rate

3. In a certain sub-county of 5389 people, 20 persons were found to have
leprosy. Calculate the morbidity

SAMPLING

In research, it is rarely possible to collect data from every member of the population
(usually populations are so large that a researcher cannot examine the entire group);
instead, a sample is selected from the group to act as representative
Sampling
This is a technique of selecting individual members or a subset of the population to
make statistical inferences from them and estimate characteristics of the whole
population.
Population: The entire group of individuals
Sample
This is the selected group or items to represent the population in a research study or
specific group of individuals or items for data collection. The goal is to use the
results obtained from the sample to help answer questions about the population.
Sampling frame
This is the actual list of individuals that the sample will be drawn from. For example,
if you are doing research on working conditions at a pharmaceutical manufacturing
plant, the population is all 1152 employees of the industry and the sampling frame is
the industry’s HR database
For example, if a drug manufacturer would like to research the adverse side effects
of a drug on the country’s population, it is almost impossible to conduct a research
study that involves everyone. In this case, the researcher selects a sample of people
from each demographic and then studies them, obtaining indicative
feedback on the drug’s behavior
Sample size: the actual number of individuals to be included in the study

Purpose of sampling
▪ Sampling makes possible the study of a large population with different
characteristics.
▪ Sampling saves resources
▪ Sampling saves the time that would be needed to study the entire population.

SAMPLING METHODS

▪ Probability sampling.
▪ Non-probability sampling
PROBABILITY SAMPLING
In this sampling method, all the members have an equal opportunity to be a part of
the sample. This is a sampling technique where a researcher sets a few selection
criteria and chooses members of a population randomly. The method is based on the
theory of probability

Why use probability sampling
▪ Reduce sample bias
▪ Diverse sample is drawn from the population
▪ Create an accurate sample
There are five types of probability sampling techniques:
▪ Simple random sampling
▪ Cluster sampling
▪ Systematic sampling
▪ Stratified random sampling
▪ Multi-Stage Random sampling
Simple random sampling
In this method, every single member of a population has an equal chance of being
selected. Each individual has the same probability of being chosen to be a part of a
sample.

For example, in a class of 150 students, a sample of 50 is to be selected for
research. Chits are written and placed in a box or bowl and each member picks one.
In this case, each of the 150 students has an equal opportunity of being selected.
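
In practice the chit-drawing is often done with a random number generator. The following Python sketch assumes the students are numbered 1 to 150; the fixed seed is an assumption added only so that the draw can be repeated.

```python
import random

students = list(range(1, 151))       # students numbered 1..150
rng = random.Random(2021)            # fixed seed (assumption) for a repeatable draw
sample = rng.sample(students, 50)    # 50 students drawn without replacement

print(len(sample), len(set(sample)))   # 50 50
```

Because `sample` draws without replacement, no student can be selected twice and every student has the same chance of being chosen.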
Merits
▪ Better chances that the sample represents the whole population
▪ Can be concluded in shorter time duration
▪ It is less expensive to carry out the method and an easier way of sampling
▪ Sampling can be done with less technicality
Demerits
▪ The sample selected may have few variations
▪ Contacting all members of population can be difficult
Systematic sampling
This is similar to simple random sampling. Every member of the population is listed
with a number, and members of the sample are selected at a regular interval, such as
every 10th number.
For example, in a class of 150 students, each student is given a number from 1 to 150
based on alphabetical order. If the starting point is 1, every 3rd student is added to
the sample until you get 50 students
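
The class example can be sketched in Python as follows. The function name is illustrative, and the sketch assumes the population is at least as large as the sample; the sampling interval is the population size divided by the sample size (150 / 50 = 3).

```python
def systematic_sample(population, sample_size, start=1):
    """Select every k-th member, beginning with the start-th member (1-based)."""
    k = len(population) // sample_size     # sampling interval
    return population[start - 1::k][:sample_size]

students = list(range(1, 151))             # numbered in alphabetical order
sample = systematic_sample(students, 50, start=1)
print(sample[:5], sample[-1], len(sample))   # [1, 4, 7, 10, 13] 148 50
```

A common refinement is to pick the starting point at random from the first k members rather than always starting at 1.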
Merits
▪ Avoids judgments or bias
▪ Cost effective
▪ Time and work involved is "relatively less”
▪ The design is simple and convenient to adopt
Demerits
▪ Less representative if population has hidden variables
▪ Only specific types of items may be included in the sample

Stratified random sampling


This is a method in which the researcher divides the population into smaller,
non-overlapping groups that differ in important variables but together represent the
entire population. While sampling, these groups are organized and a sample is then
drawn from each group separately.
This method is followed when the population is not homogeneous. The population
under study is first divided into homogeneous groups or classes called strata and the
sample is drawn from each stratum at random in proportion to its size
For example, in a class of 150 students there are 90 females and 60 males, and 60
students are to be selected for a sample with proportional gender
representation. The students are divided into strata of male and female. Then 36
females and 24 males are selected from their respective groups using either simple
random or systematic sampling methods
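
Proportional allocation across strata can be sketched in Python as below. The function name, the member labels (F1, M1, ...) and the seed are illustrative assumptions; simple random sampling is used within each stratum.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """strata: dict mapping stratum name -> list of members."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # proportional allocation; rounded shares may not always add up
        # exactly to sample_size for other stratum sizes
        share = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, share)
    return sample

strata = {
    "female": [f"F{i}" for i in range(1, 91)],   # 90 females
    "male": [f"M{i}" for i in range(1, 61)],     # 60 males
}
chosen = stratified_sample(strata, 60, seed=1)
print({name: len(members) for name, members in chosen.items()})
# {'female': 36, 'male': 24}
```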
Merits
▪ Ensures that no section of the population is underrepresented or
overrepresented.
▪ Provides greater level of accuracy
Demerits
▪ Tedious and time consuming
▪ Requires more resources
▪ Sometimes hard/ difficult to classify each kind of population into clearly
distinguished classes
Cluster sampling
This is a method where the entire population is divided into sections or clusters,
and each cluster should have similar characteristics to the whole population.
Entire clusters are randomly selected to be included in the sample
instead of selecting individual members
Clusters can be villages, wards, blocks, children of a school etc.
Merits
▪ Less costly as it reduces field visits to many units
▪ Useful when population are large and spread over large geographical region
Demerits
▪ The groups or clusters selected may not be representative of the entire units
of study
Multistage sampling
This involves the use of several random sampling techniques in the selection of a sample.
Selection is done in stages until the final sampling units are arrived at.
The primary sampling unit is the sampling unit (usually of large size) selected in the
first sampling stage; secondary sampling units are then selected within it to arrive at
the sample

NON-PROBABILITY SAMPLING
In this sampling method, not every member has an equal chance of being selected or
included in the sample.
Types include
▪ Judgmental or purposive sampling
▪ Convenience sampling
▪ Snowball sampling
▪ Quota sampling
▪ Consecutive sampling
Convenience sampling
A convenience sample simply includes the individuals who happen to be most
accessible to the researcher such as surveying patients at certain hospital or passers-
by on a busy street. Researchers have nearly no authority to select the sample
elements, and it’s purely done based on proximity and not representativeness.
Convenience sampling is the option that’s most useful for pilot testing; may be
referred to as accidental, opportunity, or grab sampling
Data gathering with this method comes from people that are the easiest to reach or
contact. No criteria are in place for this sampling method beyond the willingness
and availability of people to participate in the work.
Advantages
▪ The sampling method is easier
▪ Affordable or cheap way to gather data
▪ Saves time when gathering data
▪ Useful as an intervention to correct dissatisfaction
▪ Provides qualitative information
Disadvantages
▪ Sample doesn’t provide a representative result of the entire population
▪ May provide false data
▪ Results obtained may not be replicated in other settings
▪ Potential for bias during data collection
Judgmental or purposive sampling
This involves the researcher using their expertise to select a sample that is most
useful to the purposes of the research. It is often used in qualitative research, where
the researcher wants to gain detailed knowledge about a specific phenomenon rather
than make statistical inferences, or where the population is very small and specific.
An effective purposive sample must have clear criteria and rationale for inclusion.
For instance, when researchers want to understand the thought process of people
interested in studying for their master’s degree. The selection criteria will be: “Are
you interested in doing your masters in …?” and those who respond with a “No” are
excluded from the sample.

Snowball sampling
This method is used when the population is hard to access: existing participants
are used to recruit further participants.
The technique is widely used in qualitative research with populations that are hard
to locate. For example, one participant identifies other people in the same situation
as themselves, informs them about the benefits of the study and reassures them of
confidentiality. It is used in studies of drug abusers, sex workers, homeless people,
HIV patients etc.
Advantages
▪ Allows for studies to be carried out in situations where there may be limited
participants or hard-to-trace participants
▪ Cost effective as the referrals are obtained from a primary data source
▪ Quicker to find participants as they come from reliable sources
Disadvantages:
▪ Some participants may refuse to participate in the study
▪ The sample obtained may be small, leading to potential sampling bias and a
large margin of error; the study may provide inconclusive results.
Consecutive sampling
This is similar to convenience sampling, in that participants are selected at the
convenience of the researcher. The research is conducted on one sample over a period
of time, data are collected, and the researcher then moves on to another sample.
Quota sampling
In this method, the sample size is determined first and then a quota is fixed for
each category of the population, which is followed while selecting the sample.
Because the sample is formed on the basis of specific attributes, it will have the
same qualities as the total population with respect to those attributes.
The method reduces the cost of preparing the sample and of field work, since the
ultimate units can be selected so that they are close together. However, there may
be bias in sample selection.

SAMPLE SIZE DETERMINATION

Activity
✓ Explain the factors which determine the sample size for a study

Terms used in sample size determination


▪ Population size: the number of people who fit the demographic under study,
e.g. the number of people in an area or the number with a given disease. This
can be obtained from official texts
▪ Confidence level: how sure we can be that the true population value lies
within the margin of error. For example, at a 95% confidence level, about 95
out of 100 samples drawn in the same way would give an interval that contains
the true value.
▪ The margin of error (confidence interval): the maximum amount by which the
sample estimate is expected to differ from the true population value; it
defines the sampling error
▪ Standard deviation: the measure of the dispersion of a data set from its mean.
Cochran’s sample size formula
This allows the calculation of an ideal sample size given a desired level of precision,
desired confidence level, and the estimated proportion of the attribute present in the
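population. This is appropriate in situations with large populations.

$$ n_0 = \frac{Z^2 \, p \, q}{e^2} $$

Where:
▪ n0 is the required sample size,
▪ Z is the z-value for the chosen confidence level,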
▪ e is the desired level of precision (the margin of error),
▪ p is the (estimated) proportion of the population which has the attribute in
question,
▪ q is 1 – p.
The z-value is found in a Z table. The confidence level corresponds to a Z-score
Confidence level Z-score
90% 1.645
95% 1.96
99% 2.576

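Calculate the sample size given p = 0.5, precision of 5% and confidence level of 95%
Z = 1.96, p = 0.5, q = 1 − 0.5 = 0.5, e = 5% = 0.05

$$ n_0 = \frac{(1.96)^2 \times 0.5 \times 0.5}{(0.05)^2} = \frac{0.9604}{0.0025} = 384.16 \approx 385 $$
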
Modification for smaller population

$$ n = \frac{n_0}{1 + \frac{n_0 - 1}{N}} $$

Where:
▪ n0 is Cochran’s sample size recommendation
▪ N is the population size
▪ n is the new, adjusted sample size
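The calculation can also be sketched in code. The snippet below is a minimal illustration, assuming the usual forms of Cochran's formula, n0 = Z²pq/e², and of the small-population adjustment, n = n0 / (1 + (n0 − 1)/N). The function names and the population size of 1000 are hypothetical and chosen only for illustration.

```python
import math

# Z-scores for the confidence levels tabulated in the text.
Z_SCORES = {90: 1.645, 95: 1.96, 99: 2.576}


def cochran_sample_size(p, e, confidence=95):
    """Cochran's n0 = Z^2 * p * q / e^2, rounded up to a whole subject."""
    z = Z_SCORES[confidence]
    q = 1 - p
    n0 = (z ** 2) * p * q / (e ** 2)
    # round() guards against tiny floating-point excess before rounding up
    return math.ceil(round(n0, 6))


def adjust_for_small_population(n0, population_size):
    """Adjusted n = n0 / (1 + (n0 - 1) / N), rounded up."""
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(round(n, 6))


# Worked example from the text: p = 0.5, e = 0.05, 95% confidence.
n0 = cochran_sample_size(p=0.5, e=0.05, confidence=95)

# Hypothetical population of 1000 people (not from the text).
n = adjust_for_small_population(n0, population_size=1000)
print(n0, n)
```

Rounding up rather than to the nearest whole number is a common convention, since a fraction of a participant cannot be recruited.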

STUDY DESIGNS

A study design is a plan for conducting research that allows a conceptual
hypothesis to be translated into an operational hypothesis.
Study designs are methods of collecting data on specific variables in a population
with respect to time, exposure and outcome; some also involve studying interventions
or exposures applied to specific groups.

Activity
✓ Explain the factors to consider when choosing a study design

TYPES OF STUDY DESIGNS

1. Descriptive observational study designs
2. Analytical observational study designs
3. Analytical or experimental (interventional) study designs
DESCRIPTIVE OBSERVATIONAL STUDY DESIGNS
These are usually carried out on a group or an individual patient to describe an
event, variable or problem with respect to time or place. The study does not try to
quantify a relationship but gives a picture of what is happening in a population,
e.g. the prevalence, incidence, or experience of a group.
Sub-types include
▪ Case report
▪ Case series
▪ Cross-sectional studies
Case-report study
This usually describes a single unique case or finding of interest, such as the
clinical course or prognosis of a case. It involves writing up the patient's clinical
history: demographics, presenting features, differential diagnosis, laboratory tests
and investigation results, and diagnosis.
Merits
▪ Easy and inexpensive to do in hospital
▪ Provides information on new disease or new therapy
▪ Helpful in hypothesis formation
Demerits
▪ Biased selection of subjects so that conclusions are difficult to generalize
Case-series study
This is a descriptive study that reports on data from a group of individuals who have
a similar disease or condition. This enables cases to be examined one after another,
which can lead to the generation of hypotheses about the cause.

A case series informs patients and physicians about history and prognosis, is
inexpensive to carry out, and is also used in hypothesis generation. However, the
cases may not be representative of the population.
Cross-sectional study
A study that examines the relationship between diseases (or other health-related
characteristics) and other variables of interest as they exist in a defined population at
one particular time (i.e. exposure and outcomes are both measured at the same time).
It is best for quantifying the prevalence of a disease or risk factor, and for
quantifying the accuracy of a diagnostic test.
In a cross-sectional study, data are collected at one point in time.
Advantages:
▪ Easy or simple to perform
▪ It takes less time to perform
▪ Cheap or inexpensive as compared to analytical studies
▪ Ethically safe
▪ Hypothesis can be easily generated
▪ Useful in determining the prevalence of the disease
Disadvantages:
▪ Establishes association at most, not causality or cause and effect
▪ There may be bias
ANALYTICAL OBSERVATIONAL STUDY DESIGNS
These study designs are useful for testing etiological hypotheses, such as:
▪ Whether there is an association between an exposure or risk factor and a
disease outcome, and whether that association could be due to chance
▪ The strength of the association
Sub-types include
▪ Case control studies
▪ Cohort studies
Case-control study
This is a form of observational study that aims to identify risk factors for developing
the outcome of interest. Subjects with the outcome (cases) and without the outcome
(controls) are selected and risk factor exposure measurements are collected
retrospectively in both groups either from the subject or from any available records.
Cohort study
Cohort studies evaluate a possible association between exposure and outcome by
following a group of exposed individuals over a period of time (often years) to see
whether they develop the disease or outcome of interest. A cohort is a group of
individuals who share a common characteristic, and may be chosen based on a
population definition, or based on a particular exposure

There are two sub-types: prospective and retrospective cohort studies.

In prospective cohort studies the investigators conceive and design the study,
recruit subjects, and collect baseline exposure data on all subjects, before any of the
subjects have developed any of the outcomes of interest.
The subjects are then followed into the future in order to record the development of
any of the outcomes of interest. The follow up can be conducted by mail
questionnaires, by phone interviews, via the Internet, or in person with interviews,
physical examinations, and laboratory or imaging tests. Combinations of these
methods can also be used.
Advantages
▪ Incidence and prevalence of a disease can be easily calculated.
▪ Multiple diseases and outcomes can be studied at the same time.
▪ Does not raise the ethical issues associated with randomized controlled trials
Disadvantages
▪ Prone to bias, e.g. selection bias
▪ Cohort studies can be expensive and time consuming.
▪ Strict follow-up is required
▪ Sample sizes required are usually very large.
A retrospective cohort study (also known as a historical cohort study) is a study in
which the exposure and the outcome have both already occurred when the study begins.
The investigators use existing records to look back into the past, establish who was
exposed and when, and determine which participants went on to develop the disease or
outcome.
Advantages
▪ It is less expensive to carry out
▪ Less time consuming
▪ Incidence and prevalence of a disease can be easily calculated.
▪ Multiple diseases and outcomes can be studied at the same time.
▪ Does not raise the ethical issues associated with randomized controlled trials
Disadvantages
▪ Missing data can affect results
ANALYTICAL OR EXPERIMENTAL STUDY DESIGNS
Also referred to as interventional studies, these compare an experimental group (with
the intervention) and a control group (without the intervention), such as in drug
testing.
Types of experimental study designs include
▪ Randomized controlled clinical trials
▪ Quasi-experimental
Randomized Controlled Trial (RCT)
An experimental comparison study in which participants are allocated to
treatment/intervention or control/placebo groups using a random mechanism. It is
best for studying the effect of an intervention. This is also referred to as a
true-experimental design.
40 Biostatistics: By Mwesigwa Wilson

This involves a pre-test of the study group to determine the baseline information.
The participants are then randomly assigned to either the test or experimental group
(given the treatment) or the control group (without treatment), under the same
conditions.
A post-test is carried out on the members of the two groups to determine the effect
of the intervention or treatment, and sometimes the cause-and-effect relationship.
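The random allocation step described above can be sketched as follows. This is a minimal illustration of simple randomization into two equal arms, not a full trial procedure; the participant IDs, the function name and the seed are hypothetical.

```python
import random


def randomize(participants, seed=None):
    """Randomly split participants into a treatment arm and a control arm."""
    rng = random.Random(seed)      # a fixed seed makes the allocation reproducible
    shuffled = list(participants)  # copy so the original list is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Ten hypothetical participant IDs: P01 ... P10
participant_ids = [f"P{i:02d}" for i in range(1, 11)]
groups = randomize(participant_ids, seed=2021)
print(groups)
```

Because allocation depends only on the shuffle, neither the investigator nor the participant chooses the arm, which is what allows confounders to be distributed without systematic bias.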
Merits
▪ Unbiased distribution of confounders
▪ Blinding more likely
▪ Randomization facilitates statistical analysis
Disadvantages:
▪ Expensive to conduct
▪ It takes time
▪ Ethically problematic at times.
Quasi-experimental design
The prefix quasi means “resembling.” Thus quasi-experimental research is research
that resembles experimental research but is not true experimental research.
Quasi-experimental design aims to establish a cause-and-effect relationship between
an independent and dependent variable.
However, unlike a true experiment, a quasi-experiment does not rely on random
assignment. Instead, subjects are assigned to groups based on non-random criteria.

BLINDING

Blinding in research refers to a practice where study participants, investigators and
assessors are kept unaware of the assigned intervention, or of other information that
may affect the results.
Types include
▪ Single-blind: The study participants do not know whether they are assigned
to the study or the control group, meaning they do not know whether they are
getting the drug under investigation or a placebo. The investigator knows who
is getting the drug under investigation and who is getting a placebo
▪ Double-blind: Neither the participants nor the investigator know which
treatment is being given or which group a participant is assigned to. Only the
statistician may know which intervention is used
Blinding of one or more parties is done to prevent observer bias. This refers to the
fact that most (if not all) researchers will have some expectations regarding the
effectiveness of an intervention. Blinding of observers provides a strategy to
minimize this form of bias
Blinding is also done to address or control for the placebo effect, a phenomenon in
which a simulated (and ineffective) treatment can sometimes improve a patient’s
condition, simply because the person has the expectation that it will be beneficial.
Expectation is key in the placebo effect.

BIAS AND CONFOUNDING

Different factors may affect the outcome or results between two variables under
study
Bias
A systematic error in the design, recruitment, data collection or analysis that results
in a mistaken estimation of the true effect of the exposure on the outcome.
Bias limits validity (the ability to measure the truth within the study design) and
generalizability (the ability to confidently apply the results to a larger population) of
study results. Groups or categories of bias include: selection and information bias
Information bias
This is error due to inaccurate measurement or the way data is obtained from
different study groups. Errors in measurement are also known as misclassification.
Sub-types include
Observer bias: the result of the investigator’s prior knowledge of the hypothesis
under investigation or knowledge of an individual's exposure or disease status. Such
knowledge may influence the way information is collected, measured or
interpreted by the investigator for each of the study groups.
For example, in a trial of a new medication to treat blood sugar, if the investigator is
aware which treatment arm participants were allocated to, this may influence their
reading of blood sugar levels. Observers may underestimate the blood sugar levels
in those who have been treated, and overestimate them in those in the control group.
Interviewer bias: occurs where an interviewer asks leading questions that may
systematically influence the responses given by interviewees
Recall (or response) bias: this is common in case-control studies, where data on
exposure are collected retrospectively. Recall bias occurs when the accuracy of the
information provided on exposure differs between the cases and controls, or when
participants cannot remember exposures accurately and so provide inaccurate
information
Missing data bias: this is due to certain information not being recorded for every
participant
Social desirability bias: occurs where participants in a study answer in a manner they
feel will be seen as favorable by others or give the answers the investigator wants to
hear
Detection bias: occurs due to different techniques being used to measure the outcome
in the two groups
Reporting bias: individuals selectively suppress or reveal information such as
smoking history
Instrument bias: this is where an inadequately calibrated measuring instrument
systematically over/ under-estimates measurement.

Minimizing bias
▪ Where possible, observers should be blinded to the exposure and disease
status of the individual
▪ Use of standardized questionnaires or pretested questionnaires
▪ Use calibrated instruments, such as sphygmomanometers.
▪ Training of interviewers.
▪ All data should be recorded for every participant
▪ Develop standard tools for the collection, measurement and interpretation of
information.
Selection bias
This occurs when the study population is not representative of the target population
so that the measure of variable does not accurately represent the target population to
which conclusions are being extended.
Selection bias is a potential problem wherever individuals are selected for inclusion
in a study because of the presence or absence of certain characteristics.
Confounding
A situation in which the effect or association between an exposure and outcome is
distorted by the presence of another variable. A confounder is an extraneous variable
that wholly or partially accounts for the observed effect of a risk factor on disease
status. The presence of a confounder can lead to inaccurate results.
▪ Positive confounding is when the observed association is biased away from
the null. In other words, it overestimates the effect.
▪ Negative confounding is when the observed association is biased towards
the null. In other words, it underestimates the effect.
Effect modification
An effect modifier is a variable that differentially (positively or negatively)
modifies the observed effect of a risk factor on disease status. When effect
modification is present, different groups have different risk estimates.

NORMAL DISTRIBUTION CURVE

The normal distribution is a continuous probability distribution that is symmetrical
on both sides of the mean, so the right side of the centre of the curve is a mirror
image of the left side.
The normal distribution is often called the bell curve because the graph of its
probability density looks like a bell. It is also known as the Gaussian distribution,
after the German mathematician Carl Friedrich Gauss.
In a large population, observations are often normally distributed around the mean.
This is common in continuous variables such as age, height, weight etc. If a plot
is made:
▪ Some observations are above the mean and others are below it; the highest
frequencies are seen in the middle, around the mean, with fewer at the
extremes, decreasing smoothly on both sides.
▪ About half the observations lie above and half below the mean, and the
observations are symmetrical on each side of the mean
Or
▪ Most of the continuous data values in a normal distribution tend to cluster
around the mean, and the further a value is from the mean, the less likely it
is to occur. The tails are asymptotic, which means that they approach but
never quite meet the horizontal axis (x-axis)
Expression of normal distribution in terms of mean and standard deviation
▪ Mean ± 1 SD limits include about 68%, or roughly 2/3, of all the observations.
Observations more than 1 SD above or below the mean are fairly common
▪ Mean ± 2 SD limits include about 95% of observations, while about 5% of
observations will be outside these limits. Values that differ from the mean by
more than twice the standard deviation are rare. More precisely, mean ± 1.96
SD limits include 95% of all observations.
▪ Mean ± 3 SD limits include 99.7%. Values more than 3 SD above or below the
mean are very rare, being only 0.3%. Mean ± 2.58 SD limits include 99%.
Example

1 At a certain institution, the mean weight of the students was 55 kg with a standard
deviation of 5 kg. Determine the lower and upper limits containing 68%, 95% and
99.7% of the observations
Mean ± 1 SD = 55 ± 5 = 50 to 60 kg, which contains 68% of all the observations
Mean ± 2 SD = 55 ± 5 x 2 = 55 ± 10 = 45 to 65 kg, which includes 95% of
observations
Mean ± 3 SD = 55 ± 5 x 3 = 55 ± 15 = 40 to 70 kg, which includes 99.7% of
observations
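The limits in this example can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module with the mean of 55 kg and SD of 5 kg from the example; the helper names `limits` and `coverage` are hypothetical.

```python
from statistics import NormalDist

# Student weights from the example: mean 55 kg, standard deviation 5 kg.
weights = NormalDist(mu=55, sigma=5)


def limits(dist, k):
    """Lower and upper limits at mean - k*SD and mean + k*SD."""
    return dist.mean - k * dist.stdev, dist.mean + k * dist.stdev


def coverage(dist, k):
    """Proportion of observations expected between the two limits."""
    low, high = limits(dist, k)
    return dist.cdf(high) - dist.cdf(low)


for k in (1, 1.96, 2, 2.58, 3):
    low, high = limits(weights, k)
    print(f"mean ± {k} SD: {low:.1f} to {high:.1f} kg, "
          f"covers {coverage(weights, k):.1%}")
```

The coverage depends only on k, the number of standard deviations, so the same proportions apply to any normally distributed variable.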

Properties of the normal curve


▪ It is bell-shaped
▪ It is symmetrical
▪ Mean, mode and median coincide; are all equal
▪ It has two points of inflection, one at mean − 1 SD and one at mean + 1 SD.
The central part is convex, and at the points of inflection the curve changes
from convex to concave
▪ The area under the curve is 1
The shape of normal distribution or curve is very useful in practice and makes
statistical analysis easy. It tells the probability of occurrence by chance or how often
an observation, measured in terms of mean and standard deviation can occur
normally in a population

SKEWNESS AND KURTOSIS

Skewness and kurtosis are important descriptive statistics of a data distribution


Skewness
This is a measure of symmetry of data distribution. Skewness refers to a distortion
or asymmetry that deviates from the symmetrical bell curve, or normal distribution,
in a set of data. A distribution, or data set, is symmetric if it looks the same to the
left and right of the center point or the mean.
The skewness for a normal distribution is zero, and any symmetric data should have
a skewness near zero. Negative values for the skewness indicate data that are
skewed left and positive values for the skewness indicate data that are skewed right.
By skewed left, we mean that the left tail is long relative to the right tail. Similarly,
skewed right means that the right tail is long relative to the left tail.

Positive Skewness
This is when the tail on the right side of the distribution is longer or fatter. A
positively skewed (or right-skewed) distribution is one in which most of the values
are clustered towards the left of the distribution while the right tail is longer.
The mean lies to the right of the peak value. The mean and median will be greater
than the mode.
Negative Skewness
This is when the tail on the left side of the distribution is longer or fatter than
the tail on the right side. The mean and median will be less than the mode.

Kurtosis
This is a measure of whether the data are heavy-tailed or light-tailed relative to a
normal distribution. In other words, kurtosis identifies whether the tails of a given
distribution contain extreme values.
That is, data sets with high kurtosis tend to have heavy tails, with more outliers
than a normal distribution (i.e. values 3 or more standard deviations from the mean).
Data sets with low kurtosis tend to have light tails, with fewer outliers than a
normal distribution.
The normal distribution has a kurtosis of 3. Excess kurtosis is the kurtosis minus 3,
so the normal distribution has an excess kurtosis of zero.

Types of kurtosis are defined by the excess kurtosis; they include
Mesokurtic
Data that show an excess kurtosis of zero or close to zero. This distribution
has tails similar to those of the normal distribution. The kurtosis is close to 3
Leptokurtic
The prefix "lepto-" means "skinny". Data that show a positive excess kurtosis,
with heavy tails on either side, indicating long tails or large outliers which stretch
the horizontal axis of the graph, making the bulk of the data appear in a narrow
(skinny) vertical range. The kurtosis is greater than 3
Platykurtic
The prefix "platy-" means "broad". Data with this distribution show a negative
excess kurtosis, with few or small outliers giving flat, short tails. The kurtosis is
less than 3
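Skewness and kurtosis can be computed directly from a data set. The sketch below is one common moment-based definition (population skewness m3 / m2^1.5 and kurtosis m4 / m2^2, where mk is the k-th central moment); statistical packages may apply slightly different sample corrections. The two small data sets are hypothetical.

```python
def central_moment(data, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    """Zero for symmetric data, positive for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data):
    """Equals 3 for a normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


def excess_kurtosis(data):
    """Kurtosis minus 3: >0 leptokurtic, about 0 mesokurtic, <0 platykurtic."""
    return kurtosis(data) - 3


symmetric = [1, 2, 3, 4, 5]            # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical data with one large value

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric set has zero skewness and a negative excess kurtosis (platykurtic), while the single large value in the second set pulls the skewness above zero.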

PERMUTATION AND COMBINATION

Permutation and combination are the ways to represent a group of objects by
selecting them in a set and forming subsets. They define the various ways to arrange
or select from a certain group of data.
When items are simply selected from a group, without regard to order, the selection
is called a combination, whereas an arrangement in which the order matters is called
a permutation.

PERMUTATIONS

Permutations are the different ways in which a collection of items/objects/data can
be arranged.
Permutation relates to the act of arranging all the members of a set into some
sequence or order. In other words, if the set is already ordered, then the rearranging
of its elements is called the process of permuting
For example:
The letters A, B and C can be arranged in 6 different ways: ABC, ACB, BCA, CBA,
CAB and BAC.
The same rule applies while solving any problem in permutations.
The number of ways in which n things can be arranged, taken all at a time, is
nPn = n!, called ‘n factorial’.
Factorial formula
Factorial of a number n is defined as the product of all the whole numbers from n
down to 1.

1. The factorial of 3, 3! = 3*2*1 = 6. Therefore, the number of ways in which
the 3 letters can be arranged, taken all at a time, is 3! = 3*2*1 = 6 ways

2. For 6 letters; 6! = 6*5*4*3*2*1 = 720 ways

When a letter occurs more than once in a word, divide the factorial of the number of
all letters in the word by the factorial of the number of occurrences of each
repeated letter.

1. “PHARMACY” has 8 letters; A is repeated two times. Hence

$$ \frac{8!}{2!} = \frac{8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2!}{2!} = 20160 $$

2. “MISSISSIPPI” has 11 letters, with “I” being repeated 4 times, “S” being
repeated 4 times and “P” being repeated 2 times

$$ \frac{11!}{4! \times 4! \times 2!} = \frac{11 \times 10 \times 9 \times 8 \times 7 \times 6 \times 5 \times 4!}{4! \times (4 \times 3 \times 2 \times 1) \times (2 \times 1)} = 34650 $$
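The rule for repeated letters can be sketched in code as follows. The function name `distinct_arrangements` is hypothetical; it divides the factorial of the word length by the factorial of each letter's count, which reproduces the two worked examples.

```python
from collections import Counter
from math import factorial


def distinct_arrangements(word):
    """Number of distinct orderings of the letters in word."""
    total = factorial(len(word))
    # Divide by (count)! for every letter; letters that occur once divide by 1.
    for count in Counter(word).values():
        total //= factorial(count)
    return total


for word in ("ABC", "PHARMACY", "MISSISSIPPI"):
    print(word, distinct_arrangements(word))
```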

Number of permutations of n things, taken r at a time, denoted by:

$$ {}^{n}P_{r} = \frac{n!}{(n-r)!} $$

Note: 1! = 1 (1 factorial = 1) and 0! = 1 (0 factorial = 1)



1 Find the number of words, with or without meaning, that can be formed
with the letters of the word ‘DRUGS’
‘DRUGS’ contains 5 letters.
Therefore, the number of words that can be formed with these 5 letters = 5!
= 5*4*3*2*1 = 120.

2 In how many different ways can 5 drug products be arranged on a shelf?
Number of ways = 5! = 5 x 4 x 3 x 2 x 1 = 120
OR, with n = 5 and r = 5:

$$ {}^{5}P_{5} = \frac{5!}{(5-5)!} = \frac{5!}{0!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{1} = 120 $$

3 In how many different ways can 3 items of medical equipment be arranged on a
pharmacy shelf from a group of 7 items?
Using the formula with n = 7 and r = 3:

$$ {}^{7}P_{3} = \frac{7!}{(7-3)!} = \frac{7!}{4!} = \frac{7 \times 6 \times 5 \times 4!}{4!} = 210 $$
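The permutation examples above can be verified in code. The sketch below writes the formula nPr = n! / (n − r)! as a small function (the name `n_p_r` is hypothetical) and compares it with `math.perm`, which is available in Python 3.8 and later.

```python
import math


def n_p_r(n, r):
    """Number of ordered arrangements of r items chosen from n: n! / (n - r)!"""
    return math.factorial(n) // math.factorial(n - r)


# 5 drug products arranged on a shelf, all at a time
print(n_p_r(5, 5))

# 3 items arranged on a shelf, chosen from a group of 7
print(n_p_r(7, 3))

# The standard library gives the same result
print(math.perm(7, 3))
```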

COMBINATIONS

This is the way of selecting items from a collection, or the different selections
possible from a collection of items, such that (unlike permutations) the order of
selection does not matter
The different selections possible from the letters A, B, C, taken 2 at a time, are
AB, BC and CA.
It does not matter whether we select A after B or B after A. The order of selection is
not important in combinations
A combination is the choice of r things from a set of n things without replacement
and where order does not matter.
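To find the number of combinations possible from a given group of items n, taken r
at a time, the formula, denoted by nCr, is

$$ {}^{n}C_{r} = \frac{{}^{n}P_{r}}{r!} = \frac{n!}{r! \, (n-r)!} $$

$$ {}^{n}C_{n} = 1, \qquad {}^{n}C_{0} = 1, \qquad {}^{n}C_{1} = n $$

$$ {}^{n}C_{r} = {}^{n}C_{n-r} $$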
For example, verifying the above example, the different selections possible from the
letters A, B, C, taken two at a time are

$$ {}^{3}C_{2} = \frac{3!}{2! \times (3-2)!} = \frac{3 \times 2!}{2! \times 1!} = 3 $$

possible selections (AB, BC, CA)

3. How many teams of 4 can be generated from a group of 12 pharmacy
professionals?
Note: This is a combination because the order in which they are picked does
not matter, hence nCr with n = 12 and r = 4

$$ {}^{n}C_{r} = \frac{n!}{r! \, (n-r)!} $$

$$ {}^{12}C_{4} = \frac{12!}{4! \times (12-4)!} = \frac{12!}{4! \times 8!} = \frac{12 \times 11 \times 10 \times 9 \times 8!}{4 \times 3 \times 2 \times 1 \times 8!} = \frac{11880}{24} = 495 $$
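The combination examples can be checked in the same way. The sketch below writes nCr = n! / (r! (n − r)!) as a function (the name `n_c_r` is hypothetical), compares it with `math.comb` (Python 3.8 and later), and lists the actual selections from A, B, C with `itertools.combinations`.

```python
import math
from itertools import combinations


def n_c_r(n, r):
    """Number of unordered selections of r items from n: n! / (r! * (n - r)!)"""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))


# Teams of 4 from 12 pharmacy professionals
print(n_c_r(12, 4), math.comb(12, 4))

# The selections of 2 letters from A, B, C (order does not matter)
pairs = ["".join(pair) for pair in combinations("ABC", 2)]
print(pairs)
```

Note that `combinations` lists the pair A, C as "AC" where the text writes "CA"; in a combination these are the same selection.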

4. Find the number of permutations and combinations if n = 12 and r = 2.

5. State the difference between combination and permutation

Permutation and combination applications


Determination of multimorbidity
Multimorbidity is the coexistence of multiple chronic conditions within an
individual, such as hypertension, diabetes, obesity and HIV, plus any other
conditions or patterns. Multimorbidity is linked to many adverse health outcomes
including more frequent and longer hospitalizations, potentially harmful
polypharmacy, higher healthcare costs, reduced quality of life or increased treatment
burden, and higher mortality. The number and type of unique combinations (unordered
patterns) and unique permutations (ordered patterns) of multimorbidity in a group of
patients can easily be determined
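As an illustration, the unordered and ordered patterns of two conditions drawn from the four conditions named above can be listed with Python's `itertools`. This is only a sketch: treating the ordered patterns as, for example, the order in which the conditions were diagnosed is an assumption made here for illustration.

```python
from itertools import combinations, permutations

conditions = ["hypertension", "diabetes", "obesity", "HIV"]

# Unordered patterns: which two conditions coexist
unordered_pairs = list(combinations(conditions, 2))

# Ordered patterns: which condition comes first and which second
# (assumed here to mean order of diagnosis)
ordered_pairs = list(permutations(conditions, 2))

print(len(unordered_pairs), "combinations")
print(len(ordered_pairs), "permutations")
for pair in unordered_pairs:
    print(" + ".join(pair))
```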
Scientific discovery
For certain types of knowledge discovery problems, the generation of combinatorial
sequences may become necessary in the process of yielding candidate solutions, for
example through different simulations of projects or experiments
Probability
Combinations are used in the calculation of probability; refer to the binomial
probability distribution

INTRODUCTION TO PROBABILITY

DEFINITIONS

Probability is a mathematical tool used to study randomness. It deals with the
chance (the likelihood) of an event occurring. A probability is a number expressed
as either a decimal, fraction or percentage, and its value is a measure of the
likelihood of an event occurring.
Probability is defined as the relative frequency of occurrence with which an event
is, on average, expected to occur
Probability indicates the chance of occurrence of an event or outcome; the different
probability laws can help to calculate and actually predict the chance of occurrence,
such as the chances of survival after a bite by a rabid dog, the chance of giving
birth to a boy or girl, one drug working better than another etc
Definition of terms
Experiment
This is a planned operation carried out under controlled conditions that can produce
well-defined outcomes or an act that can be repeated under certain conditions
Random experiment
This is an experiment where the exact outcome cannot be predicted. This is
performed by trial such as throwing a die or tossing a coin etc.
Outcome: a result of an experiment. A single outcome is also called a simple event.
Event: a set of one or more outcomes of an experiment carried out under set
conditions; denoted as E. Events can be dependent or independent.
▪ Independent events: if the probability of one event remains unaffected by the
occurrence of another event.
▪ Dependent events: if the occurrence of one event affects the occurrence of
other events.
Sample space: this is the set of all possible results or outcomes of a random
experiment. For example, in throwing a die the possible results are
{1, 2, 3, 4, 5, 6}. An event is a subset of the sample space. The sample space is
usually written, or illustrated, using one of the following:
▪ A list of all the possible outcomes written inside a set
▪ A sample space diagram, or
▪ A Venn diagram.
Mutually exclusive events: this is when the occurrence of one event prevents the
occurrence of any other event; such events cannot occur at the same time

Equally likely or symmetrical events: A set of events are said to be equally likely if
none has any preference of occurrence over the other thus all have the same chances
of occurrence or when there is no reason to expect the happening of one event in
preference to the other.
For example; when an unbiased coin is tossed the chances of getting a head or a tail
are the same.
Exhaustive events: All the possible outcomes of the experiments are known as
exhaustive events.
Favorable events: The outcomes which make necessary the happening of an event in
a trial are called favorable events. For example; if two dice are thrown, the number
of favorable events of getting a sum 5 is four, i.e., (1, 4), (2, 3), (3, 2) and (4, 1).
Simple events: This is when an event cannot be narrowed down into simpler events.
For example, during diagnosis, the presence of either diabetes or hypertension alone
is a simple event
Compound events: This is when an event can further be disintegrated or narrowed
into simpler events. For example, during diagnosis, the presence of both diabetes and
hypertension is a compound event
Conditional probability: This is the probability that an event occurs given that
another has already occurred
Joint probability: This is probability of the intersection of two events or the
probability that a subject picked at random from a group of subjects possesses two
characteristics at the same time
Law of large numbers: If an experiment is repeated again and again, the probability
of an event obtained from the relative frequency approaches the actual or theoretical
probability
Marginal or simple probability: This is the probability of a single event without
consideration of any other event
Subjective probability: This is the probability assigned to an event based on
subjective judgement, experience, belief and information
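The law of large numbers defined above can be illustrated with a simple simulation of tossing an unbiased coin. This is a sketch only; the function name and the seed are hypothetical, and the exact relative frequencies printed depend on the random number generator.

```python
import random


def relative_frequency_of_heads(n_tosses, seed=0):
    """Toss a fair coin n_tosses times and return the proportion of heads."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# As the number of tosses grows, the relative frequency tends to settle near 0.5
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With few tosses the relative frequency can be far from the theoretical probability of 0.5; with many tosses it tends to lie close to it.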
Probability of occurrence of an event:
Notation for probability
Probability is expressed by the symbol ‘p’. A, B and C denote specific events
▪ P(A) denotes probability of event A occurring
▪ P(B) denotes probability of event B occurring
▪ P(C) denotes probability of event C occurring
Probability ranges from zero (0) to one (1), and the sum of the probabilities of all
possible outcomes of an experiment is equal to 1. A probability is never negative
and never greater than 1
When p = 0, it means there is no chance of an event happening or its occurrence is
impossible.
If p = 1, it means the chances of an event happening are 100%
If the probability of an event happening in a sample is p and that of not happening is
denoted by the symbol q, then p + q = 1 or q = 1 – p

If a random experiment has x possible mutually exclusive, exhaustive and equally
likely outcomes, of which y are favourable to an event A, then the probability of
occurrence of event A is calculated by the formula

$$ P(A) = \frac{\text{Number of outcomes favourable to the event } (y)}{\text{Total number of exhaustive outcomes } (x)} $$

Arithmetically, the probability (p) or chance of occurrence of a positive event is

$$ p = \frac{\text{Number of events occurring}}{\text{Total number of trials}} $$

1. Probability of getting head or tail in one toss of coin; p = ½ or 0.5 and q = ½


or 0.5

2. From a pack of cards, the probability of drawing any one of 4 aces in one
attempt from pack of 52 cards;
4 1
P= = .
52 13
From a pack of cards, the probability of not drawing any one of 4 aces in
one attempt from pack of 52 cards
1 12
q= 1-p = 1- ; q= .
13 13

3. The chance of getting male or female child in one pregnancy are fifty-fifty
or half and half; p = ½ or 0.5 and q = ½ or 0.5

4. In a certain village, if twins were born twice out of 126 different pregnant
women;
P = Number of twin births / Total number of pregnant women = 2/126 = 1/63
The probability of only single birth q = 1 - 1/63 = 62/63

5. Out of 556 different drugs in a pharmacy, profits are made on only 440 drugs.
Calculate the probability of picking a profit-making drug during dispensing

6. At cancer institute; only 4 cancer patients died out of 240 patients in the
month of March. Calculate the probability of survival and death of cancer
patients in the month of March

7. The results from a research indicate; out of 1038 randomly selected adults,
52 believe that second-hand smoke is not at all harmful. Calculate the
probability of selecting a person who believes that second-hand smoke is
not at all harmful

8. If the probability of being rhesus negative is 1/10, what is the probability of
being rhesus positive?
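The worked examples above can be checked with a minimal sketch. The helper name `probability` is hypothetical (it is not part of the handbook); exact fractions are used so that results such as 1/13 appear unrounded.

```python
from fractions import Fraction


def probability(favourable, total):
    """Return P = favourable outcomes / total outcomes as an exact fraction."""
    if total <= 0 or not 0 <= favourable <= total:
        raise ValueError("need total > 0 and 0 <= favourable <= total")
    return Fraction(favourable, total)


p_ace = probability(4, 52)      # drawing one of the 4 aces from 52 cards
q_ace = 1 - p_ace               # complement: not drawing an ace
p_twins = probability(2, 126)   # twin births among 126 pregnancies
q_twins = 1 - p_twins           # single births

print(p_ace, q_ace, p_twins, q_twins)
```

The same two lines (p, then q = 1 - p) can be reused for the practice questions 5 to 8.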
LAWS OF PROBABILITY

Different laws are used in explanation of probability; these include;


1. Addition law of probability
2. Multiplication law of probability
3. Binomial law of probability distribution
4. Probability from shape of normal distribution or normal curve
5. Probability of calculated values from tables
Explained
a) Addition law of probability
In mutually exclusive events; the probability of occurrence of one event excludes the
occurrence of the other.
These include birth of male excludes birth of female, birth of rhesus negative baby
excludes birth of rhesus positive baby, getting tail on tossing a coin etc. The word
“or” is used between the events
Such events will occur in one of ways such as – getting head or tail in tossing, birth
can either be male or female, blood groups will be A, B, AB, or O.
All the events will have an individual probability or relative frequency of
occurrence; and the total probability will be equal to the sum of individual
probabilities provided the events are mutually exclusive
Mutually exclusive events follow the addition law of probability
The addition law of probability definition; if two events A and B are mutually
exclusive then the probability of occurrence of either event A or event B is
obtained by adding the probabilities of the two events. That is P(A+B) = P(A) + P(B) =
P(A ∪ B); the “∪” symbol signifies union
This means Probability of A plus B equal to
=Probability of A plus Probability of B
=Probability of A union B

If there are n mutually exclusive and exhaustive events with individual probabilities
p1, p2, ..., pn, then the total probability P is calculated as P = p1 + p2 + ... + pn = 1
6. The probability of getting male or female child is ½. The total probability
is ½ + ½ = 1
7. In one cut, the chance of getting the king of hearts is 1/52 and of getting any of the
four kings will be;
= 1/52 + 1/52 + 1/52 + 1/52 = 4/52 = 1/13
Addition law when events are not mutually exclusive; if events A and B are not
mutually exclusive, then the probability of at least one of the events A and B is
calculated as
P(A or B)=P(A+B)=P(A∪B) = P(A)+P(B)-P(AB)
This means
Probability of A plus B = Probability of A plus probability of B – Probability of
both A and B

8. You are provided with the following information about the disease conditions
of patients (diabetic and hypertensive patients)
                     Diabetic   Non-diabetic   Total
Hypertensive            48          74          122
Non-hypertensive       188          80          268
Total                  236         154          390
a) Probability of being diabetic
P = Number of diabetic patients / Total number of patients = 236/390 = 0.605
b) Probability of being non-hypertensive
P = 268/390 = 0.687
c) Probability of being both diabetic and hypertensive
P = 48/390 = 0.123
d) Probability of being either diabetic or hypertensive or both
P = (236 + 122 - 48)/390 = 310/390 = 0.795
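As a rough check of part (d), the addition law for events that are not mutually exclusive can be applied directly to the counts in the table. This is only a sketch; the variable names are illustrative assumptions.

```python
# Counts taken from the diabetic / hypertensive table.
total = 390
diabetic = 236        # column total
hypertensive = 122    # row total
both = 48             # diabetic and hypertensive

p_d = diabetic / total
p_h = hypertensive / total
p_d_and_h = both / total

# Addition law: P(D or H) = P(D) + P(H) - P(D and H)
p_d_or_h = p_d + p_h - p_d_and_h

print(round(p_d, 3), round(p_h, 3), round(p_d_and_h, 3), round(p_d_or_h, 3))
```

Subtracting P(D and H) prevents the 48 patients with both conditions from being counted twice.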

Multiplication law of probability


This law is applied to two or more events occurring together but they must not be
associated, i.e. must be independent of each other. The word “and” is used between
the events
Compound probability of two independent events A and B; P(AB) = P(A) * P(B) =
P(A∩B): the symbol “∩” signifies “intersection”. In words it means; the compound
probability of A and B equals
= Probability of A multiplied by Probability of B
= Probability of A intersection B

If events A and B are mutually exclusive: P(A and B) = 0. Such as getting both a head
and a tail in a single toss of a coin
1. If a die is thrown twice in succession, what will be the probability of
getting ‘3’ and ‘5’ or ‘5’ and ‘3’?
Note: Multiplication and addition laws both have to be applied in such cases,
as both the words ‘and’ and ‘or’ are used.
In the first throw the probability of getting ‘3’ is 1/6 and of getting ‘5’ in the second
throw is 1/6. P(getting ‘3’ in the first throw and ‘5’ in the second throw) =
1/6 x 1/6 = 1/36
In the first throw the probability of getting ‘5’ is 1/6 and of getting ‘3’ in the second
throw is 1/6. P(getting ‘5’ in the first throw and ‘3’ in the second throw) =
1/6 x 1/6 = 1/36
If either sequence, ‘3’ and ‘5’ or ‘5’ and ‘3’, is accepted = 1/36 + 1/36 = 2/36 = 1/18

2 The probability of having a female or male baby is ½. The probability of
having rhesus positive is 9/10. The probability of having twins is 1/63. What will
be the probability of having a child being male and rhesus positive?
Note: Sex of birth and Rh factor are independent events and occur in any
child

3 Probability of being color blind is 1 in 12 and of being male is 1/2.
Note: One cannot use the multiplication law
as color blindness and male sex are associated. Color blindness is found
far more often in male children than in female children
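The die example (question 1) can be verified by listing all 36 equally likely ordered results of two throws and comparing the count with the answer obtained from the multiplication and addition laws. This is a sketch for checking only, not a required method.

```python
from fractions import Fraction
from itertools import product

# All ordered results of two throws of a fair six-sided die.
outcomes = list(product(range(1, 7), repeat=2))

# Favourable results: '3' then '5', or '5' then '3'.
favourable = [o for o in outcomes if o in {(3, 5), (5, 3)}]
p_enumerated = Fraction(len(favourable), len(outcomes))

# Multiplication law within each sequence, addition law between the sequences.
sixth = Fraction(1, 6)
p_by_laws = sixth * sixth + sixth * sixth

print(p_enumerated, p_by_laws)
```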
PROBABILITY TREE DIAGRAMS

This shows all the possible events. The first event is represented by a dot. From the
dot, branches are drawn to represent all possible outcomes of the event. The
probability of each outcome is written on its branch
There are two forms
▪ Probability with replacement
▪ Probability without replacement
Steps in tree diagrams
1. Draw the Probability Tree Diagram and write the probability of each branch
2. Look for all the available paths (branches) of a particular outcome
3. Multiply along the branches and add vertically to find the probability of the
outcome as shown below in the examples
Theoretical probability = Number of successful outcomes of wanted item / Total possible outcomes

Probability with replacement


This is used to determine the probability of an outcome if the item is selected and
returned back to the same sample space. Because the selected item is returned, the number
of elements of the sample space remains unchanged for the next selection

1. A bag contains 3 black balls and 5 white balls. A ball is picked at
random from the bag and replaced back in the bag. The balls are mixed
in the bag and then another ball is picked at random
a) Draw a probability tree

b) Calculate the probability that
i. Two black balls are picked
P (Two black balls) = P(B) x P(B) = 3/8 x 3/8 = 9/64
ii. A black ball is picked first and a white ball second
P (Black then White ball) = P(B) x P(W) = 3/8 x 5/8 = 15/64
iii. A black ball is picked in the second draw
There are two possible outcomes P(B, B) or P(W, B)
P(Two black balls) + P(White ball then Black ball)
= (3/8 x 3/8) + (5/8 x 3/8) = 9/64 + 15/64 = 24/64 = 3/8
2. A box contains 8 blue and 4 red balls. A ball is drawn at random and then
replaced. A second ball is then also drawn at random. Calculate the
probability of getting
a) At least one blue: hint P(B, B) or P(B, R) or P(R, B)
b) One red and one blue: hint P(B, R) or P(R, B)
c) Two of the same color: hint P(B, B) or P(R, R)
3. 16 capsules are added to a tin in a Pharmacy, with 7 white and the rest being
green. A capsule is picked at random and returned, then a second pick is
made
Find the probability
a) Both capsules are green
b) One capsule is white and another is green
4. There are 7 blue tablets and 5 yellow tablets in a small container, all used
in treatment of the same condition but from different manufacturers. A tablet
is drawn three times from the container, replacing it each time. Draw a
probability tree and find the probability of picking
a) All blue b) Two blue and one yellow c) blue, yellow and blue
d) blue and two yellow e) All yellow
Probability without replacement
After one element is picked from the container, it is not replaced. The sample space
for the next picking will be less by 1
For example, if a container has 13 elements and 1 is picked and not replaced, the
sample space will be 12 for the next picking
1. A container consists of 21 balls; 12 are green and 9 are blue. Two
balls are picked at random without replacement.

Note:
After 1 green is picked, 20 balls remain: 11 green and 9 blue
After 1 blue is picked, 20 balls remain: 12 green and 8 blue
Find the probability that
a) both balls are blue
P(Blue and Blue) = P(B) x P(B)
= 9/21 x 8/20 = 72/420 = 6/35
b) One ball is blue and one ball is green.
c) If a third ball is drawn, find the probability it is
i) green
ii) At least 1 is blue (Hint elaborate the tree diagram to third draw)
2. 16 capsules are added to a tin in a Pharmacy, with 7 white and the rest being
green. A student picks two capsules at random without replacement.
Find the probability
a) Both capsules are green
b) One capsule is white and another is green
3. A patient has a bag containing 7 yellow lozenges and 5 red lozenges. He
eats one lozenge at a go and then a second one after some time
Find the probability that the patient eats
a) Yellow first and a red lozenge second
b) Two red lozenges
c) Two lozenges with the same color
4. There are 7 blue tablets and 5 yellow tablets in a small container, all used
in treatment of the same condition but from different manufacturers. A tablet
is drawn three times from the container without replacing it. Draw a
probability tree and find the probability of picking
a) All blue b) Two blue and one yellow c) blue, yellow and blue
d) blue and two yellow e) All yellow
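The "multiply along the branches, add between paths" rule for tree diagrams can be sketched as a small function that handles both forms, with and without replacement. The function name `path_probability` and its parameters are hypothetical and only for illustration.

```python
from fractions import Fraction


def path_probability(counts, path, replace):
    """Multiply the branch probabilities along one path of a tree diagram.

    counts  -- dict mapping colour to the number of items in the container
    path    -- colours drawn, in order
    replace -- True if each item is returned before the next draw
    """
    remaining = dict(counts)          # copy so the caller's dict is unchanged
    prob = Fraction(1)
    for colour in path:
        total = sum(remaining.values())
        prob *= Fraction(remaining[colour], total)
        if not replace:
            remaining[colour] -= 1    # sample space shrinks by 1
    return prob


# With replacement: bag of 3 black and 5 white balls.
bag = {"black": 3, "white": 5}
two_black = path_probability(bag, ["black", "black"], replace=True)
black_second = (path_probability(bag, ["black", "black"], replace=True)
                + path_probability(bag, ["white", "black"], replace=True))

# Without replacement: container of 12 green and 9 blue balls.
container = {"green": 12, "blue": 9}
both_blue = path_probability(container, ["blue", "blue"], replace=False)

print(two_black, black_second, both_blue)
```

For questions with several acceptable paths (for example "one red and one blue"), the path probabilities are computed one at a time and then added.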
b) Binomial law of probability distribution
Binomial distribution summarizes the number of trials, or observations when each
trial has the same probability of attaining one particular value. Most variables show
a particular pattern of frequency distribution and are depicted by standard deviation
and binomial distribution. The binomial distribution determines the probability of
observing a specified number of successful outcomes in a specified number of trials.
A binomial experiment is a statistical experiment that has the following properties:
▪ The experiment consists of n repeated trials.
▪ Each trial can result in just two possible outcomes. We call one of these
outcomes a success and the other, a failure.
▪ The probability of success, denoted by P, is the same on every trial.
▪ The trials are independent; that is, the outcome on one trial does not affect
the outcome on other trials.
The expected value, or mean, of a binomial distribution, is calculated by multiplying
the number of trials by the probability of successes.
Such as when you flip a coin 2 times and count the number of times the coin lands
on heads. This is a binomial experiment because: the experiment consists of
repeated trials, each trial can result in just two possible outcomes - heads or tails, the
probability of success is constant - 0.5 on every trial and the trials are independent;
that is, getting heads on one trial does not affect whether we get heads on other trials
The population under observation can be divided into two distinct groups like male
and female, live birth and still birth
When two children are born one after the other, the possible sequences will be any
of the following four: (1) male, male; (2) male, female; (3) female, male; (4) female, female
The probability of getting male or female child is ½. The word “and” is used
between the events implies multiplication

Chances of getting 2 males (first sequence): ½ x ½ = ¼ = ¼ x 100 = 25%


Chances of getting 2 females (fourth sequence): ½ x ½ = ¼ = ¼ x 100 = 25%
Chances of getting one of either sex will be total of second and third sequence
Second sequence = ½ x ½ = ¼
Third sequence = ½ x ½ = ¼
Total = ¼ + ¼ = ½ = ½ x 100 = 50%
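So, if a female child is born first and a second child is desired, the probability of the
second child being female is still ½ (50%) and of being male ½ (50%), because each birth is
independent of the previous one. The 25% figure (2 females = ¼) is the probability, before
either birth, that both children will be female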
Similarly, when three children are born the possible sequences will be any one of
the 8 given below (M = male, F = female):
1. MMM  2. MMF  3. MFM  4. MFF  5. FMM  6. FMF  7. FFM  8. FFF

The probability of getting male or female child is ½.


Probability of getting 3 males (first sequence) = ½ x ½ x ½ = 1/8
Probability of getting 3 females (eighth sequence) = ½ x ½ x ½ = 1/8
Probability of getting 2 males and 1 female (sum of sequence numbers 2, 3 and 5) =
1/8 + 1/8 + 1/8 = 3/8
Probability of getting 1 male and 2 females (sum of sequence numbers 4, 6 and 7) =
1/8 + 1/8 + 1/8 = 3/8
The total probability = 1/8 + 1/8 + 3/8 + 3/8 = 8/8 = 1
In case of 3 siblings, the probabilities of the four combinations will be:
Combinations             Probability (fraction / decimal / percentage)
Three females            1/8 = 0.125 = 12.5%
Three males              1/8 = 0.125 = 12.5%
Two females one male     3/8 = 0.375 = 37.5%
Two males one female     3/8 = 0.375 = 37.5%
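The chance of getting 2 of one sex and one of the opposite sex = 37.5 + 37.5 = 75%
Before any child is born, the chance that at least one of the three will be male is
100 – 12.5 = 87.5%, because the probability of all three being females is only 12.5%.
But if the first 2 are already female children and the third is desired to be male, the
chance for that third birth is still 50%, because each birth is independent of the earlier ones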
Question 2
Find the probability that when a couple has 3 children, they will have exactly 2
boys. Assume that boys and girls are equally likely and that the gender of any child
is not influenced by the gender of any other child.
From the above
P(2 boys in 3 births) = 1/8 + 1/8 + 1/8 = 3/8 = 0.375
BINOMIAL LAW OF PROBABILITY DISTRIBUTION
From the above, p = the probability of a ‘success’, q = probability of ‘failure or not
happening’. Thus p + q = 1. In child birth; getting a boy (value of p) or a girl (value
of q)
Binomial law of probability distribution is formed by the terms of the expansion of
the binomial expression (p + q)ⁿ where n = sample size or number of events
Examples
In child birth; getting a boy (value of p) or a girl (value of q)
When n = 2, the terms of the expansion of (p + q)² are p², 2pq and q²
(p + q)² = (p + q)(p + q) = p² + 2pq + q²
This means p² is the probability of getting 2 boys, q² of 2 girls and 2pq of one
boy and one girl
When n = 3, the terms of the expansion of (p + q)³ are p³, 3p²q, 3pq² and q³
(p + q)³ = p³ + 3p²q + 3pq² + q³. This means 3 boys (p³), 3 girls (q³), 1 boy and 2 girls
(3pq²), and 2 boys and 1 girl (3p²q)
When n = 4, the terms of the expansion of (p + q)⁴ are p⁴, 4p³q, 6p²q², 4pq³ and q⁴
(p + q)⁴ = p⁴ + 4p³q + 6p²q² + 4pq³ + q⁴

Examples
For the next two questions; after research in a certain village it was found that 65%
of children born were boys (p = 0.65). So girls (q) = 1 - 0.65 = 0.35

1 What are the chances of getting 2 boys, 2 girls, or one girl and one boy
after two pregnancies?
Number of pregnancies is 2; there are 3 possible outcomes
(p + q)² = p² + 2pq + q²; this means p² is the probability of getting 2
boys, q² of 2 girls and 2pq of one boy and one girl
(p + q)² = (0.65)² + (2 x 0.65 x 0.35) + (0.35)²
= 0.4225 + 0.455 + 0.1225
Probability of getting
▪ 2 boys (p²) = 0.4225 or 42.25%
▪ 2 girls (q²) = 0.1225 or 12.25%
▪ One boy and one girl (2pq) = 0.455 or 45.5%
2 What are the chances of getting 3 boys, 3 girls, 1 boy and 2 girls, or 2
boys and 1 girl after three pregnancies?
From (p + q)³ = p³ + 3p²q + 3pq² + q³; this means 3 boys (p³), 3 girls (q³), 1 boy
and 2 girls (3pq²), and 2 boys and 1 girl (3p²q)
(p + q)³ = (0.65)³ + (3 x 0.65² x 0.35) + (3 x 0.65 x 0.35²) + (0.35)³
= 0.2746 + 0.4436 + 0.2389 + 0.0429
The probability of getting
▪ 3 boys (p³) = 0.2746 or 27.46%
▪ 3 girls (q³) = 0.0429 or 4.29%
▪ 1 boy and 2 girls (3pq²) = 0.2389 or 23.89%
▪ 2 boys and 1 girl (3p²q) = 0.4436 or 44.36%
3 In four projects, 20% of children under 8 years of age were found to be
severely malnourished. If only 4 children were selected at random from
the four projects, what is the probability that 4, 3, 2, 1 or none of the selected
children are severely malnourished?
Let p be the probability of a severely malnourished child and q of a non-malnourished
(healthy) child
Thus; p = 0.2 and q = 1 - 0.2 = 0.8, n = 4
(p + q)⁴ = p⁴ + 4p³q + 6p²q² + 4pq³ + q⁴
The probability of getting
▪ All 4 severely malnourished (p⁴) = (0.2)⁴ = 0.0016 or 0.16%
▪ Only 3 severely malnourished (4p³q) = 4 x 0.2³ x 0.8 = 0.0256 or 2.56%
▪ Only 2 severely malnourished (6p²q²) = 6 x 0.2² x 0.8² = 0.1536 or 15.36%
▪ Only 1 severely malnourished (4pq³) = 4 x 0.2 x 0.8³ = 0.4096 or 40.96%
▪ None malnourished (all healthy children) (q⁴) = (0.8)⁴ = 0.4096 or 40.96%
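Each term of the expansion of (p + q)ⁿ is the binomial coefficient multiplied by the matching powers of p and q. A minimal sketch of this, using Python's standard `math.comb` (the function name `binomial_term` is an illustrative assumption), reproduces the terms of the malnutrition example:

```python
from math import comb


def binomial_term(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)


n, p = 4, 0.2   # 4 children selected, 20% severely malnourished
terms = {k: binomial_term(k, n, p) for k in range(n, -1, -1)}

for k, value in terms.items():
    print(f"{k} severely malnourished: {value:.4f}")
```

The terms always add up to 1, which is a quick check that no combination has been left out.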
c) Probability from shape of normal distribution or normal curve
A normal distribution is sometimes called the bell curve because the graph of its
probability density looks like a bell
It is also known as the Gaussian distribution, after the German mathematician Carl
Friedrich Gauss
It is a distribution that occurs naturally in many situations, especially in
continuous data such as heights of people. The bell curve is symmetrical. Half of the
data will fall to the left of the mean; half will fall to the right.
The standard deviation controls the spread of the distribution. A smaller standard
deviation indicates that the data is tightly clustered around the mean; the normal
distribution will be taller. A larger standard deviation indicates that the data is
spread out around the mean; the normal distribution will be flatter and wider.
The area under the normal distribution curve represents probability and the total area
under the curve sums to one
Most of the continuous data values in a normal distribution tend to cluster around
the mean, and the further a value is from the mean, the less likely it is to occur. The
tails are asymptotic, which means that they approach but never quite meet the
horizon (i.e. x-axis).
For a perfectly normal distribution the mean, median and mode will be the same
value, visually represented by the peak of the curve.
What is the empirical rule formula?
The empirical rule in statistics allows researchers to determine the proportion of
values that fall within certain distances from the mean. The empirical rule is often
referred to as the three-sigma rule or the 68-95-99.7 rule.
If the data values in a normal distribution are converted to standard score (z-score)
in a standard normal distribution the empirical rule describes the percentage of the
data that fall within specific numbers of standard deviations (σ) from the mean (X)
for bell-shaped curves.
The empirical rule allows researchers to calculate the probability of randomly
obtaining a score from a normal distribution
▪ 68% of data falls within the first standard deviation from the mean. This
means there is a 68% probability of randomly selecting a score between -1
and +1 standard deviations from the mean.
▪ 95% of the values fall within two standard deviations from the mean. This
means there is a 95% probability of randomly selecting a score between -2
and +2 standard deviations from the mean.
▪ 99.7% of data will fall within three standard deviations from the mean. This
means there is a 99.7% probability of randomly selecting a score between -3
and +3 standard deviations from the mean.
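The 68-95-99.7 figures are rounded values of the area under the standard normal curve between -k and +k standard deviations. As a sketch, that area can be computed from the error function in Python's standard library; the helper name `prob_within` is an assumption for illustration.

```python
from math import erf, sqrt


def prob_within(k):
    """P(-k < Z < k) for a standard normal variable Z."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```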
d) Probability of calculated values from tables
Probability of calculated values occurring by chance in case of ‘t’ and χ2 (Chi-
square) is determined by referring to the respective tables. The tables are found in
textbooks
CONDITIONAL PROBABILITY: BAYES’ THEOREM

Thomas Bayes was an English statistician and philosopher who is known for
formulating a specific case of the theorem that bears his name. Bayesian methods stem
from the principle
of linking prior (before conducting experiment) probability and conditional
probability (likelihood) to posterior (after conducting experiment) probability via
Bayes’ rule.
Bayes’ theorem provides the actual probability of an event given the information
about tests.
▪ Events are different from tests. For example, having a disease is an event, while
a positive result is the outcome of a test for that disease
▪ Tests are flawed: just because there is a positive test does not mean the disease is
actually present. Many tests have a high false positive rate.
Bayes’ theorem takes the test results and calculates the real probability that
the event has occurred given what the test has indicated.
Conditional Probability
This is the probability of one event occurring with some relationship to one or more
other events.

For example:
If A and B are two events, then the conditional probability of A given B is written as
P(A | B); read as “the probability of A given that B has already occurred.”
The probability of A given B
P(A|B) = P(A and B) / P(B) = P(A ∩ B) / P(B); where P(B) > 0

The probability of B given A
P(B|A) = P(A and B) / P(A) = P(A ∩ B) / P(A); where P(A) > 0

Where:
▪ P(A|B) – the conditional probability; the probability of event A occurring
given that event B has already occurred
▪ P(A ∩ B) – the joint probability of events A and B; the probability that both
events A and B occur
▪ P(B) – the probability of event B
Conditions for the independence of two events A and B
P(A|B) = P(A)
P(B|A) = P(B)
P(A and B) = P(A) x P(B)
In a group of 100 patients at the Pharmacy, 40 bought Coartem, 30
purchased Paracetamol, and 20 purchased both Coartem and Paracetamol. If a
patient chosen at random bought Coartem, what is the probability they
also bought Paracetamol?

Note: 40 + 30 - 20 = 50 patients bought at least one of the two drugs, so 50 patients
bought neither Coartem nor Paracetamol. Let A represent
patients who bought Coartem and B those who bought Paracetamol
P(A) = 40/100 = 0.4
P(A∩B) = 20/100 = 0.2
P(B|A) = P(A ∩ B) / P(A) = 0.2/0.4 = 0.5 or 50%
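The Coartem and Paracetamol example can be reproduced as a short sketch, which also tests the independence condition P(A and B) = P(A) x P(B). Variable names are illustrative only.

```python
from fractions import Fraction

n = 100            # patients in the group
coartem = 40       # event A
paracetamol = 30   # event B
both = 20          # A and B

p_a = Fraction(coartem, n)
p_b = Fraction(paracetamol, n)
p_a_and_b = Fraction(both, n)

p_b_given_a = p_a_and_b / p_a   # P(B|A)
p_a_given_b = p_a_and_b / p_b   # P(A|B)

# Independence would require P(A and B) == P(A) x P(B).
independent = p_a_and_b == p_a * p_b

print(p_b_given_a, p_a_given_b, independent)
```

Here P(B|A) differs from P(B), so under these counts the two purchases are not independent.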

Evaluating screening test


Screening tests are laboratory tests that are used to detect particular markers of a
specific disease. The screening tests are often used in clinical practice to assess the
likelihood that a person has a particular medical condition. The rationale is that, if
disease is identified early (before the manifestation of symptoms), then earlier
treatment may lead to cure or improved survival or quality of life.
There may be false-negative and false-positive results after a screening test,
which can alter the management of a patient
The probability of disease after learning the results of a test is called the post-test
probability of disease; and may be calculated by using Bayes' theorem.
Consider the table below; it shows n total subjects with regard to a disease and
the results from a screening test
Test result      Present (D+)               Absent (D-)                Total
Positive (T+)    True positive (TP): a      False positive (FP): b     a+b
Negative (T-)    False negative (FN): c     True negative (TN): d      c+d
Total            a+c                        b+d                        n
Terms used
▪ Bayes' theorem (in screening test): an algebraic expression for calculating
the posttest probability of disease if the pretest probability of disease and the
sensitivity and specificity of a test are known.
▪ True positive (TP) results: the test indicates a positive status when the true status is
positive
▪ False positive (FP) results: the test indicates a positive status when the
true status is negative
▪ True negative (TN) results: the test indicates a negative status when the true status is
negative
▪ False negative (FN) results: the test indicates a negative status when the true
status is positive.
▪ Sensitivity of the test (true-positive rate, or TPR): this represents the
likelihood of a positive test in a diseased person or the conditional
probability of a positive test result given the presence of disease P(T+/D+)
= number of diseased patients with positive test / number of diseased patients = a/(a+c)
▪ Specificity of the test (true-negative rate, abbreviated TNR): the likelihood of
a negative test result in a patient without disease or the conditional
probability of a negative test result given the absence of disease P(T-/D-)
= number of nondiseased patients with negative test / number of nondiseased patients = d/(b+d)
▪ Accuracy of a test: the probability of a true result
= (a+d)/n
▪ Predictive value positive of a screening test: this is the conditional
probability that a subject or patient has a disease given that the subject or
patient has a positive screening test result. It can also be defined as the
probability that subjects with a positive screening test truly have the disease.
= TP/(TP+FP) = a/(a+b)
▪ Predictive value negative of a screening test: the conditional probability that
a subject or patient does not have a disease given that the subject or
patient has a negative screening test result.
It can also be defined as the probability that subjects with a negative
screening test truly don't have the disease.
= TN/(TN+FN) = d/(c+d)
▪ Prevalence: how much disease is present in a community.
= (a+c)/n
▪ Posttest probability: the probability of disease after the results of a test have
been learned (posterior probability, posttest risk).
▪ Pretest probability: the probability of disease before doing a test (prior
probability, pretest risk).
Examples

1. Consider a screening test for Down Syndrome. In pregnancy, women often
undergo screening to assess whether their fetus is likely to have Down
Syndrome. The screening test evaluates levels of specific hormones in the
blood. Suppose that a population of N=4,810 pregnant women undergo the
screening test and are scored as either positive or negative depending on the
levels of hormones in the blood. In addition, suppose that each woman is
followed to birth to determine whether the fetus was, in fact, affected with
Down Syndrome. The results of the screening tests are summarized below.
Screening test   Down syndrome   No Down syndrome   Total
Positive               9               351            360
Negative               1              4449           4450
Total                 10              4800           4810
Calculate the sensitivity, specificity, false positive fraction, false negative
fraction, positive predictive value and negative predictive value of the screening
test
Sensitivity: P(Screen positive / Affected fetus) = 9/10 = 0.9 ≈ 90% (If a woman is
carrying an affected fetus, there is a 90.0% probability that the screening test
will be positive)
Specificity: P(Screen negative / Unaffected fetus) = 4449/4800 = 0.927 ≈ 92.7% (If the
woman is carrying an unaffected fetus, there is a 92.7% probability that the
screening test will be negative)
False positive fraction: P(Screen positive / Unaffected fetus) = 351/4800 =
0.073 ≈ 7.3% (If a woman is carrying an unaffected fetus, there is a 7.3%
probability that the test will incorrectly come back positive. This is potentially
a serious problem as a positive test result would likely produce great anxiety
for the woman and her family)
False negative fraction: P(Screen negative / Affected fetus) = 1/10 = 0.1 ≈ 10%
(If a woman is carrying an affected fetus, there is a 10.0% probability that the
test will come back negative, and the woman and her family might feel a false
sense of assurance that the fetus is not affected when, in fact, the screening
test missed the abnormality)
Positive predictive value: P(Affected fetus / Screen positive) = 9/360 =
0.025 ≈ 2.5% (If a woman screens positive, there is a 2.5% probability that she
is carrying an affected fetus)
Negative predictive value: P(Unaffected fetus / Screen negative) = 4449/4450 =
0.999 ≈ 99.9% (If a woman screens negative, there is a 99.9% probability that
she is carrying an unaffected fetus)
2. Suppose that a population of N=120 men over 50 years of age who are
considered at high risk for prostate cancer have both the prostate-specific
antigen (PSA) screening test and a biopsy. The PSA results are reported as
low, slightly to moderately elevated or highly elevated based on the following
levels of measured protein, respectively: 0-2.5, 2.6-19.9 and 20 or more
nanograms per milliliter. The biopsy results of the study are shown below.
PSA Level (Screening Test)                    Prostate Cancer   No Prostate Cancer   Totals
Low (0-2.5 ng/ml)                                    3                  61              64
Slight/Moderate Elevation (2.6-19.9 ng/ml)          13                  28              41
Highly Elevated (20 ng/ml or more)                  12                   3              15
Totals                                              28                  92             120
Calculate the predictive value that a man has prostate cancer given that he has low,
slightly to moderately elevated and highly elevated levels of PSA (Ans 0.047, 0.317 and
0.80 respectively)

3. The following information was obtained after research by a hospital on a
screening test for dementia. The test was used in 630 patients with symptoms
of dementia and 850 patients without symptoms. The results are shown below
Test result   Yes / Present   No / Absent   Total
Positive          548
Negative                          835
Total             630             850
Calculate the sensitivity, specificity, false positive fraction, false negative
fraction, positive predictive value and negative predictive value of the screening
test
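The screening-test measures defined earlier all come from the four cell counts a (TP), b (FP), c (FN) and d (TN). A minimal sketch, assuming a hypothetical helper named `screening_metrics`, applies them to the Down Syndrome table of example 1:

```python
def screening_metrics(a, b, c, d):
    """Return screening-test measures from a 2 x 2 table.

    a -- true positives, b -- false positives,
    c -- false negatives, d -- true negatives
    """
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "false_positive_fraction": b / (b + d),
        "false_negative_fraction": c / (a + c),
        "positive_predictive_value": a / (a + b),
        "negative_predictive_value": d / (c + d),
        "accuracy": (a + d) / n,
        "prevalence": (a + c) / n,
    }


down = screening_metrics(a=9, b=351, c=1, d=4449)
for name, value in down.items():
    print(f"{name}: {value:.4f}")
```

The same call can be reused for example 3 once the missing cells of the dementia table have been filled in from the row and column totals.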
THE LAW OF TOTAL PROBABILITY


If A1, A2, A3, …, An are n mutually exclusive and collectively exhaustive events and
event B is a subset of the union of A1, A2, …, An

Then; P(B) = P(A1 ∩ B) + P(A2 ∩ B) + P(A3 ∩ B) + … + P(An ∩ B)

But: P(A1 ∩ B) = P(A1) x P(B|A1)
P(A2 ∩ B) = P(A2) x P(B|A2)
P(A3 ∩ B) = P(A3) x P(B|A3)
P(B) = P(A1) x P(B|A1) + P(A2) x P(B|A2) + P(A3) x P(B|A3) + … + P(An) x P(B|An)
P(B) = ∑ P(Ai) x P(B|Ai), summed over i = 1, 2, 3, …, n; the law of total probability

Bayes theorem
From conditional probability;
P(A|B) = P(A ∩ B) / P(B)
:. P(A ∩ B) = P(A|B) x P(B) ------- i
P(B|A) = P(A ∩ B) / P(A)
:. P(A ∩ B) = P(B|A) x P(A) -------- ii
Equating i and ii
P(A|B) x P(B) = P(B|A) x P(A)
P(A|B) = P(B|A) x P(A) / P(B) ----------- iii

From total probability; P(B) = ∑ P(Ai) x P(B|Ai), summed over i = 1, 2, 3, …, n

BAYES’ THEOREM;
This is also known as Bayes’ rule or Bayes’ Law
P(Ai|B) = P(B|Ai) x P(Ai) / ∑ P(Ai) x P(B|Ai), with the sum taken over i = 1, 2, 3, …, n
P(A1|B) = P(B|A1) x P(A1) / [P(B|A1) x P(A1) + P(B|A2) x P(A2) + … + P(B|An) x P(An)]

In scenarios involving Bayes’ Theorem, always


▪ Prior probabilities are given - P(A1), P(A2), …, P(An)
▪ Conditional probabilities are given - P(B|A1), P(B|A2), …, P(B|An); B is the observed
event, which can occur together with any of the events A1, A2, …, An
▪ Question requires one to determine the posterior probability. Posterior
probability in Bayes’ theorem is the revised or updated probability of an
event after taking consideration of new information (event affecting others)
▪ Bayes’ theorem determines the probability of the cause which has led to a
problem
1. In order to manage credit history and risk at a whole sale Pharmacy, the
company rates the borrowers as lowest risk, medium risk and highest risk.
Risk means the chance that a borrower might fail to pay back. Based on
historical data; on average 30% customers are lowest risk, 60% rated medium
risk and 10% rated highest risk. After survey it was found out that 1% lowest
risk, 10% medium risk and 18% highest risk customers failed to pay back the
loan of products purchased on loan. If a customer was randomly picked from
defaulters list
a) What is the probability that they had received a lowest risk rating?
Let A1 - lowest risk customers, A2 - medium risk customers and A3 - highest
risk customers.
Defaulter – D (in Bayes’ theorem)
Based on the scenario, the question asks for P(Rating A1|Defaulter)
However;
P(Rating A1) = 30% = 0.3
P(Rating A2) = 60% = 0.6
P(Rating A3) = 10% = 0.1
P(Defaulter|Rating A1) = 1% = 0.01
P(Defaulter|Rating A2) = 10% = 0.1
P(Defaulter|Rating A3) = 18% = 0.18
Bayes’ Theorem
P(Rating A1|Defaulter) = P(D|A1) x P(A1) / ∑ P(Ai) x P(D|Ai), with the sum taken over i = 1, 2, 3
P(D|A1) x P(A1) = 0.01 x 0.3 = 0.003
∑ P(Ai) x P(D|Ai) = P(A1) x P(D|A1) + P(A2) x P(D|A2) + P(A3) x P(D|A3)
= 0.3 x 0.01 + 0.6 x 0.1 + 0.1 x 0.18 = 0.003 + 0.06 + 0.018 = 0.081
P(Rating A1|Defaulter) = 0.003/0.081 = 0.037 = 3.7%
The probability that the customer picked was given the lowest risk rating = 3.7%
b) What is the probability that they had received a medium risk rating?
P(Rating A2|Defaulter) = P(D|A2) x P(A2) / ∑ P(Ai) x P(D|Ai), with the sum taken over i = 1, 2, 3
= (0.1 x 0.6)/0.081 = 0.06/0.081 = 0.741 = 74.1%
c) What is the probability that they had received a highest risk rating?
P(Rating A3|Defaulter) = P(D|A3) x P(A3) / ∑ P(Ai) x P(D|Ai), with the sum taken over i = 1, 2, 3
= (0.18 x 0.1)/0.081 = 0.018/0.081 = 0.222 = 22.2%
2. A certain virus infects one in every 400 people. A test used to detect the virus in a person is positive 85% of the time if the person has the virus and 5% of the time if the person does not have the virus. Consider a sample of 10,000 people
a) Find the probability that a person has the virus given that they have tested positive
The question can be attempted by using conditional probability or Bayes’ theorem
Let A be the event that the person has the virus and B the event that the person tests positive
Let A’ be the event that the person does not have the virus and B’ the event that the person tests negative
P(A) = 1/400 = 0.0025 (total people with virus = 0.0025 x 10 000 = 25)
P(A’) = (400 − 1)/400 = 0.9975 (total people with no virus = 0.9975 x 10 000 = 9975)
P(B|A) = 85% = 0.85 (test positive with the virus = 0.85 x 25 = 21.25)
P(B|A’) = 5% = 0.05 (test positive with no virus = 0.05 x 9975 = 498.75)
Note: The counts are expected values, so decimal points are used
              Positive B   Negative B’   Total
Virus A           21.25         3.75        25
No virus A’      498.75      9476.25      9975
Total            520         9480        10 000
The probability that a person has the virus given that they have tested positive
P(A|B) = 21.25 / 520 = 0.0409 = 4.09%
Use of Bayes’ theorem;
$$P(A|B) = \frac{P(A) \times P(B|A)}{P(A) \times P(B|A) + P(A') \times P(B|A')}$$
= (0.0025 x 0.85) / (0.0025 x 0.85 + 0.9975 x 0.05) = 0.0409 = 4.09%
b) Find the probability that a person does not have the virus given that they test negative
P(A’|B’) = 9476.25 / 9480 = 0.9996 = 99.96%
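As a cross-check of question 2, the same two probabilities can be computed directly from the three inputs, without building the table of counts. This is a minimal sketch; the variable names (`sensitivity`, `false_positive_rate`, `ppv`, `npv`) are labels introduced here for illustration.

```python
# Inputs from question 2
prior = 1 / 400              # P(A): person has the virus
sensitivity = 0.85           # P(B|A): positive test given virus
false_positive_rate = 0.05   # P(B|A'): positive test given no virus

# a) P(A|B): has the virus given a positive test
true_pos = prior * sensitivity
false_pos = (1 - prior) * false_positive_rate
ppv = true_pos / (true_pos + false_pos)

# b) P(A'|B'): no virus given a negative test
true_neg = (1 - prior) * (1 - false_positive_rate)
false_neg = prior * (1 - sensitivity)
npv = true_neg / (true_neg + false_neg)

print(f"P(virus | positive)    = {ppv:.4f}")
print(f"P(no virus | negative) = {npv:.4f}")
```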

Note: The following questions are incomplete


3. b) If a stethoscope failed, what is the probability that it had received quality score B
$$P(B|F) = \frac{P(F|B) \times P(B)}{P(F|A) \times P(A) + P(F|B) \times P(B) + P(F|C) \times P(C)}$$
Or
$$P(B|F) = \frac{P(F|B) \times P(B)}{P(F)}$$

4. In a manufacturing factory for drug packaging containers, machines A, B and C produce 25%, 35% and 40% of the containers respectively. Of these containers, 5%, 4% and 2% of those produced by A, B and C respectively are defective. If a container is drawn at random and found to be defective, what is the probability that it was manufactured by machine B

5. Given the following statistics,
▪ 1% of women over 50 have breast cancer
▪ 90% of women who have breast cancer test positive on mammogram
▪ 8% of women will have a false positive
What is the probability that a woman has cancer if she has a positive mammogram test
Let A represent women with breast cancer
Let A’ represent women without breast cancer
Let X represent a positive mammogram result
P(A) = 0.01 P(A’) = 0.99
P(X|A) = 0.9 P(X|A’) = 0.08
$$P(A|X) = \frac{P(A) \times P(X|A)}{P(A) \times P(X|A) + P(A') \times P(X|A')}$$

6. In a particular pain clinic, 10% of patients are prescribed narcotic painkillers. Overall, 5% of clinic patients are addicted to narcotics (including painkillers and illegal substances). Out of all patients prescribed painkillers, 8% are addicts. If a patient is an addict, what is the probability that they are prescribed painkillers
A – being prescribed painkillers; P(A) = 0.1
B – being an addict, overall (total probability); P(B) = 0.05
B|A – being an addict given prescribed painkillers; P(B|A) = 0.08
$$P(A|B) = \frac{P(A) \times P(B|A)}{P(B)}$$

7. A particular study showed that 12% of men will likely develop prostate cancer at some point. A man with prostate cancer has a 95% chance of a positive test result after screening. A man without prostate cancer has a 6% chance of getting a false positive test result. What is the probability that a man has cancer given he has a positive result (Answer: 0.683)
8. A factory has two machines I and II. Machine I produces 40% of items of the
output and Machine II produces 60% of the items. Further 4% of items
produced by Machine I are defective and 5% produced by Machine II are
defective. An item is drawn at random. If the drawn item is defective, find the
probability that it was produced by Machine II.
9. The chances of X, Y and Z becoming managers of a certain company are 4 : 2 : 3. The probabilities that a bonus scheme will be introduced if X, Y and Z become managers are 0.3, 0.5 and 0.4 respectively. If the bonus scheme has been introduced, what is the probability that Z was appointed as the manager?
Let A1, A2 and A3 be the events of X, Y and Z becoming managers of the company respectively. Let B be the event that the bonus scheme will be introduced.
We have to find the conditional probability P(A3|B).
Total ratio = 4 + 2 + 3 = 9
P(A1) = 4/9 P(A2) = 2/9 P(A3) = 3/9
P(B|A1) = 0.3 P(B|A2) = 0.5 P(B|A3) = 0.4
$$P(A_3|B) = \frac{P(A_3) \times P(B|A_3)}{P(A_1) \times P(B|A_1) + P(A_2) \times P(B|A_2) + P(A_3) \times P(B|A_3)}$$
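The substitution for question 9 can be sketched in Python using exact fractions, which shows how a ratio such as 4 : 2 : 3 is turned into prior probabilities. The variable names are illustrative; the numerical result is a computed value, not one stated in the handbook.

```python
from fractions import Fraction

# Chances of X, Y and Z becoming manager, given as the ratio 4 : 2 : 3
odds = {"X": 4, "Y": 2, "Z": 3}
# P(bonus scheme | manager): 0.3, 0.5 and 0.4 written as exact fractions
bonus_given_manager = {"X": Fraction(3, 10), "Y": Fraction(1, 2), "Z": Fraction(2, 5)}

# Convert the ratio into prior probabilities 4/9, 2/9 and 3/9
total_odds = sum(odds.values())
priors = {name: Fraction(share, total_odds) for name, share in odds.items()}

# Total probability that the bonus scheme is introduced, P(B)
p_bonus = sum(priors[name] * bonus_given_manager[name] for name in priors)

# Bayes' theorem: P(A3|B)
p_z_given_bonus = priors["Z"] * bonus_given_manager["Z"] / p_bonus
print(p_z_given_bonus, "=", round(float(p_z_given_bonus), 4))
```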

BINOMIAL PROBABILITY DISTRIBUTION

This is applied when a random process or experiment, called a trial, can result in only one of two mutually exclusive and collectively exhaustive outcomes, called a success and a failure. Examples include dead or alive; sick or well; boy or girl; pass or fail and others. Such a trial is referred to as a Bernoulli trial, named in honour of the Swiss mathematician Jacob Bernoulli (1654–1705)
A binomial experiment is a statistical experiment made up of repeated Bernoulli trials that satisfies the following four conditions:
▪ The experiment consists of n repeated trials, all performed under identical conditions
▪ Each trial results in one of two mutually exclusive and collectively exhaustive outcomes, called a success and a failure.
▪ The probability of success, denoted by p, and the probability of failure, denoted by q, where p + q = 1, remain constant for each trial
▪ The trials are independent; that is, the outcome of one trial does not affect the outcome of other trials.
Notation
The following notation is helpful in binomial probability.
▪ X: The number of successes that result from the binomial experiment in n trials, called the binomial random variable; x = 0, 1, 2, 3, ..., n
▪ n: The number of trials in the binomial experiment.
▪ p: The probability of success on an individual trial. P(X = x) gives the probability of x successes in n binomial trials
▪ q: The probability of failure on an individual trial. (This is equal to 1 - p.)
▪ n!: The factorial of n (also known as n factorial). The factorial of n is the product of all whole numbers from n down to 1; for example 4! = 4 x 3 x 2 x 1 = 24, and 0! = 1
▪ b(x; n, p): Binomial probability - the probability that an n-trial binomial experiment results in exactly x successes, when the probability of success on an individual trial is p.
▪ nCr: The number of combinations of n things, taken r at a time; nCx is the number of ways of choosing x successes out of n trials
Let "n" denote the number of observations or the number of times the process is
repeated, and "x" denotes the number of "successes" or events of interest occurring
during "n" observations. The probability of "success" or occurrence of the outcome
of interest is indicated by "p".
The probability distribution of the random variable X is called a binomial distribution, and is given by the formula:
$${}^{n}C_{x} = \frac{n!}{x!\,(n-x)!}$$ (from permutation and combination)
$$P(X = x) = {}^{n}C_{x}\; p^{x} q^{n-x} = \frac{n!}{x!\,(n-x)!}\; p^{x} q^{n-x}$$
Or
$$P(x \text{ successes}) = \frac{n!}{x!\,(n-x)!}\; p^{x} (1-p)^{n-x}$$
Where x = 0, 1, 2, 3, ..., n and q = 1 – p
The binomial distribution has the following properties:
▪ The mean value of binomial distribution (μ) = np
▪ The variance of binomial distribution (σ2) = np(1-p) or npq
▪ The standard deviation (σ) = √(np(1 − p))
1. Suppose that 80% of adults with allergies report symptomatic relief with a specific medication. If the medication is given to 10 new patients with allergies;
a) What is the probability that it is effective in exactly seven?
Number of observations; n = 10
Event of interest or success; x = 7
p = 80% = 0.8
$$P(X=7) = \frac{10!}{7!\,(10-7)!} \times 0.8^{7} \times (1-0.8)^{10-7}$$
$$P(X=7) = \frac{10 \times 9 \times 8 \times 7!}{7! \times 3 \times 2 \times 1} \times 0.2097 \times 0.008$$
$$P(X=7) = \frac{720}{6} \times 0.2097 \times 0.008 = 120 \times 0.2097 \times 0.008$$
P(X=7) = 0.2013
There is a 20.13% probability that exactly 7 of 10 patients will report relief from symptoms when the probability that any one reports relief is 80%.
b) What is the probability that none will report effectiveness of the drug?
x = 0
$$P(X=0) = \frac{10!}{0!\,(10-0)!} \times 0.8^{0} \times (1-0.8)^{10-0}$$
$$P(X=0) = \frac{10!}{10!} \times 1 \times 0.2^{10} = 0.0000001024$$
There is practically no chance that none of the 10 will report relief from symptoms when the probability of reporting relief for any individual patient is 80%.
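The binomial formula used in question 1 can be written as a small Python function. This is a sketch for checking hand calculations; `binomial_pmf` is a name chosen here, and `math.comb` (Python 3.8 or later) supplies the combination nCx.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p: nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Question 1: n = 10 patients, p = 0.8 probability of relief
p7 = binomial_pmf(7, 10, 0.8)   # a) exactly seven report relief
p0 = binomial_pmf(0, 10, 0.8)   # b) none report relief

print(f"P(X=7) = {p7:.4f}")
print(f"P(X=0) = {p0:.10f}")
```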

2. It was noticed in a hospital (FPRRH) that 90% of pregnancies had delivery in week 37 or later (full-term birth). If 5 birth records are randomly selected from the population, what is the probability that:
a) Exactly five of the records will be for full-term births?
n = 5, x = 5, p = 90% or 0.9
$$P(X=5) = \frac{5!}{5!\,(5-5)!} \times 0.9^{5} \times (1-0.9)^{5-5}$$
$$P(X=5) = 0.9^{5} = 0.5905$$
There is a 59.05% probability that all will be for full-term births
b) At least three of the records will be for full-term births?
c) What are the mean, variance and standard deviation of the number of full-term births?
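For the last part of question 2, the properties of the binomial distribution (mean = np, variance = np(1 − p), standard deviation = √(np(1 − p))) can be applied directly. The sketch below is one way to do the arithmetic; the results are computed values, not answers given in the handbook.

```python
from math import sqrt

n = 5      # number of birth records selected
p = 0.9    # probability that a record is a full-term birth

mean = n * p                  # expected number of full-term births
variance = n * p * (1 - p)    # npq
std_dev = sqrt(variance)      # square root of the variance

print(f"mean = {mean:.2f}, variance = {variance:.2f}, sd = {std_dev:.4f}")
```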

Computing the probability of a range of outcomes

▪ No more than 1 = P(0 or 1 successes) = P(0 successes) + P(1 success)
▪ Exactly 5 = P(5 successes)
▪ 5 or more of 8 = P(5) + P(6) + P(7) + P(8)
▪ More than 8 of 10 = P(9) + P(10)
▪ More than two but less than six = P(3) + P(4) + P(5)
▪ Between 5 and 8 = P(6) + P(7)
▪ Between 5 and 8, inclusive = P(5) + P(6) + P(7) + P(8)
3. The likelihood that a patient with a heart attack dies of the attack is 4 of 100. Suppose 5 patients are selected who suffer a heart attack
a) What is the probability that all will survive?
Here a "success" is a death, so all surviving means x = 0
n = 5, x = 0, p = 4/100 = 0.04
$$P(X=0) = \frac{5!}{0!\,(5-0)!} \times 0.04^{0} \times (1-0.04)^{5-0}$$
$$P(X=0) = \frac{5!}{5!} \times 1 \times 0.96^{5} = 0.8154$$
There is an 81.54% probability that all patients will survive the attack
b) What is the probability that no more than 1 person dies of the heart attack?
No more than 1 = P(0 or 1 successes) = P(0 successes) + P(1 success)
P(X=0) = 0.8154; for P(X=1): n = 5, x = 1, p = 4/100 = 0.04
$$P(X=1) = \frac{5!}{1!\,(5-1)!} \times 0.04^{1} \times (1-0.04)^{5-1}$$
$$P(X=1) = \frac{5 \times 4!}{1 \times 4!} \times 0.04^{1} \times 0.96^{4}$$
P(X=1) = 5 x 0.04 x 0.8493 = 0.1699
P(0 or 1 successes) = P(x=0) + P(x=1)
P(0 or 1 successes) = 0.81537 + 0.16987
P(0 or 1 successes) = 0.9852
The probability that no more than 1 of 5 die from the attack is 98.52%.
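Range probabilities such as "no more than 1" are sums of individual binomial probabilities, which is easy to sketch in Python. The helper names `binomial_pmf` and `binomial_range` are illustrative; the inputs are those of question 3.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_range(low, high, n, p):
    """P(low <= X <= high): add the probabilities of each value in the range."""
    return sum(binomial_pmf(x, n, p) for x in range(low, high + 1))

# Question 3: n = 5 patients, p = 0.04 probability of dying
p_all_survive = binomial_range(0, 0, 5, 0.04)   # P(X = 0)
p_at_most_one = binomial_range(0, 1, 5, 0.04)   # P(X = 0) + P(X = 1)

print(f"P(all survive)     = {p_all_survive:.4f}")
print(f"P(no more than 1)  = {p_at_most_one:.4f}")
```

The same helper covers the other range patterns listed earlier, for example `binomial_range(9, 10, 10, p)` for "more than 8 of 10".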
4. After inspection of 15 pharmacy premises, it was found out 15% meet the
recommended guidelines. Find the probability
a) Less than 2 premises meet the guideline
b) Less than or equal to five meet the guidelines
5. About 60% of adults in Kabarole district have tested for HIV at some point
in their life. If 6 adults are randomly selected, find the probability that the
number of adults in the sample who have been tested for HIV at some point
in their life would be:
a) Exactly 3
b) More than 3
c) 5 or more (5 inclusive)
6. At Diabetes / hypertension clinic at Fort Portal Regional hospital, 55% of the
patients have hypertension. If 9 patients are randomly selected. What is the
probability that
a) Exactly 5,
b) More than 6,
c) Between 4 and 8; will have hypertension

HYPOTHESIS AND HYPOTHESIS TESTING

HYPOTHESIS

This is an assumption or conjecture about a population parameter which may or may not be true. A hypothesis defines the relationship between variables and is a formal question that a researcher has to resolve
A hypothesis may be defined as a proposition or a set of propositions set forth as an explanation for the occurrence of some specific group of phenomena. A research hypothesis is a predictive statement, capable of being tested by scientific methods, that relates an independent variable to some dependent variable
A hypothesis is a definite statement about a population parameter
A theory, on the other hand, is an idea or set of ideas that is intended to explain facts or events, or an idea that is suggested or presented as possibly true but that is not known or proven to be true
The hypothesis should contain: subject group, control group and outcome measure
A parameter is the characteristic of a population to be tested
Characteristics of a hypothesis
▪ Hypothesis should be clear and precise, must be specific. Stated in very
simple terms.
▪ Capable of being tested or measurable
▪ Relates the dependent and independent variables.
▪ Consistent with most known facts and based on the information from the review of literature
▪ Point towards line of action or research design
Sources of hypothesis
▪ Findings of other studies or cases
▪ Resemblance between phenomena
▪ Current popular beliefs, personal experiences and from competitors
▪ Scientific theories
Types of hypotheses
Two mutually exclusive types: null & alternative hypothesis
Null hypothesis
This is a statement which shows no significant difference, no change and no relationship between the parameters under study, or a statement about a population parameter (such as the mean) that is assumed to be true. The null hypothesis states that there is no relationship between two variables; the independent variable has no effect on the dependent variable. It is the main hypothesis being tested and is denoted by “Ho”.
Ho: µ1 = µ2

Alternative hypothesis
This shows that there is a significant difference, change, effect or relationship between the parameters. It is the logical opposite of the null hypothesis, stating that the actual value of the population parameter is less than, greater than or not equal to the value in the null hypothesis
The alternative hypothesis states that there is a relationship between the two variables of the study and that the results are significant to the research topic. The independent variable has an effect on the dependent variable. It is denoted by “Ha” or “H1”.
H1: µ1 ≠ µ2, H1: µ1 > µ2 or H1: µ1 < µ2
Mathematical symbols used
Null hypothesis: Ho                        Alternative hypothesis: Ha or H1
(=) equal to, is, same as,                 (≠) not equal, different from,
    not changed from                           changed from, not same as
(≥) greater than or equal to, at least     (>) greater than, above, higher than, longer
                                               than, bigger than, increased
(≤) less than or equal to, at most         (<) less than, below, lower than, shorter than,
                                               smaller than, decreased or reduced
State the null and alternative hypotheses of the following statements

1. The average age of patients visiting hospital in Fort Portal city is 28 years
Ho: µ = 28 years
H1: µ ≠ 28 years
2. The school records claim that the mean score in Biostatistics is 53% over the last 2 years. The new tutor in Biostatistics wishes to find out if the claim is true. The tutor tests if there is a significant difference between the average marks in the school records and those of the students in his class
Null hypothesis: The mean score in Biostatistics of students is 53%; Ho: µ = 53%
Alternative hypothesis: The mean score in Biostatistics of students is not 53%; H1: µ ≠ 53%
3. The average number of drugs in a pharmacy is at most 850
Ho: µ ≤ 850 drugs
H1: µ > 850 drugs
4. The researcher wants to test whether the average height of boys in a certain group is different from 153 cm
Ho: µ = 153 cm
H1: µ ≠ 153 cm
5. A pharmacy student wishes to carry out a study on the volume of a cough syrup packed in a bottle, to find out if it is less than the 500 ml indicated on the label
Ho: µ ≥ 500 ml
H1: µ < 500 ml

Forms of hypotheses include

▪ Simple hypothesis: shows a relationship between one dependent variable and a single independent variable. For example, if you spend more time reading (independent variable), you will score highly in the exams (dependent variable)
▪ Complex hypothesis: shows the relationship between two or more independent and dependent variables
▪ Directional hypothesis: derived from theory and specifies the direction of the relationship between variables
▪ Non-directional hypothesis: used when there is no theory involved. It is a statement that a relationship exists between two variables, without predicting the exact nature (direction) of the relationship.
▪ Associative hypothesis: occurs when there is a change in one variable resulting in a change in the other variable
▪ Causal hypothesis: proposes a cause and effect interaction between two or more variables, or proposes an effect on the dependent variable due to manipulation of the independent variable

HYPOTHESIS TESTING

This is a decision-making process for evaluating claims about a population. It is testing an assumption that is made about a population.
Basic concepts in hypothesis testing
▪ The hypothesis the researcher wants to test is called the alternative hypothesis H1. The objective is to DISPROVE the null hypothesis.
▪ The significance level is the critical probability used in choosing between the null hypothesis and the alternative hypothesis
General procedure for hypothesis testing
▪ Define the variables
▪ Formulate H1 and H0
▪ Choose level of significance
▪ Select appropriate test
▪ Calculate the test statistic
▪ Determine the probability associated with the statistic.
▪ Interpret results: compare the p-value with the level of significance, α, or determine if the test statistic falls in the rejection region.
▪ Reject or do not reject null hypothesis
▪ Make or draw a conclusion
Define the variable
Independent variables are the ones which are manipulated, controlled, or
changed. Independent variables are isolated from other factors of the study.
Dependent variables, as name suggests are dependent on other factors of the study.
They are influenced by the change in independent variable.

Formulate H1 and H0
Null hypothesis represents status quo. Alternative hypothesis represents the desired result.
Choose level of significance
The significance level (α) states the probability of incorrectly rejecting H0 when it is true.
Types of error which can occur include: Type I and Type II errors (discussed later)
Usually, researchers select a significance level of 5% (0.05) or 1% (0.01), and it is set prior to actual testing
This means there is a 5% (0.05) or 1% (0.01) probability of wrongly rejecting a true null hypothesis, or that the researcher desires to be 95% or 99% confident before rejecting the null hypothesis
Type I error
These are false positives, which happen in hypothesis testing when the null hypothesis is true but is rejected. It occurs when the researcher reports a statistically significant difference when there isn’t any. A test with a 95% confidence level means that there is a 5% chance or probability of making a Type I error when the null hypothesis is true. The significance level therefore specifies the criterion for deciding whether the claim being tested is rejected or not
This can happen through an inappropriate statistical analysis technique, poor sample selection method, small sample size, chance or bad luck etc. The probability of this error is denoted by alpha (α)
Type II error
This occurs when the researcher accepts (fails to reject) a null hypothesis that is false; its probability is denoted by beta (β). The probability of correctly rejecting a false null hypothesis is called the power of the test; it is determined by 1 − β and is generally kept at 80%
This plays a role in sample size determination
                 Accept null           Reject null
Null is true     Correct – no error    Type I error
Null is false    Type II error         Correct – no error
Note: Both are serious, but traditionally Type I error has been considered more serious; that is why the objective of hypothesis testing is to reject H0 only when there is enough evidence against it.
Therefore, α is chosen to be as small as possible without compromising β.
Increasing the sample size for a given α will decrease β
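The link between sample size, α and β can be illustrated with a small sketch. It assumes a one-sided z-test with known population standard deviation, and the effect size (a true mean 2 units above the null value) and σ = 10 are hypothetical numbers chosen only for illustration. `statistics.NormalDist` is part of the Python standard library (3.8 or later).

```python
from statistics import NormalDist

def type_ii_error(n, effect, sigma, alpha=0.05):
    """Beta for a one-sided (upper-tail) z-test when the true mean
    lies `effect` units above the value stated in H0."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # critical value for the chosen alpha
    shift = effect * n**0.5 / sigma          # true effect measured in standard errors
    return std_normal.cdf(z_crit - shift)    # probability of failing to reject H0

# Hypothetical inputs: effect = 2, sigma = 10, alpha fixed at 5%
sample_sizes = [10, 30, 100, 300]
betas = [type_ii_error(n, effect=2, sigma=10) for n in sample_sizes]
for n, beta in zip(sample_sizes, betas):
    print(f"n = {n:3d}: beta = {beta:.3f}, power = {1 - beta:.3f}")
```

With α held fixed, each larger sample gives a smaller β and therefore a higher power (1 − β).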
Select appropriate test for the hypothesis
The test statistic generates the p-value used to determine whether the null hypothesis
should be rejected or retained (accepted). The selection of a proper test depends on:
▪ Scale of the data: categorical or interval
▪ Statistic to be measured or compared: means
▪ Sampling distribution of such statistic: Normal Distribution or T
Distribution
▪ Number of variables to be measured or compared: univariate, bivariate and
multivariate

Examples of test
▪ Means
▪ Chi-square for proportions
▪ T-test (One tailed test or Two tailed test)
▪ Analysis of Variance (ANOVA)
▪ Z-test

Calculate the test statistic

The test statistic is used to define how far, or how many standard deviations, a sample mean is from the population mean. The larger the value of the test statistic, the further the distance, or number of standard deviations, a sample mean is from the population mean stated in the null hypothesis.
Tests commonly used are stated above and are explained later on
Determine the probability value (p-value)
The p-value is the probability of obtaining results at least as extreme as the
observed results of statistical test assuming the null hypothesis is correct.
A p value is used in hypothesis testing to help you support or reject the null
hypothesis.
The p-value is used as an alternative to rejection points to provide the smallest level
of significance at which the null hypothesis would be rejected. A smaller p-value
means that there is stronger evidence in favor of the alternative hypothesis or the
smaller the p-value, the stronger the evidence that you should reject the null
hypothesis.
P-values are usually found using p-value tables or spreadsheets/ statistical software.
P-values are expressed as decimals and can be converted to percentages.
P-values at 5%, 1% and 0.1% (P < 0.05, 0.01 and 0.001) levels are commonly used
from different p-value tables or distribution tables
Commonly used software in calculating p-value
▪ Spreadsheets or Microsoft Excel
▪ SPSS – Statistical Packages for the Social Sciences
▪ SAS – Statistical Analysis System
▪ Stata
▪ R – a statistical computing environment
Mathematically, the p-value is calculated using integral calculus from the area under
the probability distribution curve for all values of statistics that are at least as far
from the reference value as the observed value is, relative to the total area under the
probability distribution curve.
For example
A p-value of 0.0363 is 3.63%; it means that, if the null hypothesis were true, there would be a 3.63% chance of obtaining results at least as extreme as those observed.
A large p-value of 0.58 (58%) means that results at least as extreme as those observed would occur 58% of the time under the null hypothesis, so they give little evidence against it. Thus, the smaller the p-value, the more statistically significant the results.
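When the test statistic follows the standard normal distribution (a z-test), the area under the curve described above can be obtained from the cumulative distribution function rather than from tables. The following is a minimal sketch; the function name `p_value_from_z` is an illustrative choice, and the one-tailed branch assumes the observed statistic lies in the direction stated by the alternative hypothesis.

```python
from statistics import NormalDist

def p_value_from_z(z, two_tailed=True):
    """Area under the standard normal curve at least as far from 0 as z."""
    upper_tail = 1 - NormalDist().cdf(abs(z))   # area beyond |z| in one tail
    return 2 * upper_tail if two_tailed else upper_tail

p_two = p_value_from_z(1.96)                      # two-tailed test
p_one = p_value_from_z(1.645, two_tailed=False)   # one-tailed test
print(f"two-tailed p for z = 1.96:  {p_two:.4f}")
print(f"one-tailed p for z = 1.645: {p_one:.4f}")
```

Both values come out close to 0.05, matching the 5% critical values of z used in this handbook.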

Degrees of freedom
They are used to look up the critical value, which is then compared with the value from the test calculation (using t-test, chi-square, ANOVA, z-test etc) at a given confidence level
Degrees of freedom of an estimate is the number of independent pieces of information that went into calculating the estimate. In order to get the degrees of freedom (df) for the estimate, 1 is subtracted from the number of items under consideration
Degrees of freedom for one sample t-test = n − 1
Degrees of freedom for two sample t-test = n1 + n2 − 2
Formula for t-test (explained later)
$$t = \frac{X' - \mu}{s / \sqrt{n}}$$
▪ X’ – mean of sample, µ – mean of comparison, s – sample standard deviation and n – sample size
▪ The value from the t-test is compared with the critical value for the corresponding degrees of freedom, obtained from t-distribution tables, to make conclusions in hypothesis testing
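A one-sample t statistic, t = (X’ − µ) / (s / √n) with n − 1 degrees of freedom, can be sketched as follows. The ten measurements and the comparison mean of 5.0 are hypothetical values made up for illustration; the critical value 2.262 is the two-tailed 5% entry for 9 degrees of freedom in the t-distribution table.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of 10 measurements and a hypothesized mean of 5.0
data = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.7, 5.2, 5.9]
mu_0 = 5.0

n = len(data)
sample_mean = mean(data)          # X'
s = stdev(data)                   # sample standard deviation (divides by n - 1)
t_stat = (sample_mean - mu_0) / (s / sqrt(n))
df = n - 1                        # degrees of freedom for a one-sample t-test

t_critical = 2.262                # t table: df = 9, 5% two-tailed
reject = abs(t_stat) >= t_critical
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject}")
```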
χ2 (Chi-Squared) Distribution: Critical Values of χ2
        Significance level
Df      5%        1%        0.1%
1       3.841     6.635     10.828
2       5.991     9.210     13.816
3       7.815     11.345    16.266
4       9.488     13.277    18.467
5       11.070    15.086    20.515
6       12.592    16.812    22.458
7       14.067    18.475    24.322
8       15.507    20.090    26.124
9       16.919    21.666    27.877
10      18.307    23.209    29.588

t Distribution: Critical Values of t

Significance level
▪ The first row of percentages gives the significance level for a two-tailed test
▪ The second row of percentages gives the significance level for a one-tailed test
   10% 5% 2% 1% 0.2% 0.1% (two-tailed test)
Df 5% 2.5% 1% 0.5% 0.1% 0.05% (one-tailed test)
1 6.314 12.706 31.821 63.657 318.309 636.619
2 2.920 4.303 6.965 9.925 22.327 31.599
3 2.353 3.182 4.541 5.841 10.215 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869
6 1.943 2.447 3.143 3.707 5.208 5.959
7 1.894 2.365 2.998 3.499 4.785 5.408
8 1.860 2.306 2.896 3.355 4.501 5.041
9 1.833 2.262 2.821 3.250 4.297 4.781
10 1.812 2.228 2.764 3.169 4.144 4.587
11 1.796 2.201 2.718 3.106 4.025 4.437
12 1.782 2.179 2.681 3.055 3.930 4.318
13 1.771 2.160 2.650 3.012 3.852 4.221
14 1.761 2.145 2.624 2.977 3.787 4.140
15 1.753 2.131 2.602 2.947 3.733 4.073
16 1.746 2.120 2.583 2.921 3.686 4.015
17 1.740 2.110 2.567 2.898 3.646 3.965
18 1.734 2.101 2.552 2.878 3.610 3.922
19 1.729 2.093 2.539 2.861 3.579 3.883
20 1.725 2.086 2.528 2.845 3.552 3.850
21 1.721 2.080 2.518 2.831 3.527 3.819
22 1.717 2.074 2.508 2.819 3.505 3.792
23 1.714 2.069 2.500 2.807 3.485 3.768
24 1.711 2.064 2.492 2.797 3.467 3.745
25 1.708 2.060 2.485 2.787 3.450 3.725
26 1.706 2.056 2.479 2.779 3.435 3.707
27 1.703 2.052 2.473 2.771 3.421 3.690
28 1.701 2.048 2.467 2.763 3.408 3.674
29 1.699 2.045 2.462 2.756 3.396 3.659
30 1.697 2.042 2.457 2.750 3.385 3.646
32 1.694 2.037 2.449 2.738 3.365 3.622
34 1.691 2.032 2.441 2.728 3.348 3.601
36 1.688 2.028 2.434 2.719 3.333 3.582
38 1.686 2.024 2.429 2.712 3.319 3.566
40 1.684 2.021 2.423 2.704 3.307 3.551
42 1.682 2.018 2.418 2.698 3.296 3.538
44 1.680 2.015 2.414 2.692 3.286 3.526
46 1.679 2.013 2.410 2.687 3.277 3.515
48 1.677 2.011 2.407 2.682 3.269 3.505
50 1.676 2.009 2.403 2.678 3.261 3.496
60 1.671 2.000 2.390 2.660 3.232 3.460
70 1.667 1.994 2.381 2.648 3.211 3.435
80 1.664 1.990 2.374 2.639 3.195 3.416
90 1.662 1.987 2.368 2.632 3.183 3.402
100 1.660 1.984 2.364 2.626 3.174 3.390
120 1.658 1.980 2.358 2.617 3.160 3.373
150 1.655 1.976 2.351 2.609 3.145 3.357
200 1.653 1.972 2.345 2.601 3.131 3.340
300 1.650 1.968 2.339 2.592 3.118 3.323
400 1.649 1.966 2.336 2.588 3.111 3.315
500 1.648 1.965 2.334 2.586 3.107 3.310
600 1.647 1.964 2.333 2.584 3.104 3.307
∞ 1.645 1.960 2.326 2.576 3.090 3.291

Note: Other distribution tables can be obtained from official texts

Interpretation of results: Compare the p-value with the level of significance, α, or determine if the test statistic falls in the rejection region
Reject or do not reject H0
Make or draw a conclusion based on the results

TESTS FOR SIGNIFICANCE

Tests used for testing hypothesis; classified into


▪ Parametric tests or standard tests for hypothesis
▪ Non-Parametric tests
Note
▪ Unpaired samples: here there is no relation between subjects or two
independent groups
▪ Paired samples: here the measures are repeated and taken on the same group
such as pre-post test
Parametric test
Parametric statistics are used to make inferences about population parameters. These are statistical tests based on the assumption that the parameter is normally distributed. The data is derived from interval and ratio measurement
Assumption
▪ Normally distributed data
▪ No outliers
▪ Homogeneity of variances
▪ Random sampling used
▪ Independence of data
▪ Interval or ratio level measurement
Examples include: Z-test, Student’s t-test, F-test, One way Analysis of Variance, Regression, Pearson product moment correlation

NOTE:
1. Most computer packages incorporate the tests especially SPSS
2. Degree of freedom is calculated and compared with the reference tables found in
the official books
3. This is an introduction on statistical tests; refer to official books for more
information
Goodness-of-Fit test
Before performing a parametric test, check if the data is normally distributed by plotting a histogram; a goodness-of-fit test may also be performed
Goodness-of-fit tests are statistical methods often used to make inferences about observed values: to see how well sample data fit a distribution from a population with a normal distribution, or to determine if sample data represent the data expected to be found in the actual population.
Goodness-of-Fit tests can help determine if a sample follows a normal distribution, if categorical variables are related, or if random samples are from the same distribution.
Examples include
▪ The chi-square test
▪ Kolmogorov-Smirnov test
▪ Anderson-Darling test
▪ Shapiro-Wilk test
Chi Square
Chi-square is a non-parametric test, developed by Karl Pearson
This is the most commonly used goodness-of-fit test; it is used for categorical data such as gender, marital status, religion, color, race etc but not for numerical data. Chi-square tests involve checking if observed frequencies in one or more categories match expected frequencies
The data used in calculating a chi-square statistic must be random, raw, mutually exclusive, drawn from independent variables, and drawn from a large enough sample
χ is the Greek letter chi; χ2 is read as chi square. Formula for Chi-square
$$\chi^2_c = \sum \frac{(O_i - E_i)^2}{E_i}$$
Where; c = Degrees of freedom O = Observed value(s) E = Expected value(s)

The chi-square statistic compares the size of any discrepancies between the expected results and the actual results
For these tests, degrees of freedom are utilized to determine if a certain null hypothesis can be rejected based on the total number of variables and samples within the experiment. The frequency of the observed values is measured and subsequently used with the expected values and the degrees of freedom to calculate chi-square.
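The chi-square statistic, the sum of (O − E)² / E over all categories, can be sketched in a few lines of Python. The observed counts below are hypothetical (100 patients spread over four categories, with equal counts expected under the null hypothesis); the critical value 7.815 is the 5% entry for 3 degrees of freedom in the chi-squared table.

```python
# Hypothetical observed frequencies in four categories (total = 100)
observed = [30, 20, 25, 25]
# Expected frequencies if all four categories are equally likely
expected = [25, 25, 25, 25]

# Chi-square statistic: sum of (O - E)^2 / E over the categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1     # degrees of freedom for a goodness-of-fit test
critical_value = 7.815     # chi-squared table: df = 3, 5% significance level
reject = chi_square >= critical_value

print(f"chi-square = {chi_square:.3f}, df = {df}, reject H0: {reject}")
```

Here the statistic is well below the critical value, so the null hypothesis of equal frequencies is not rejected for these made-up counts.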

If the resulting p-value is lower than alpha, the null hypothesis is rejected, indicating a relationship exists between the variables
Application of chi-square test
Used in
▪ Test for homogeneity
▪ Goodness of fit of distribution
▪ Test for independence of parameters
Conditions for chi-square test
▪ Data must be in the form of frequencies
▪ Observations must be recorded on a random basis
▪ Parameters must be independent
▪ Data must be organized into groups or categories with precise numerical values
▪ A large sample size is needed, of about 50
Merits of chi-square test
▪ Can be applied for any distribution, either discrete or continuous, for which
the cumulative distribution function can be computed
▪ Can test association between variables
▪ Identifies difference between observed and expected values
▪ Easy and flexible calculation
▪ Provides detailed information
Limitations of chi-square
▪ Data must be in the form of frequencies (counts)
▪ It requires sufficient sample size
▪ Test does not indicate cause and effect
▪ Data must be from random sample
▪ Difficult to interpret
Z-test
The test is used for comparing the mean of a sample to some hypothesized mean for the population in the case of a large sample; thus the z-test is used when the sample size is large (≥30 elements) and when the standard deviation of the population is known
$$Z = \frac{X - \mu}{SE}$$ ; where
X – mean of sample, µ – population mean
SE – standard error = $\frac{\sigma}{\sqrt{n}}$; σ – standard deviation of the population, n – sample size

Level of significance    Z – value: Two tailed    Z – value: One tailed
1%                       2.576                    2.326
5%                       1.96                     1.645
If the absolute value of the calculated test statistic is less than the appropriate critical table value, do not reject the null hypothesis
If the absolute value of the calculated test statistic is equal to or greater than the appropriate critical table value, reject the null hypothesis and accept the alternative hypothesis
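The z-test steps above can be sketched as follows. The sample mean, hypothesized mean, population standard deviation and sample size are hypothetical numbers chosen for illustration; the critical value is obtained from the standard normal distribution with `statistics.NormalDist` and should agree with the table value 1.96 for a two-tailed test at 5%.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs for a two-tailed z-test at the 5% level
sample_mean = 52.0    # X: mean of the sample
mu = 50.0             # hypothesized population mean
sigma = 6.0           # known population standard deviation
n = 36                # sample size (large, n >= 30)
alpha = 0.05

standard_error = sigma / sqrt(n)          # SE = sigma / sqrt(n)
z = (sample_mean - mu) / standard_error   # Z = (X - mu) / SE

# Two-tailed critical value: alpha is split equally between the two tails
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) >= z_critical

print(f"z = {z:.3f}, critical value = {z_critical:.3f}, reject H0: {reject_h0}")
```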

Student’s t-test
This is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution if the null hypothesis is supported
The t-test is applied where the sample is small and the population standard deviation is unknown. It is exactly like the z-test in computation, except that the standard deviation of the sample is used instead of the standard deviation of the population
The t-test is used to compare the means of two samples and in the calculation of a confidence interval for a sample mean
The t-statistic was introduced in 1908 by William Sealy Gosset, who published under the pseudonym “Student”
Student’s t-distribution
A family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
It plays a role in a number of widely used statistical analyses, including the Student’s t-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and in linear regression analysis.
The t-distribution can be used to estimate how likely it is that the true mean lies in any given range
The t-distribution table gives, for each number of degrees of freedom, the critical t-value at different significance levels (explained above)
Assumption of t-test
▪ Samples are randomly selected
▪ Data is normally distributed
▪ Data variables are interval
Types or uses of t-test
▪ One-sample location t-test to compare one data sample to a hypothetical
distribution; used in measuring whether a sample value significantly differs
from a hypothesized value or whether the mean of a normally distributed
population has a value specified in a null hypothesis. The test applies to
continuous or non-continuous data that have a distribution that is not
significantly different from normal.

$$ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} $$
X̄ – mean of sample, µ – mean of comparison (the hypothesized value)
s – sample standard deviation, n – sample size
t-test for one sample is the most powerful parametric test for calculating the
significance of a small sample mean, <30. The test has the null hypothesis, or H0,
that the population mean equals the hypothesized value
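The one-sample t statistic can be computed from summary figures alone. The short Python sketch below is illustrative only: the function name is an assumption, and the figures (sample mean 79, hypothesized mean 75, s = 10, n = 25) are simply convenient example values.

```python
import math

def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """Return (t, degrees of freedom) for a one-sample t-test."""
    standard_error = sample_sd / math.sqrt(n)   # s / sqrt(n)
    t = (sample_mean - hypothesized_mean) / standard_error
    return t, n - 1

t, df = one_sample_t(sample_mean=79, hypothesized_mean=75, sample_sd=10, n=25)
print("t =", t, "df =", df)
```

The returned t is then compared with the critical value read from the t-distribution table at the chosen level of significance and degrees of freedom.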

▪ Paired t-test to compare paired data samples: this is used to compare two
population means where there are two samples in which observations in one
sample can be paired with observations in the other sample; such as a
comparison of two different treatments where the treatments are applied to
the same subjects such as measure the size of a cancer patient’s tumor before
and after a treatment. This test applies to two paired samples of continuous or
non-continuous data that have distribution non-significantly different from
normal and similar standard deviations (≤ 2 difference)
▪ Independent t-test or unpaired t-test to compare unpaired data samples; this aims to compare two unpaired data samples and applies to continuous or non-continuous data that have equal variances and a distribution that is normal or not significantly different from normal. Sample sizes should be similar (with ≤ 2 difference) for the two groups and, if n<30, variances should also be similar (with ≤ 2 difference).
Used when two separate, independent and identically distributed variables are measured, such as comparison of mean cholesterol levels in a treatment group with a placebo group after administration of a test drug, or comparing improvement in quality of life in patients who take drug A and those who take drug B
This is the most widely used parametric test; a 2-sample t-test is used to establish whether a difference occurs between the means of 2 similar data sets.
$$ t = \frac{\text{difference of group averages}}{\text{standard error of difference}} = \frac{\bar{x}_1 - \bar{x}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$
t is the t-value, x̄1 and x̄2 are the means of the two groups being compared, Sp is the pooled standard deviation of the two groups, and n1 and n2 are the number of observations in each of the groups.
Pooled variance of the two groups
$$ S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} $$
S1 – standard deviation for sample 1; S2 – standard deviation for sample 2
Pooled standard deviation of the two groups
$$ S_p = \sqrt{S_p^2} $$
A larger t-value shows that the difference between group means is large relative to the standard error of the difference, indicating a more significant difference between the groups.
Assumptions for two sample t-test
▪ Data values are independent
▪ Simple random sampling was used to generate the sample
▪ Data is normally distributed
▪ Measurements of data are continuous
▪ The variances for the two groups are equal
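As a rough illustration of the pooled two-sample t-test, the Python sketch below works from group means, standard deviations and sample sizes. The function name is hypothetical; the figures are the body-fat summary data (women: n = 10, mean 22.29, SD 5.32; men: n = 13, mean 14.95, SD 6.84) used in the worked examples of this chapter.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, degrees of freedom, pooled variance), assuming equal variances."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    se_diff = pooled_sd * math.sqrt(1 / n1 + 1 / n2)  # standard error of difference
    t = (mean1 - mean2) / se_diff
    return t, n1 + n2 - 2, pooled_var

t, df, sp2 = pooled_t(22.29, 5.32, 10, 14.95, 6.84, 13)
print("pooled variance =", round(sp2, 2), "t =", round(t, 2), "df =", df)
```

The degrees of freedom are n1 + n2 − 2 because one mean is estimated in each group.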

F-test
This is based on the F-distribution and is used to compare the variances of two independent samples. The test is also used in the context of analysis of variance for judging the significance of more than two sample means at one and the same time
The test statistic is calculated and compared with its critical value (seen in the F-ratio tables) for accepting or rejecting the null hypothesis
Analysis of Variance (ANOVA)
ANOVA is used to test hypothesis about the differences between two or more
means. There are two types
One-way ANOVA for unmatched samples
This is simplest type of ANOVA; where only one source of variation is investigated
in completely randomized experiment designs. It is an extension to three or more
samples of t-test procedure in 2 independent samples (unpaired student t-test is a
particular case of one-way ANOVA applied to two data samples) such as test of
effect of quality of several treatments of one cause of variation.
One way ANOVA is used to test the hypothesis that two or more samples are drawn
from the same distribution of values and have the same mean and variance.
Merits
▪ Very simple test to perform
▪ Reduce the experimental error to a great extent as in t-tests
▪ More variables can be investigated and suitable for laboratory experiment.
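The handbook does not spell out the one-way ANOVA arithmetic, so the following Python sketch is offered only as an assumption-laden illustration of the usual F ratio (mean square between groups divided by mean square within groups). The function name and the three small groups of values are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical data: three treatments, three observations each.
f_ratio, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print("F =", round(f_ratio, 2), "df =", (df_b, df_w))
```

The computed F is compared with the critical value from the F-ratio tables at (df_between, df_within) degrees of freedom.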
Two-way ANOVA for two-factor experiments
This type of test is used for experiments with two factors or two attributes or
variation
Non-parametric test
These tests are suitable for any continuous data and are based on the ranks of the data values. They do not assume that the data or the population follow any particular distribution, and are used for data that are not normally distributed
Examples of non-parametric and corresponding parametric tests
Parametric                          Non-parametric
2-sample independent t-test         Mann-Whitney U test
Dependent t-test for two samples    Wilcoxon signed-rank test
One-way ANOVA                       Kruskal-Wallis one-way
Two-way ANOVA                       Friedman two-way
                                    McNemar’s, Fisher exact test
Pearson’s correlation               Spearman’s rank correlation
Activity
Distinguish between parametric and non-parametric test

EXAMPLES IN HYPOTHESIS TESTING

1. A school wants to compare students’ scores with the national average. A simple random sample of 20 students scores an average of 50.2 on a standardized test. Their scores have a standard deviation of 2.5. The national average on the test is 60. The school wants to know if students scored significantly lower than the national average.
▪ Ho: µ = 60 (null hypothesis: students scored at the national average of 60)
▪ HA: µ < 60 (alternative hypothesis: students scored lower than the national average)
µ = 60, x̄ = 50.2, s = 2.5, n = 20
$$ SE = \frac{s}{\sqrt{n}} = \frac{2.5}{\sqrt{20}} = 0.559 $$
$$ t = \frac{50.2 - 60}{0.559} = \frac{-9.8}{0.559} = -17.5 \quad \text{(observed t statistic)} $$
Degrees of freedom = n-1 = 20-1 = 19.
From the Student’s t-distribution table, the one-tailed critical value at the 0.05 level of significance is t(19) = 1.729
The magnitude of the observed t statistic (17.5) is far greater than 1.729, so the p-value is far below 0.05
Reject the null hypothesis; the students did score significantly lower than the national average
2. A neurologist is testing the effect of a drug on response time by injecting 100 rats with a unit dose of the drug, subjecting each to a neurological stimulus and recording its response time. The neurologist knows that the mean response time for rats not injected with the drug is 1.2 seconds. The mean response time for the injected rats is 1.05 seconds with a sample standard deviation of 0.5 seconds. At the 95% confidence level, do you think the drug has an effect on response time?
Ho: µ = 1.2 seconds
HA: µ ≠ 1.2 seconds
µ = 1.2 sec, x̄ = 1.05 sec, n = 100 rats, SD = 0.5 sec
$$ SE = \frac{s}{\sqrt{n}} = \frac{0.5}{\sqrt{100}} = 0.05 $$
$$ t = \frac{1.05 - 1.2}{0.05} = -3 $$
Degrees of freedom: n-1 = 100-1 = 99
From the t-distribution table, the two-tailed critical value at the 0.05 level of significance with 99 degrees of freedom is approximately 1.984
The magnitude of the critical value is 1.984 and the magnitude of the calculated t-value is 3. Since the magnitude of the calculated test statistic is greater than the critical value, there is sufficient evidence to reject the null hypothesis; the drug has an effect on response time.

3. A certain brand of insulin was advertised as having an average shelf-life without refrigeration of 2,500 hours. A random sample of 50 products had a mean shelf-life of 2,470 hours and a standard deviation of 140 hours. With a 0.05 level of significance, is the sample mean less than the advertised mean?
Note: the t-test is used because the population standard deviation is not known.
Advertised average shelf-life μ = 2500 hours, n = 50, sample mean = 2470 hours, sample standard deviation = 140 hours.
H0: μ = 2500, HA: μ < 2500
$$ SE = \frac{s}{\sqrt{n}} = \frac{140}{\sqrt{50}} = 19.799 $$
$$ t = \frac{2470 - 2500}{19.799} = -1.5152 $$
The degrees of freedom associated with the above test statistic are: df = n−1 = 50−1 = 49
The one-tailed critical t-value at the 0.05 level of significance and with 49 degrees of freedom is obtained from the t-distribution table as approximately 1.676. Because HA is "less than", the null hypothesis is rejected only if t ≤ −1.676.
The calculated t (−1.515) does not fall below −1.676, hence the null hypothesis cannot be rejected at the 0.05 level of significance.
Or
The magnitude of the critical value is 1.676 and the magnitude of the calculated t-value is 1.515. Since the magnitude of the calculated test statistic is less than the critical value, there is insufficient evidence to reject the null hypothesis.
4. Test the hypothesis that a sample of size n = 25 with mean x̄ = 79 and standard deviation s = 10 was drawn at random from a population with mean μ = 75 and unknown standard deviation.
Note: the t-test is used since the standard deviation of the population is unknown
Ho: μ = 75, HA: μ ≠ 75
Sample mean x̄ = 79, population mean μ = 75, sample standard deviation s = 10, n = 25
$$ SE = \frac{s}{\sqrt{n}} = \frac{10}{\sqrt{25}} = 2 $$
$$ t = \frac{79 - 75}{2} = \frac{4}{2} = 2 $$
Degrees of freedom = n-1 = 25-1 = 24
At the common level of significance α = 0.05, the two-tailed critical values from the t distribution on 24 degrees of freedom are −2.064 and 2.064. The calculated t does not exceed these values, hence the null hypothesis cannot be rejected with 95 percent confidence. (For a one-tailed test, HA: μ > 75, the critical value is 1.711, which the calculated t does exceed.)

5. Estimations of plasma calcium concentration in previous studies in the population gave a mean of 3.2 mmol/l, with standard deviation 1.1. An investigation in a sample of 18 patients gave a mean close to 2.5 mmol/l. Is the mean in these patients abnormally low? Consider a 0.5% (0.005) level of significance
Sample mean x̄ = 2.5 mmol/l, population mean μ = 3.2 mmol/l, standard deviation = 1.1, n = 18
$$ SE = \frac{s}{\sqrt{n}} = \frac{1.1}{\sqrt{18}} = 0.259 $$
$$ t = \frac{2.5 - 3.2}{0.259} = \frac{-0.7}{0.259} = -2.7 $$
Degrees of freedom = n-1 = 18-1 = 17
At the level of significance α = 0.005 (one-tailed), the critical value from the t distribution on 17 degrees of freedom is 2.898 in magnitude. The magnitude of the calculated t (2.7) does not exceed this value, hence the null hypothesis cannot be rejected at this level of significance
6. In a random sample of 25 women, the mean term of pregnancy was 271.2 days. According to the records from MOH, the national mean term of pregnancy is 268 days with a standard deviation of 15. One doctor says the sample has the same mean term of pregnancy as the population while the other says they are different. At the 95% confidence level, which hypothesis is supported?
7. Assume the mean systolic blood pressure of students in a class is 130 mmHg. A random sample of 64 students had a mean systolic blood pressure of 132 mmHg with a standard deviation of 10 mmHg. Does the mean systolic blood pressure of this group of students differ from that of the students in school at a 5% level of significance?
8. The serum calcium concentration of 18 patients was measured and the sample mean was 3.1 mmol/l with a standard deviation of 1.2 mmol/l. The mean calcium concentration of the population was reported to be 2.5 mmol/l. Compare the sample mean with the population mean and make a report
9. In a comparative study of body fat in men and women, the following data was obtained
         n     Mean     SD
Women    10    22.29    5.32
Men      13    14.95    6.84
Do the study subjects have the same mean body fat?
Note: The two-sample t-test is an appropriate method to evaluate the difference in body fat between men and women
Ho: mean body fat for women is equal to that for men
HA: mean body fat is not equal for men and women
x̄1 – mean of body fat for women = 22.29, n1 = 10
x̄2 – mean of body fat for men = 14.95, n2 = 13
S1 – standard deviation for women = 5.32
S2 – standard deviation for men = 6.84
Pooled variance of the two groups
$$ S_p^2 = \frac{(10-1)(5.32 \times 5.32) + (13-1)(6.84 \times 6.84)}{10 + 13 - 2} = 38.86 $$
Pooled standard deviation of the two groups = √38.86 = 6.23
$$ t = \frac{22.29 - 14.95}{6.23 \times \sqrt{\frac{1}{10} + \frac{1}{13}}} = 2.8 $$
Degrees of freedom = n1 + n2 − 2 = 23 − 2 = 21
The two-tailed critical t value with α = 0.05 and 21 degrees of freedom is 2.080.
t = 2.8 > 2.08. The calculated t exceeds the critical value, hence the null hypothesis is rejected with 95 percent confidence

Note
In some of the following calculations, one first has to calculate the mean and standard deviation from the data

10. In comparative study of male and female patients reported the following data
on inspiratory vital capacity
Sample size (n) Mean Standard deviations
Men 16 2.8 0.8
Women 25 1.9 0.5
Do the patients have the same mean inspiratory vital capacity?

11. Engineers at a pharmaceutical company wish to compare two new tableting machines (A and B). 8 equal batches were run on each machine and the following data obtained. Data in millions of tablets
A 5.1 6.5 3.6 5.5 5.7 4.3 3.8 6.4
B 4.8 6.4 3.1 5.5 5.5 4.4 3.6 5.9
Test, at the 1% level of significance, the hypothesis that the mean production with machine B is less than the mean production with machine A

12. The following are the weights of patients visiting two pharmacies. Are the mean weights of the two samples from the two pharmacies similar?
Pharmacy A 55 53 60 71 96 41 49 91
Pharmacy B 51 64 98 80 79 79 58 83

CORRELATION AND REGRESSION

CORRELATION ANALYSIS

Correlation is the relationship or association between two quantitatively measured or continuous variables
Correlation is applied in quantifying the association between two continuous variables, for example between a dependent and an independent variable or between two independent variables; such as weight and cholesterol, weight and height, temperature and pulse rate etc.
Correlation coefficient is the extent or degree of relationship between two sets of variables; it is denoted by ‘r’. The extent of correlation varies between minus one and plus one, i.e. –1 ≤ r ≤ +1
Correlation determines the relationship between two variables, but it does not prove that one particular variable alone causes the change in the other, as the change in the same or opposite direction may be due to some other factor or factors.
Types of correlation
Correlation can be classified on the basis of;
▪ Degree of correlation: Positive, Negative and Absolutely no correlation
▪ Number of variables: Simple, multiple and partial correlation
▪ Linearity: Linear and non-linear correlation
Positive correlation
This is when the value of one variable increases as the value of the other increases.
There are two subtypes
▪ Perfect positive correlation
▪ Moderately positive correlation
Perfect positive correlation: In this type, the variables are directly proportional and fully correlate with each other; i.e. both variables rise or fall in the same proportion. The correlation coefficient (r) = +1. When a scatter diagram is drawn, all the points fall on a straight line

Moderately positive correlation: values of the coefficient (r) lie between 0 and +1, i.e. 0 < r < 1. This is the most common type of correlation, such as temperature and pulse rate. When a scatter diagram is drawn, an imaginary mean line can be drawn through the values as they rise

Negative correlation: when the value of one variable decreases as the value of the other increases. There are two subtypes
Perfect negative correlation: when the values of the two variables move in opposite directions in a fixed proportion, i.e. when one rises, the other falls in the same proportion, and the correlation coefficient (r) = –1. When a scatter diagram is drawn, all the observations fall on a straight line (no scatter), because one variable rises and the other falls in a fixed proportion
Examples of negative correlation may include temperature and number of colds in winter

Moderately negative correlation: In the scatter diagram, an imaginary mean line can be drawn through the values of the variables. The values of the coefficient (r) lie between –1 and 0, i.e. –1 < r < 0

Absolutely no correlation: This is when there is no linear dependence or no relation between the two variables and the value of the correlation coefficient is zero. When values are drawn on a scatter diagram, no imaginary line can be drawn

Linear correlation
Linear correlation exists if the ratio of change in two variables is a constant or if the
amount of change in one variable tends to bear a constant ratio to the amount of
change in other variable.
If the values are plotted on a graph, the result is a straight line

Non-linear correlation
This is also referred to as curvilinear correlation. This is when the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable

Simple correlation
This occurs when correlation is studied between two variables. Such as advertising
of commodities and sales made, price and quantities sold etc
Multiple correlation
This occurs when correlation is considered among three or more variables
simultaneously
Partial correlation
When one or more variables are kept constant and the correlation or relationship is
studied between other variables
DETERMINATION OF CORRELATION COEFFICIENT
Different methods are used to determine correlation coefficient; such as
▪ Pearson’s correlation coefficient
▪ Scatter diagrams
▪ Spearman’s Rank correlation

SCATTER DIAGRAMS
This is the simplest method of studying correlation between two variables (x and y)
Draw scatter diagrams
1. With appropriate scale, the two variables x and y are taken on the X and Y
axes of a graph paper. With each pair of x and y value, mark a dot and get as
many points as the number of pairs of observation.
2. Draw the straight line (best line of fit).
3. If all the plotted points lie on a straight line rising from the lower left-hand
corner to the upper right-hand corner, correlation is said to be perfectly
positive and if some of the plotted points fall in a straight line, there is
medium to weak positive correlation between variables.
4. If all the plotted points lie on a straight line falling from the upper left-hand
corner to the lower right-hand corner of the diagram, correlation is said to be
perfectly negative and if some of the plotted points fall in a straight line,
there is medium to weak negative correlation between variables.
5. If the plotted points lie scattered all over the diagram, there is no correlation
between the two variables.

The scatter diagram is a simple and non-mathematical method of studying correlation between variables. However, the method cannot measure the exact degree of correlation between the variables
Question
Given the following data on pairs of value of the variables X and Y
X 46 68 74 72 80 70 90 100 103
Y 67 48 49 37 26 55 55 35 46
a) Draw a scatter diagram and draw an estimated line
b) Is there any correlation between the variables x and y? positive or
negative; high or low

PEARSON’S CORRELATION COEFFICIENT


This is also referred to as Karl Pearson’s Product-Moment Correlation Coefficient. The Pearson’s Correlation Coefficient is a measure of the strength of a linear association between two variables and is denoted by r or rxy (x and y being the two variables involved). This method was developed by Professor Karl Pearson, a British statistician, and is the most commonly used
The Pearson’s Correlation Coefficient formula is expressed as
$$ r = \frac{\text{covariance}(x, y)}{S.D(x) \cdot S.D(y)} $$
Or
$$ r = \frac{n \sum dx\,dy - \sum dx \sum dy}{\sqrt{n \sum dx^2 - (\sum dx)^2} \cdot \sqrt{n \sum dy^2 - (\sum dy)^2}} $$
n = number of observations in pairs, dx = deviation of variable x
dy = deviation of variable y,
∑dx.dy = summation of the products of dx and dy
∑dx = summation of dx, ∑dy = summation of dy
∑dx2 = summation of the squares of dx, ∑dy2 = summation of the squares of dy
The above method is referred to as the assumed mean method
The actual mean method for the Karl Pearson formula: It is laborious because the means have to be found first
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \cdot \sum (Y - \bar{Y})^2}} $$
X – value of X series, X̄ – mean of X series
Y – value of Y series, Ȳ – mean of Y series
The above formulas are used to find the correlation coefficient for the given data.
Based on the value obtained through these formulas, one can determine how strong
is the association between two variable
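To illustrate that the two forms of the Pearson formula give the same answer, the Python sketch below computes r both from deviations about the actual means and from raw sums (deviations taken from an assumed mean of zero). The function names are hypothetical; the data are the correct-score and attitude figures that appear in the exercises of this section.

```python
import math

def pearson_actual_mean(xs, ys):
    """r from deviations about the actual means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pearson_raw_sums(xs, ys):
    """r from raw sums (assumed mean of zero, so dx = x and dy = y)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2))

score = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]
attitude = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]
print(round(pearson_actual_mean(score, attitude), 3),
      round(pearson_raw_sums(score, attitude), 3))
```

Both functions should agree to within floating-point rounding, since the second formula is an algebraic rearrangement of the first.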
Properties of Correlation Coefficient
r is interpreted using the following properties
1. The value of r ranges from – 1 to 1
2. A positive value of r shows a positive correlation between the two variables
3. A negative value of r shows a negative correlation between the two
variables
4. A value of r = 1 indicates that there exists perfect positive correlation
between the two variables
5. A value of r = - 1 indicates that there exists perfect negative correlation between the two variables
6. A value r = 0 indicates zero correlation i.e., it shows that there is no
correlation at all between the two variables.
7. A value of r = 0.9 and above indicates a very high degree of positive
correlation between the two variables
8. A value of - 0.75 ≥ r > - 1.0 shows a very high degree of negative
correlation between the two variables
9. For a reasonably high degree of positive correlation, we require r to be from
0.75 to 1.0
10. A value of r from 0.6 to 0.75 may be taken as a moderate degree of positive
correlation.
Terms used to define relationship between variables
▪ Strength: This is measured by r. It gives the degree or extent to which one variable changes consistently with a change in the other. Values that are close to +1 or -1 indicate a strong positive or strong negative relationship respectively (exactly +1 or -1 being perfect). Other correlation values between 0 and +1 or 0 and -1 indicate a moderately/partially/weakly positive or negative relationship. A correlation of 0 shows no linear relationship
▪ Direction of the relationship: This is indicated by the sign of the correlation; positive or negative. In positive correlation, the two variables change in the same direction, i.e. an increase in the value of one is accompanied by an increase in the value of the other; shown by an upward slope in the scatter diagram. A negative correlation depicts a downward slope. This means an increase in the value of one variable is accompanied by a decrease in the value of the other variable
Limitation of correlation
Although correlation is a powerful tool, there are some limitations in using it:
1. Outliers (extreme observations) strongly influence the correlation
coefficient. The outliers may be dropped before the calculation for
meaningful conclusion.
2. Correlation does not imply a causal relationship, i.e. that a change in one variable causes a change in another. An observed correlation may be due to pure chance, both variables may be influenced by one or more other variables, or the two variables may be mutually influencing each other so that neither can be designated as the cause and the other the effect.
Assumptions
The following assumptions are made in calculation of Pearson‘s Correlation
Coefficient
▪ Outliers are either kept to a minimum or removed entirely
▪ There is a linear relationship (or any linear component of the relationship)
between the two variables

1. The following data were obtained on the number of cigarettes smoked by subjects and the number of years they lived
Sno 1 2 3 4 5 6 7 8 9
Cigarettes per week 25 35 10 40 85 75 60 45 50
No. years lived 63 68 72 62 65 46 51 60 55
Calculate the coefficient of correlation between the number of cigarettes smoked per week in the last 5 years and the longevity of a test subject
x – the number of cigarettes smoked. y – years lived. n = 9
The raw values are used directly, i.e. deviations are taken from an assumed mean of zero, so that dx = x and dy = y
dx     dx2      dy     dy2      dx.dy
25     625      63     3969     1575
35     1225     68     4624     2380
10     100      72     5184     720
40     1600     62     3844     2480
85     7225     65     4225     5525
75     5625     46     2116     3450
60     3600     51     2601     3060
45     2025     60     3600     2700
50     2500     55     3025     2750
∑dx = 425, ∑dx2 = 24525, ∑dy = 542, ∑dy2 = 33188, ∑dx.dy = 24640
$$ r = \frac{9 \times 24640 - 425 \times 542}{\sqrt{9 \times 24525 - (425 \times 425)} \cdot \sqrt{9 \times 33188 - (542 \times 542)}} $$
r = - 0.61
This implies a negative correlation between the considered variables i.e. The
higher the number of cigarettes smoked per week in last 5 years, the lesser the
number of years lived. Note that it DOES NOT mean that smoking cigarettes
decreases the life span. Because, many other factors might be responsible for
one’s death
2. The following data was obtained from research of the heights (in inches) of
father and their eldest son from a village in Fort Portal. Compute the
correlation coefficient
Height fathers 66 68 63 67 64 69 72 68 70
Height of son 68 72 65 65 65 72 71 67 69
3. Find Karl Pearson‘s coefficient of correlation between the values of X and Y.
Find probable error (refer at the next subtopic)
X 46 68 74 72 80 70 90 100 103
Y 67 48 49 37 26 55 55 35 46

4. You are provided with the following data on students correct score and their
attitude. Calculate Pearson correlation coefficient (use actual mean method)
Correct score 17 13 12 15 16 14 16 16 18 19
Attitude 94 73 59 80 93 85 66 79 77 91
Let correct score be “x” and attitude be “y”; dx = X − X̄ and dy = Y − Ȳ
X      Y      dx      dy       dx.dy     dx square    dy square
17     94     1.4     14.3     20.02     1.96         204.49
13     73     -2.6    -6.7     17.42     6.76         44.89
12     59     -3.6    -20.7    74.52     12.96        428.49
15     80     -0.6    0.3      -0.18     0.36         0.09
16     93     0.4     13.3     5.32      0.16         176.89
14     85     -1.6    5.3      -8.48     2.56         28.09
16     66     0.4     -13.7    -5.48     0.16         187.69
16     79     0.4     -0.7     -0.28     0.16         0.49
18     77     2.4     -2.7     -6.48     5.76         7.29
19     91     3.4     11.3     38.42     11.56        127.69
∑x = 156, ∑y = 797, ∑dx.dy = 134.8, ∑dx square = 42.4, ∑dy square = 1206.1
Mean of x = 15.6, mean of y = 79.7
$$ r = \frac{134.8}{\sqrt{42.4 \times 1206.1}} = 0.596 $$

5. The following are the marks scored by 9 students in two tests in pharmaceutics. Calculate the coefficient of correlation from the following data and interpret.
Test-1 11 12 8 11 14 13 11 7 13
Test-2 12 13 7 10 13 11 12 6 12
6. The following data was obtained from a pharmaceutical company on the amount spent on advertising and the sales made (in millions) in 8 months. Determine the correlation coefficient between advertising amount and sales, and interpret the result.
Advertising amount 24 22 19 20 18 19 18 23
Sales 18 19 16 18 19 20 16 20
Probable error of coefficient of Correlation
Probable error of the coefficient of correlation is a statistical measure which
measures reliability and dependability of the value of coefficient of correlation.
If the probable error is added to and subtracted from the coefficient of correlation, it gives two limits within which we can reasonably expect the value of the coefficient of correlation to vary
Usually, the coefficient of correlation is calculated from samples. For different samples drawn from the same population, the coefficient of correlation may vary. But the numerical value of such variations is expected to be less than the probable error
The formula for calculating probable error is:
$$ P.E. = 0.6745 \times \frac{1 - r^2}{\sqrt{n}} $$
Where 0.6745 is a constant number, ‘r’ stands for the correlation coefficient and ‘n’ for the number of pairs of observations. The factor (1 − r²)/√n is the standard error of the coefficient of correlation
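A minimal Python sketch of the probable error calculation follows. It assumes the usual formulas S.E. = (1 − r²)/√n and P.E. = 0.6745 × S.E.; the function names and the figures r = 0.8, n = 16 are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def standard_error_r(r, n):
    """Standard error of the correlation coefficient: (1 - r^2) / sqrt(n)."""
    return (1 - r ** 2) / math.sqrt(n)

def probable_error_r(r, n):
    """Probable error: 0.6745 times the standard error."""
    return 0.6745 * standard_error_r(r, n)

# Hypothetical values for illustration.
r, n = 0.8, 16
pe = probable_error_r(r, n)
print("S.E. =", round(standard_error_r(r, n), 4), "P.E. =", round(pe, 4))
print("limits:", round(r - pe, 4), "to", round(r + pe, 4))
```

The two printed limits are r − P.E. and r + P.E., the range within which the coefficient is reasonably expected to vary.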

If r = 0.61 and n = 6, find probable error and standard error

SPEARMAN RANK CORRELATION COEFFICIENT


The Spearman’s rank coefficient of correlation is a nonparametric measure of rank
correlation or a new method of measuring the correlation between two variables.
Instead of taking the values of the variables, the ranks (or order) of the observations
are considered and coefficient of correlation for the ranks is calculated
Named after Charles Spearman, it is often denoted by the Greek letter ‘ρ’ (rho) or rs
and is primarily used for data analysis for qualitative characteristics such as
intelligence, beauty, morality, honesty etc
It measures the strength and direction of the association between two ranked
variables.
Formula for Spearman rank correlation coefficient:
$$ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$
n = number of data points of the two variables or number of pairs of observations
d = difference in ranks R1 – R2
6 – constant
The value of the Spearman coefficient, ⍴, ranges from +1 to -1 where,
▪ A ⍴ value of +1 means a perfect positive association of ranks
▪ A ⍴ value of 0 means no association of ranks
▪ A ⍴ value of -1 means a perfect negative association between ranks.
▪ The closer the ⍴ value is to 0, the weaker the association between the two ranks.
The steps in the calculation are shown in the following example

Question 1
The scores of 9 students in biostatistics and research methodology are mentioned in
the table below.
Biostatistics 35 23 47 17 10 43 9 6 28
Research 30 33 45 23 8 49 12 4 31
1. Start by ranking the two data sets. Data ranking can be achieved by assigning the rank “1” to the biggest number in the column, “2” to the second biggest number and so forth, so that the smallest value gets the last rank. This is done for both sets of data
2. Create a table of the data with at least 6 columns
3. In the fifth column, “d”, enter the difference between ranks (R1 – R2).
4. In the sixth column, “d2”, square your d values.
5. Add up all the d2 values to obtain “∑d2”
6. Insert the values in the formula
Biostatistics    Rank R1    Research    Rank R2    d     d2
35               3          30          5          -2    4
23               5          33          3          2     4
47               1          45          2          -1    1
17               6          23          6          0     0
10               7          8           8          -1    1
43               2          49          1          1     1
9                8          12          7          1     1
6                9          4           9          0     0
28               4          31          4          0     0
∑d2 = 12
$$ r_s = 1 - \frac{6 \times 12}{9(81 - 1)} = 0.9 $$

The Spearman’s Rank Correlation for this data is +0.9; as mentioned above, a ⍴ value close to +1 indicates a strong positive association of ranks.
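The ranking and summation steps can be sketched in Python as below. This is a simple illustration that assumes there are no tied values; the function names are hypothetical, and the scores are the values as they appear in the worked table above.

```python
def rank_descending(values):
    """Rank 1 = largest value. Assumes all values are distinct (no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    r1, r2 = rank_descending(xs), rank_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

biostatistics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
research = [30, 33, 45, 23, 8, 49, 12, 4, 31]   # values as in the worked table
print(rank_descending(biostatistics))
print(round(spearman_no_ties(biostatistics, research), 3))
```

When tied values occur, mid ranks and a correction term are needed instead of this simple version.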
Note:
While assigning ranks, if two or more items have equal values (i.e. if there occurs a tie), they may be given the mid rank. Thus, if two items are tied at the fifth rank, each may be ranked as (5 + 6)/2 = 5.5 and the next item in the order of size would be ranked seventh. If there are 3 items tied at the fifth rank, each is ranked (5 + 6 + 7)/3 = 6 and the next item is ranked eighth
When two or more ranks are equal, the following formula is used for computing rank correlation
$$ r_s = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}\sum (m^3 - m)\right]}{n(n^2 - 1)} $$
Where m stands for the number of items sharing an equal rank. The term (m³ – m)/12 is to be added to ∑d² in the numerator for each group of equal ranks, both in the x and y series

2. You are provided with the following data set


X 68 64 75 50 64 80 75 40 55 64
Y 62 58 68 45 81 60 68 48 50 70
Calculate the rank correlation coefficient
X Y R1 R2 d d2
68 62 4 5 -1 1
64 58 6 7 -1 1
75 68 2.5 3.5 -1 1
50 45 9 10 -1 1
64 81 6 1 5 25
80 60 1 6 -5 25
75 68 2.5 3.5 -1 1
40 48 10 9 1 1
55 50 8 8 0 0
64 70 6 2 4 16
∑d2 =72

In the above table, one can see that:
Tied value            m     m³ – m
75 occurs 2 times     2     2³ – 2 = 6
64 occurs 3 times     3     3³ – 3 = 24
68 occurs 2 times     2     2³ – 2 = 6
∑(m³ – m) = 36
$$ r_s = 1 - \frac{6\left(72 + \frac{36}{12}\right)}{10(100 - 1)} = 0.545 $$
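The tied-rank calculation above can be checked with a short Python sketch. The function names are hypothetical; the code assigns mid ranks to tied values and adds (m³ − m)/12 for every group of ties in either series, using the X and Y data of this question.

```python
from collections import Counter

def mid_ranks(values):
    """Rank 1 = largest; tied values share the mean of the ranks they occupy."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1      # first rank position held by v
        count = ordered.count(v)          # number of tied items (m)
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def tie_term(values):
    """Sum of (m^3 - m) over every group of tied values."""
    return sum(m ** 3 - m for m in Counter(values).values() if m > 1)

def spearman_tie_corrected(xs, ys):
    n = len(xs)
    r1, r2 = mid_ranks(xs), mid_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    correction = (tie_term(xs) + tie_term(ys)) / 12
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

x = [68, 64, 75, 50, 64, 80, 75, 40, 55, 64]
y = [62, 58, 68, 45, 81, 60, 68, 48, 50, 70]
print(mid_ranks(x))
print(round(spearman_tie_corrected(x, y), 3))
```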

3. Calculate Spearman’s rank correlation coefficient between the values of X and Y.
X 46 68 74 72 80 70 90 100 103 90
Y 67 48 49 37 26 55 55 35 46 47
4. Below are given the heights of fathers (X), and those of their sons (Y) in
centimeters. Calculate Spearman‘s rank Correlation coefficient.
X 180 155 170 174 160 172 166
Y 170 165 180 180 164 169 172
5. Calculate the Pearson Correlation between height and weight of the following
data. Explain the results
Height 159 155 163 172 179 159 156
Weight 54 165 56 70 76 53 47

REGRESSION ANALYSIS

Regression analysis refers to assessing the relationship between the outcome variable and one or more other variables
Regression means change in the measurements of a variable character, on the positive or negative side, beyond the mean
Regression coefficient is a measure of the change in the dependent character (Y) with one unit change in the independent character (X). The dependent variable is shown by “y” and independent variables are shown by “x” in regression analysis.
The outcome variable is known as the dependent or response variable, and the risk elements are known as predictors or independent variables.
Linear Regression
This is a basic and commonly used type of predictive analysis. Linear regression is a linear approach to modelling the relationship between one dependent variable and one or more independent variables.
▪ Simple linear regression: If the regression has one independent variable; i.e
one dependent variable and one independent variable
▪ Multiple linear regression: If regression has more than one independent
variable i.e one dependent variable and two or more independent variables
Properties of a regression coefficient
▪ The regression coefficient is denoted by b.
▪ The regression coefficient of y on x can be represented as b yx. The regression
coefficient of x on y can be represented as bxy. If one of these regression
coefficients is greater than 1, then the other will be less than 1.
▪ The arithmetic mean of both regression coefficients is greater than or equal to
the coefficient of correlation.
▪ They are not independent of a change of scale. The regression coefficients change if x and y are multiplied by different constants.
▪ The geometric mean of the two regression coefficients is equal to the correlation coefficient.
▪ If bxy is positive, then byx is also positive and vice versa.
Regression Coefficient
In the linear regression line, the equation is given by:
Y = a + bX
Y - dependent variable
a – Y intercept (from graphical representation of data)
b – slope
X – independent variable
Formula for slope: b = r × (SD Y / SD X); r – Pearson correlation
Formula for Y intercept: a = y’ – bx’; y’ – mean of y sample and x’ – mean of x sample
NOTE
▪ The above method is laborious when done by hand.
▪ If the means and r are not to be calculated, refer to standard books.
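The slope and intercept formulas above can be sketched as a small function that works from summary statistics only. This is a minimal illustration: the function names `line_from_summary` and `predict` and the summary values passed to them are hypothetical, not taken from this handbook.

```python
def line_from_summary(r, sd_x, sd_y, mean_x, mean_y):
    """Return (a, b) for the line Y = a + bX.

    b = r * (SD Y / SD X)
    a = mean of y - b * mean of x
    """
    b = r * sd_y / sd_x
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, x):
    """Predicted value of Y for a given X on the fitted line."""
    return a + b * x


# Hypothetical summary statistics, chosen so the arithmetic is easy to follow.
a, b = line_from_summary(r=0.5, sd_x=2.0, sd_y=4.0, mean_x=10.0, mean_y=20.0)
y_at_12 = predict(a, b, 12.0)

print(a, b, y_at_12)
```

Note that the fitted line always passes through the point (x’, y’): substituting X = x’ into Y = a + bX gives back y’.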

Regression coefficient of Y on X is denoted as byx (variation in Y when change in X
is given) and that of X on Y as bxy (variation in X when change in Y is given). The
following formulas are used:
byx = r × (SD of Y items / SD of X items)
bxy = r × (SD of X items / SD of Y items)

If means are already calculated, the regression coefficients are:

If means are not to be calculated, a simple and direct method is adopted as indicated
below:
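A minimal sketch of the two approaches, assuming the standard textbook forms. With means: byx = ∑(x − x’)(y − y’) / ∑(x − x’)². Without means (direct method): byx = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²). The function names and the small data set are hypothetical, and bxy follows by exchanging the roles of x and y.

```python
def coefficient_from_deviations(dep, indep):
    """Regression coefficient of dep on indep using deviations from the means."""
    n = len(dep)
    mean_dep = sum(dep) / n
    mean_indep = sum(indep) / n
    numerator = sum((i - mean_indep) * (d - mean_dep)
                    for i, d in zip(indep, dep))
    denominator = sum((i - mean_indep) ** 2 for i in indep)
    return numerator / denominator


def coefficient_from_raw_sums(dep, indep):
    """Same coefficient from raw sums, with no means calculated."""
    n = len(dep)
    sum_dep = sum(dep)
    sum_indep = sum(indep)
    sum_products = sum(i * d for i, d in zip(indep, dep))
    sum_indep_squared = sum(i * i for i in indep)
    return ((n * sum_products - sum_indep * sum_dep)
            / (n * sum_indep_squared - sum_indep ** 2))


# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

byx_dev = coefficient_from_deviations(y, x)   # y on x
byx_raw = coefficient_from_raw_sums(y, x)
bxy_dev = coefficient_from_deviations(x, y)   # x on y
bxy_raw = coefficient_from_raw_sums(x, y)

print(byx_dev, byx_raw)
print(bxy_dev, bxy_raw)
```

Both routes give the same coefficient. The raw-sum form avoids working with decimal deviations, which is why it is convenient for hand calculation.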

Question 1
You are provided with the following data on students’ correct scores and their
attitudes. Predict the attitude of a student who has a score of 11.
Score 17 13 12 15 16 14 16 16 18 19
Attitude 94 73 59 80 93 85 66 79 77 91
Let correct score be “x” and attitude be “y”
Y = a + bX
Y – dependent variable, a – Y intercept (from graphical representation of data), b –
slope and X – independent variable
b = r × (SD Y / SD X); r – Pearson correlation
a = y’ – bx’; y’ – mean of y sample and x’ – mean of x sample

x      y      x - x’    y - y’    (x - x’)(y - y’)    (x - x’) square    (y - y’) square
17     94      1.4      14.3       20.02               1.96              204.49
13     73     -2.6      -6.7       17.42               6.76               44.89
12     59     -3.6     -20.7       74.52              12.96              428.49
15     80     -0.6       0.3       -0.18               0.36                0.09
16     93      0.4      13.3        5.32               0.16              176.89
14     85     -1.6       5.3       -8.48               2.56               28.09
16     66      0.4     -13.7       -5.48               0.16              187.69
16     79      0.4      -0.7       -0.28               0.16                0.49
18     77      2.4      -2.7       -6.48               5.76                7.29
19     91      3.4      11.3       38.42              11.56              127.69
∑x = 156  ∑y = 797              ∑ = 134.8           ∑ = 42.4           ∑ = 1206.1
Mean: x’ = 15.6, y’ = 79.7

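r = ∑(x - x’)(y - y’) / √[∑(x - x’)² × ∑(y - y’)²] = 134.8 / √(42.4 × 1206.1) = 0.596

SD X = √[∑(x - x’)² / (n − 1)] = √[42.4 / (10 − 1)] = 2.17

SD Y = √[∑(y - y’)² / (n − 1)] = √[1206.1 / (10 − 1)] = 11.58

b = r × (SD Y / SD X) = 0.596 × (11.58 / 2.17) = 3.18
a = y’ – bx’ = 79.7 – (3.18 × 15.6) = 30.09
Y = a + bX
Y = 30.09 + (3.18 × 11)
Y = 65.07
The predicted attitude for a student with a score of 11 is therefore about 65.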
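The arithmetic of this worked example can be re-run in a few lines. The following is a minimal sketch using the ten score and attitude pairs from the table, with variable names chosen for this illustration. It keeps unrounded intermediate values, so its results may differ slightly in the second decimal place from a hand calculation that rounds at each step.

```python
import math

score = [17, 13, 12, 15, 16, 14, 16, 16, 18, 19]      # x
attitude = [94, 73, 59, 80, 93, 85, 66, 79, 77, 91]   # y

n = len(score)
mean_x = sum(score) / n
mean_y = sum(attitude) / n

# Column totals of the deviation table.
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(score, attitude))
sum_xx = sum((x - mean_x) ** 2 for x in score)
sum_yy = sum((y - mean_y) ** 2 for y in attitude)

# Pearson correlation and sample standard deviations (divisor n - 1).
r = sum_xy / math.sqrt(sum_xx * sum_yy)
sd_x = math.sqrt(sum_xx / (n - 1))
sd_y = math.sqrt(sum_yy / (n - 1))

# Slope, intercept and the prediction for a score of 11.
b = r * sd_y / sd_x
a = mean_y - b * mean_x
predicted = a + b * 11

print(f"r = {r:.3f}, SD X = {sd_x:.2f}, SD Y = {sd_y:.2f}")
print(f"b = {b:.2f}, a = {a:.2f}, predicted attitude = {predicted:.2f}")
```

With unrounded values the prediction comes out near 65.1. Note also that a score of 11 lies below the smallest observed score of 12, so the prediction is a slight extrapolation beyond the data.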
Differences between correlation and regression
▪ Correlation gives the degree and direction of relationship between the two
variables, whereas regression analysis enables us to predict the values of
one variable on the basis of the other variable. Regression can thereby be
used to study the cause and effect relationship between two variables.
▪ Correlation quantifies the degree to which two variables are
associated. Linear regression finds the best line that predicts y from x, but
correlation does not fit a line.
▪ Correlation is used when both variables are measured, while linear regression
is mostly applied when x, the independent variable, is manipulated.
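One practical consequence of these differences is that correlation treats the two variables symmetrically, while regression does not. The sketch below illustrates this with hypothetical data; the function names `pearson_r` and `slope` are assumptions made for this example.

```python
import math


def pearson_r(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def slope(dep, indep):
    """Slope of the regression line that predicts dep from indep."""
    n = len(dep)
    mean_d, mean_i = sum(dep) / n, sum(indep) / n
    numerator = sum((i - mean_i) * (d - mean_d) for i, d in zip(indep, dep))
    denominator = sum((i - mean_i) ** 2 for i in indep)
    return numerator / denominator


# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Correlation is the same whichever variable is listed first.
r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)

# Regression depends on which variable is being predicted.
b_yx = slope(y, x)   # predicts y from x
b_xy = slope(x, y)   # predicts x from y

print(r_xy, r_yx)
print(b_yx, b_xy)
```

Here the line of y on x has slope 0.6, whereas the line of x on y has slope 1.0, which is not simply the first line rearranged. The choice of dependent variable therefore matters in regression but not in correlation.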

REFERENCES

Arun Bhadra Khanal. Mahajan's Methods in Biostatistics for Medical Students and
Research Workers. 8th Edition. Jaypee Brothers Medical Publishers (P) Ltd
Bratati Banerjee. Mahajan's Methods in Biostatistics for Medical Students and
Research Workers. 9th Edition. Jaypee Brothers Medical Publishers (P) Ltd
K. Visweswara Rao et al. Biostatistics.
Marc M. Triola & Mario F. Triola. Biostatistics for the Biological and Health
Sciences with Statdisk. First edition 2014. Pearson New International Edition
T.D.V. Swinscow and M.J. Campbell. Statistics at Square One.
Wayne W. Daniel. Biostatistics. A Foundation for Analysis in the Health Sciences.
9th Edition. John Wiley & Sons, Inc
