Biostatistics for Pharmacy Students
Statistics intersects with many sectors, including healthcare. Pharmacy is a field that
relies on both science and patient care. As healthcare evolves, the need for
evidence-based medicine grows, and pharmacists are playing a more important role than ever before.
A critical aspect of this role is the capacity to understand and apply statistical data to its
full potential in order to achieve the desired patient outcomes.
This book aims to teach pharmacy practitioners, researchers, and students about the basic
concepts of pharmacy statistics. The emphasis here is on daily pharmacy activities like
designing clinical trials, measuring drug impacts, or examining patient records.
Understanding how to use data is vital to everyday pharmacy practice and that is why it is
essential to possess statistical knowledge.
Statistics is an area that can seem as complex as it is interesting, especially in the context of
pharmaceutical research. The aim of this text is to make these concepts easy to
understand. The book blends STEM pedagogies with non-traditional approaches to
foster an understanding of how statistical concepts can be used to address problems in
pharmacy.
In these chapters, we present statistical methods that are integral
to pharmaceutical research, drug development, clinical studies,
and other applied areas of pharmacy. The examples provided address particular
issues that pharmacy practitioners face, giving them a way through the statistical
information that affects their practice.
We hope that this book serves not only as a reference but also as motivation to
enhance your competence in understanding and using statistical data. It is through
these methodologies that one can meaningfully support pharmacy practice and
patient care services, and impact the healthcare system through significant research.
INDEX
4. Correlation
5. Regression
6. Probability
1. Introduction
statistical tests: data may be categorical, such as success or failure of a drug
treatment, or continuous, such as the level of drug concentration.
Common Statistical Tests in Inferential Statistics:
T-test: Compares the means of two groups (e.g., evaluating the average
cholesterol levels between two groups receiving different treatments).
ANOVA (Analysis of Variance): Compares means across three or more groups
(e.g., comparing the efficacy of three different drug dosages).
Chi-Square Test: Analyzes the difference between observed and expected
frequencies in categorical data (e.g., examining if adverse effects are distributed
differently between drug groups).
Correlation and Regression Analysis: Investigates the relationships between two
or more variables.
Pearson Correlation: Measures the strength and direction of a linear
relationship between two continuous variables.
Linear Regression: Predicts the value of one variable based on another.
Application in Pharmacy: Inferential statistics can be used to determine if a new
medication leads to significant improvements in health outcomes compared to an
existing treatment, or to forecast a patient’s response to therapy based on clinical data.
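To make this concrete, here is a minimal sketch (in Python, with hypothetical cholesterol values) of how an independent-samples t-test might be run with SciPy; the data and group names are illustrative, not from a real trial.

```python
# A minimal sketch (hypothetical data): comparing mean cholesterol levels
# between two treatment groups with an independent-samples t-test.
from scipy import stats

group_a = [210, 198, 205, 190, 215, 202]  # cholesterol (mg/dL), treatment A
group_b = [188, 195, 180, 192, 185, 190]  # cholesterol (mg/dL), treatment B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p suggests a real difference
```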
1.3. Data Collection
Pharmaceutical researchers often rely on a number of data collection instruments,
chosen to suit the purposes and objectives of the research, the nature of the study, and
the resources available. The following are some of the most basic considerations that
pharmaceutical researchers should be aware of during the entire process of data
collection:
Ethics Approval: Most research projects, especially those involving human
subjects, such as clinical trials, surveys, or interviews, must be approved by an
IRB or an ethics committee to ensure compliance with the principles of good
research practice.
Data Integrity: Data needs to be accurate and reliable, regardless of whether it is
quantitative or qualitative.
Ordinal Data: This is data that not only permits categorization, but also ranks or
places values in order. Still, the rank has no equal interval or constant difference
between any two consecutive ranks. Examples of ordinal data in pharmacy include:
Severity of Side Effects (e.g., mild, moderate, severe)
Pain Intensity (e.g., none, mild, moderate, severe)
Disease Staging (e.g., early-stage, mid-stage, late-stage)
b. Quantitative (Numerical) Data
Quantitative data is data that can be expressed in numerical terms, representing measurable
quantities. This type of data is typically analyzed using mathematical and statistical
methods and consists of two principal categories:
Discrete Data: This type of data consists of distinct, separate values, often
representing counts or whole numbers. Examples in pharmacy include:
Number of Tablets Prescribed
Frequency of Adverse Events Reported
Total Medication Doses Administered
Continuous Data: Unlike discrete data, continuous data can take any value within a
specified range and can be subdivided into finer increments. Examples in pharmacy
include:
Blood Pressure Readings (e.g., 120/80 mmHg)
Serum Drug Concentration Levels
Body Weight (e.g., 75.2 kg)
1.4.2. Scales of Measurement
There are four basic scales of measurement, each defining how values can be
ordered and which mathematical operations are appropriate:
a. Nominal Scale
The nominal scale defines data as being categorised into distinct groups or labels,
without any inherent order or ranking. The values are mere identifiers and convey no
quantitative or ranking sense.
Examples in pharmacy:
Medication type (e.g., generic vs. brand-name)
Prescription status (e.g., filled vs. unfilled)
Medication adherence (e.g., yes/no)
b. Ordinal Scale
This scale includes data that can be ranked in a particular order, but the intervals
between ranks may be inconsistent or unmeasurable.
Examples in pharmacy:
Pain intensity (e.g., none, mild, moderate, severe)
Degree of medication adherence (e.g., low, medium, high)
Classification of side effects (e.g., none, mild, moderate, severe)
c. Interval Scale
The data has equal and meaningful intervals between values but no absolute zero
point. Differences between values are measurable, but ratio statements such as
"twice as much" are not meaningful.
Examples in pharmacy:
Temperature (e.g., Celsius or Fahrenheit) – there is no absolute zero in these
scales.
Dates of sample collection – intervals between dates are meaningful,
but there is no true zero point.
d. Ratio Scale
The ratio scale features data with both equal intervals and an absolute zero point,
enabling a full range of mathematical operations, including addition, subtraction,
multiplication, and division.
Examples in pharmacy:
Body weight (e.g., 70 kg, 75 kg) – a weight of zero indicates the complete
absence of weight.
Height (e.g., 1.75 m)
Drug dosage (e.g., 500 mg)
1.5 Data Organization and Presentation
Organizing and presenting data systematically within pharmaceutical research is
a necessary requirement for reproducibility, transparency, and proper
understanding by stakeholders such as researchers, regulatory authorities, and
practitioners.
Dose Strength (mg)   No. of Patients (frequency)
200                  10
400                  15
600                  30
800                  20
1000                 5
[Bar chart: No. of Patients (frequency) vs. Dose Strength (mg)]
[Bar chart: No. of Patients vs. Level of Severity (Mild, Moderate, Severe)]
patient population, revealing whether most patients maintain concentrations within the
therapeutic range. Frequency distribution data for the numerical variable is presented
in Table 1.3 and Histogram 1.3.
Table 1.3: Drug Concentration Data Observed from 100 Patients.
Concentration Range (µg/ml)   No. of Patients (frequency)
0-10                          15
10-20                         25
20-30                         30
30-40                         20
40-50                         10
[Histogram 1.3: No. of Patients (frequency) vs. Concentration Range (µg/ml)]
Table 1.4: Cumulative Drug Concentration Data Observed from 100 Patients.
Concentration Range (µg/ml)   Cumulative No. of Patients
0-10                          15
10-20                         40
20-30                         70
30-40                         90
40-50                         100
[Cumulative frequency chart: No. of Patients vs. Concentration Range (µg/ml)]
Relative frequency represents the proportion of total data points that fall within each
class. To calculate this, divide the frequency of each class by the total number of
observations.
[Relative frequency chart: Relative Frequency vs. Concentration Range (µg/ml)]
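For illustration, the frequency, relative frequency, and cumulative frequency columns used in the tables and charts above can be computed from the Table 1.3 data with a few lines of Python:

```python
# Frequency, relative frequency, and cumulative frequency from Table 1.3.
classes = ["0-10", "10-20", "20-30", "30-40", "40-50"]
freq = [15, 25, 30, 20, 10]

total = sum(freq)
cumulative = 0
for cls, f in zip(classes, freq):
    cumulative += f
    print(f"{cls}: frequency={f}, relative={f / total:.2f}, cumulative={cumulative}")
```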
variance or standard deviation. This analysis can provide insights into the variability
of drug responses or the occurrence of side effects.
• Distribution Shape: Assess the distribution for normality (bell curve), skewness, or
kurtosis. For instance, a skewed distribution of drug concentration levels may indicate
that a small subset of patients is metabolizing the drug significantly faster or
slower than others.
Applications in Pharmaceutical Research
• Drug Stability Studies: In pharmaceutical stability testing, frequency distributions
can be used to examine the variation in chemical composition or the degradation of a
drug over time, under varying environmental conditions.
• Clinical Trials: Frequency distributions are particularly useful in pharmacokinetic
and pharmacodynamics studies to analyse the spread of drug concentrations, response
rates, or adverse events across different patient cohorts.
• Pharmacovigilance: In Pharmacovigilance, frequency distributions play a crucial
role in tracking the occurrence of adverse drug reactions (ADRs) or side effects,
enabling the identification of emerging trends or outliers in drug safety data.
• Dosage Studies: Frequency distributions also aid in analysing how different
dosages of a drug affect different populations, assist in finding appropriate dosage
ranges, and help flag patients at risk of under- or over-dosing.
1.7. Applications of Statistics in Pharmacy
Statistics plays a vital role in pharmaceutical research and practice, supporting sound
decision-making. Some of the primary areas where statistical methods are applied in
pharmacy include:
1.7.1. Clinical Trials
Clinical trials are fundamental for assessing the safety and efficacy of new drugs and
treatments. Statistical techniques are essential in various stages of trial design and
analysis:
Study Design: Statistical methodologies are employed to create robust clinical
trial designs, ensuring accurate evaluation of interventions. Key strategies
include randomization, stratification, and blinding, which minimize bias and
enhance the reliability of results. Randomized Controlled Trials (RCTs) are
considered the gold standard for clinical trials. Placebo-controlled trials are
conducted to differentiate the therapeutic effect of a drug from a placebo,
while crossover trials involve participants receiving both treatments in
sequence, allowing each to serve as their own control.
Sample Size Calculation: Prior to conducting trials, statisticians calculate the
required sample size to ensure the study is adequately powered to detect
significant differences between treatment groups (a worked sketch follows this list).
Analysis of Results: Various statistical tests (e.g., t-tests, ANOVA, chi-square
tests) are employed to compare results across treatment and control groups.
Advanced techniques, such as regression analysis and survival analysis, are
used for more intricate data, with survival analysis visualized using Kaplan-
Meier curves and the impact of multiple variables assessed using Cox
Proportional Hazards models.
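As a rough illustration of the sample-size step mentioned above, the sketch below applies the standard normal-approximation formula for comparing two means; sigma (the assumed common standard deviation) and delta (the smallest difference worth detecting) are made-up values.

```python
# Sketch of a two-sample sample-size calculation (normal approximation).
# sigma and delta are assumed illustrative values, not from a real protocol.
from math import ceil
from scipy.stats import norm

alpha, power = 0.05, 0.80
sigma = 10.0  # assumed common standard deviation of the outcome
delta = 5.0   # assumed smallest clinically important difference in means

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n_per_group = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
print(ceil(n_per_group))  # about 63 patients per group
```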
1.7.3. Pharmacokinetics
Pharmacokinetics, which studies the movement of drugs within the body, relies on
statistical models to analyze key relationships:
Dose-Response Relationships: Statistical models help define how different
dosages of a drug influence its concentration in the body over time. Methods
like compartmental modeling or non-compartmental analysis are commonly
used to analyze pharmacokinetic data.
Bioequivalence Studies: In the development of generic drugs, statistical
methods (e.g., ANOVA) are applied to compare the pharmacokinetic
parameters (such as Cmax, Tmax, AUC) of a generic drug with its branded
counterpart, ensuring that the generic is therapeutically equivalent.
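A minimal sketch of the non-compartmental analysis mentioned above: the AUC of a hypothetical concentration-time profile computed by the trapezoidal rule (all values are invented for illustration).

```python
# Non-compartmental sketch: AUC of a hypothetical concentration-time profile
# by the trapezoidal rule; all values are invented for illustration.
import numpy as np

t = np.array([0.0, 0.5, 1, 2, 4, 8, 12])           # time (h)
c = np.array([0.0, 2.1, 3.8, 3.2, 1.9, 0.6, 0.2])  # concentration (ug/mL)

auc = np.trapz(c, t)            # area under the concentration-time curve
cmax, tmax = c.max(), t[np.argmax(c)]
print(f"AUC = {auc:.2f} ug*h/mL, Cmax = {cmax} ug/mL at Tmax = {tmax} h")
```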
1.7.4. Pharmacovigilance
Pharmacovigilance-the safety monitoring of drugs-involves statistical analysis to
identify adverse effects and evaluate drug safety:
Adverse Drug Reactions (ADR) Reporting: Statisticians analyze ADR reports
to identify trends, evaluate the risks associated with drug side effects, and
monitor overall drug safety. Techniques such as calculating odds ratios and
relative risks, or conducting meta-analyses, are commonly used for this
purpose.
Data Mining and Machine Learning: In the era of big data, machine learning
algorithms are used to predict individual patient responses to medications,
considering clinical, demographic, and genetic data.
1.7.8. Drug Formulation
Statistical techniques optimize formulation of drugs and improve their effectiveness:
Optimization of Drug Formulations: Statistical tools, such as Design of
Experiments (DOE), are used to study drug formulation issues such as solubility,
stability, and bioavailability, helping to identify the most effective
and efficient formulation.
1.7.9. Data Interpretation and Evidence-Based Pharmacy
Evidence-based pharmacy depends heavily on statistics for clinical
decision-making:
Relative Risk (RR) and Odds Ratio (OR): These statistical measures help
quantify the relationship between a drug (or exposure) and its associated
outcomes, such as disease incidence.
Number Needed to Treat (NNT) and Number Needed to Harm (NNH): These
metrics help assess the effectiveness and safety of treatments, guiding
clinicians in determining the best therapeutic approaches.
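For illustration, a short worked computation of these metrics under assumed event rates (20% events in the control arm, 15% in the treated arm); the rates are hypothetical.

```python
# Worked example with assumed event rates: 20% events in the control arm,
# 15% in the treated arm.
cer, eer = 0.20, 0.15   # control / experimental event rates (assumed)
arr = cer - eer         # absolute risk reduction = 0.05
nnt = 1 / arr           # patients treated to prevent one extra event
rr = eer / cer          # relative risk
print(f"ARR = {arr:.2f}, NNT = {nnt:.0f}, RR = {rr:.2f}")  # NNT = 20
```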
1.7.10. Meta-Analysis
Meta-analysis is a procedure for combining statistical results from various studies to
enhance robustness and understanding of treatment effectiveness:
Fixed-Effect vs. Random-Effects Models: These statistical models are used to
combine results from different studies, accounting for variations between
them.
Forest Plots: These visual tools are used to represent the results of meta-
analyses, offering a clear overview of the effectiveness of a drug or treatment
approach.
2. Measures of Central Tendency
2.1. Mean (Average)
The mean is calculated by adding all the values together and dividing the sum by the
total number of values. It is the most widely used measure of central tendency and is
particularly effective for data that follows a normal distribution.
Formula:
\[ \text{Mean} = \frac{\sum X}{N} \]
Where ∑X is the sum of all values, and N is the number of values.
Example: If a group of five individuals has fasting blood glucose levels of 85, 90, 88,
92, and 80, the mean blood glucose level is:
\[ \frac{85 + 90 + 88 + 92 + 80}{5} = \frac{435}{5} = 87 \]
The formulas and notation for calculating the sample mean and the population mean
differ; however, the process for determining both follows a similar approach.
Sample Mean Formula: The sample mean is typically denoted as M or x̄. To calculate
the mean of a sample, the following formula is used:
\[ \bar{x} = \frac{\sum x}{n} \]
where ∑x is the sum of all values in the sample and n is the number of values in the sample.
Population Mean Formula: The population mean is written as μ (the Greek letter mu). For
calculating the mean of a population, use this formula:
\[ \mu = \frac{\sum X}{N} \]
Where:
μ: population mean
∑X: sum of all values in the population dataset
N: number of values in the population dataset
For grouped data, calculate the midpoint of each class as xi = (upper class limit + lower class
limit)/2, then calculate the mean by the formula Mean = ∑xifi / ∑fi, where fi is the frequency
of the i-th class and xi is its midpoint.
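As a quick sketch, the grouped mean formula above can be applied to the drug-concentration frequencies of Table 1.3:

```python
# Grouped mean from class midpoints and frequencies (Table 1.3 data).
bounds = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freq = [15, 25, 30, 20, 10]

midpoints = [(lo + hi) / 2 for lo, hi in bounds]
mean = sum(x * f for x, f in zip(midpoints, freq)) / sum(freq)
print(mean)  # 23.5
```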
2.2. Median
The median is the middle value in a dataset. The median is determined by arranging
all the individual values in a dataset from smallest to largest and the middle value is
identified. If the data set contains an odd number of values, the median is the exact
middle value. If the data set contains an even number of values, the median is
calculated as the average of the two middle values.
Example for an odd number of values (ungrouped data): If a group of 5 persons has the
following fasting blood glucose levels: 85, 90, 88, 92, and 80, the median blood
glucose level is found by arranging the values in order from least to greatest:
80, 85, 88, 90, and 92.
Since we have an odd number of values, the median is simply the exact middle value,
88.
Example for an even number of values (ungrouped data): If a group of 6 persons has the
following fasting blood glucose levels: 85, 90, 88, 92, 80, and 84, the median blood
glucose level is found by arranging the values in order from least to greatest:
80, 84, 85, 88, 90, and 92.
Since we have an even number of values, the median is the average of the two middle
values, 85 and 88, resulting in a median of 86.5.
Median of Grouped Data
To determine the median of grouped data, a systematic approach must be followed.
Grouped data is usually presented in a frequency distribution table, where the data is
organized into classes (or intervals) with their respective frequencies.
\[ \text{Median} = L + \frac{\frac{N}{2} - CF}{f} \times h \]
Where:
L = lower boundary of the median class
N = total number of observations
CF = cumulative frequency of the class preceding the median class
f = frequency of the median class
h = class width
Example: With N = 100 (so N/2 = 50), the cumulative frequency just greater than 50 is 65,
which corresponds to the class interval 20–30. Therefore, the median class is 20–30.
\[ \text{Median} = 20 + \frac{50 - 35}{30} \times 10 = 20 + (0.5 \times 10) = 25 \]
Median = 25
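The same calculation, wrapped in a small Python helper that mirrors the formula above (values taken from the worked example):

```python
# Grouped median, mirroring the worked example above.
def grouped_median(L, N, CF, f, h):
    """L: lower bound of the median class, N: total frequency,
    CF: cumulative frequency before the median class,
    f: frequency of the median class, h: class width."""
    return L + (N / 2 - CF) / f * h

print(grouped_median(L=20, N=100, CF=35, f=30, h=10))  # 25.0
```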
2.3. Mode
The mode is the value that appears most frequently in a dataset. A dataset can have no
mode (if no value repeats), one mode, or multiple modes.
While dealing with grouped data (or frequency distributions), the mode refers to the
value or class interval that appears most frequently. For grouped data, the mode can
be determined using a formula, especially when the data is organized in a frequency
distribution table. The formula for calculating the mode assumes that the data is
distributed relatively uniformly or symmetrically within the class intervals. This method
is suitable for unimodal distributions, but identifying the mode can become more
challenging with multimodal distributions.
1. Identify the modal class: The modal class is the class interval with the highest
frequency (the class with the greatest number of observations).
2. Apply the mode formula: After determining the modal class, use the following
formula to calculate the mode:
\[ \text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h \]
Where:
L = lower boundary of the modal class
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
h = class width
For the worked example:
\[ \text{Mode} = 20 + \frac{30 - 15}{(2 \times 30) - 15 - 20} \times 10 = 20 + \frac{15}{25} \times 10 = 20 + 6 = 26 \]
Mode = 26
The mode is an especially useful measure of central tendency when dealing with
categorical data, as it reveals the category that appears most frequently.
Example: Consider the following bar chart displaying the results of a survey about
people's preferred dosage form. The data collected from the survey are represented in
Table 2.1 and the bar chart in Fig 2.1.
Table 2.1: Survey results: what is your preferred dosage form?
Dosage form Frequency
Tablets 20
Capsules 15
Syrup 30
Ointment 10
Injection 5
Fig 2.1: Bar diagram for the survey results: what is your preferred dosage form?
[Histogram: Frequency vs. Systolic blood pressure (mm Hg)]
outliers (such as extremely high blood pressure values). In such a case, the mean is an
effective measure for summarizing this dataset.
However, when dealing with a skewed distribution, the median is often the most
reliable measure of central tendency. In a right-skewed distribution (where a few large
values exist), the mean will be higher than the median because the large values pull
the mean to the right, while the median remains closer to the center of the distribution.
Conversely, in a left-skewed distribution (where a few small values are present), the mean
will be lower than the median, as the smaller values drag the mean to the left, while the
median remains near the centre of the dataset.
Example:
If you're analyzing the impact of a new drug on blood pressure and most patients
experience a slight reduction in their blood pressure, but a few patients experience a
significant drop (outliers), the mean reduction in blood pressure will be higher than
the median. In this case, the median will more accurately represent the typical
response of most patients.
For example, consider the following distribution, which shows the systolic blood pressure
of individuals in a particular town.
Table 2.3: Systolic Blood Pressure Distribution Data
Systolic blood pressure (mm Hg) Frequency
100-110 5
110-120 10
120-130 40
130-140 25
140-150 20
[Histogram: Frequency vs. Systolic blood pressure (mm Hg), Table 2.3 data]
\[ \text{Estimated Mean} = \frac{\sum x_i f_i}{\sum f_i} = \frac{12950}{100} = 129.5 \]
The median for the above data is calculated as follows.
Class interval Frequency Cumulative Frequency
100-110 5 5
110-120 10 5+10=15
120-130 40 15+40=55
130-140 25 55+25=80
140-150 20 80+20=100
The median class: The total frequency is N=100. Half of N is 100/2=50. The
cumulative frequency just greater than 50 is 55, which corresponds to the class 120–
130. So, the median class is 120–130.
\[ \text{Median} = 120 + \frac{50 - 15}{40} \times 10 = 120 + \frac{35}{40} \times 10 = 120 + 8.75 = 128.75 \]
[Histogram: Frequency vs. Platelet Count (thousands)]
\[ \text{Estimated Mean} = \frac{\sum x_i f_i}{\sum f_i} = \frac{18600}{100} = 186 \]
The median for the above data is calculated as follows.
Platelet Count (thousands) Frequency(f) Cumulative frequency
40-80 10 10
80-120 0 10
120-160 0 10
160-200 45 55
200-240 45 100
The median class: The total frequency is N=100. Half of N is 100/2=50. The
cumulative frequency just greater than 50 is 55, which corresponds to the class 160–
200. So, the median class is 160–200.
\[ \text{Median} = 160 + \frac{50 - 10}{45} \times 40 = 160 + \frac{40}{45} \times 40 \approx 195.6 \]
The mean is significantly affected by the extremely low platelet counts, whereas the
median remains unaffected. Therefore, the median provides a more accurate
representation of the "typical" platelet count in hospitalised patients.
The mean is significantly affected by an extremely low platelet count, whereas the
median remains unaffected. Therefore, the median provides a more accurate
representation of the “typical” platelet count in hospitalised patients.
For ordinal data, the median or mode is generally the more suitable measure of central
tendency, while for categorical data, the mode is the preferred choice. The choice
between the mean and the median also determines which type of statistical hypothesis
test is appropriate for the data: the mean is typically used in parametric tests, whereas
the median is favoured in nonparametric tests.
The following table can be referenced to determine the most appropriate measure of
central tendency for different types of variables:
Table 2.5: Preferable Measure of Central Tendency for Different Variables
Type of Variable Best Suitable Measure of Central Tendency
Nominal Mode
Ordinal Median
Interval/Ratio (not skewed) Mean
Interval/Ratio (skewed) Median
2.5. Empirical Relation between Measures of Central Tendency
For moderately skewed distributions, the three measures are linked by the empirical
relationship Mode = 3 Median − 2 Mean (equivalently, 2 Mean + Mode = 3 Median).
For example, when tasked with calculating the mean, median, and mode of
continuous grouped data, you can first calculate the mean and median using their
respective formulas, and then determine the mode using this empirical relationship.
Example: The median and mode for a given data set are 26 and 24 respectively. Find
the approximate value of the mean for this data set.
2 Mean + Mode = 3 Median
2 Mean = 3 Median − Mode = 3 × 26 − 24 = 78 − 24 = 54
Mean = 27
Median:
When the distribution of drug efficacy responses is skewed, the median is a more
reliable measure of central tendency than the mean. For example, in cancer trials,
examining the median survival time or time to the first response can reduce the
influence of outliers.
Mode:
The mode is useful when identifying the most frequent outcome is important. For
example, in studies of adverse drug reactions (ADRs), the mode can highlight the
most commonly occurring side effect among patients.
2. Pharmacokinetics (PK) and Pharmacodynamics (PD)
Mean Drug Concentrations:
In pharmacokinetics, the mean drug concentration is essential for understanding how
a drug behaves in the body. It helps inform dosing regimens and determines the
therapeutic range of drugs.
Half-life and other PK Parameters:
Measures of central tendency can aid in evaluating the distribution of
pharmacokinetic parameters such as half-life, clearance, and volume of distribution,
helping optimize dosing strategies.
3. Pharmaceutical Quality Control and Manufacturing
Quality Control Testing:
In pharmaceutical manufacturing, the mean and standard deviation are employed to
assess the consistency and quality of drug products (e.g., tablet weight, dissolution
rate, or active ingredient concentration). Ensuring most samples meet desired
specifications is crucial for regulatory compliance.
Lot-to-Lot Consistency:
The mean and median can be used to monitor variability between different
manufacturing lots, ensuring batch consistency. Outlier analysis can identify
significant deviations from expected values.
4. Adverse Drug Reactions (ADR) Analysis
Mode and Frequency:
The mode can help to identify the most frequent ADRs, while the mean and median
summarize the overall occurrence of these reactions in clinical trial populations,
aiding in safety monitoring and post-marketing surveillance.
Toxicity Studies:
In preclinical toxicology studies, the mean dose at which toxicity occurs (e.g., lethal
dose or therapeutic index) is critical for establishing safety profiles of new drugs.
5. Bioequivalence Studies
Mean Pharmacokinetic Parameters:
In bioequivalence studies, comparing the distribution of pharmacokinetic parameters
(e.g., Cmax, Tmax, AUC) of generic and brand-name drugs ensures that the generic
drug performs similarly in terms of absorption and overall exposure.
Median:
The median is also useful in comparing the distribution of pharmacokinetic
parameters between the generic and reference drugs, especially when there is
variability in absorption rates.
6. Patient Demographics and Subgroup Analysis
Mean and Median Demographic Data:
In clinical research, measures of central tendency are used to analyze patient
demographics such as age, weight, and gender, ensuring that the trial sample is
representative of the target population. For instance, the mean age of participants can
help assess the alignment of the clinical trial sample with the general population.
Response by Subgroups:
When analyzing data by subgroups (e.g., age groups, gender, comorbidities), the
mean or median response to a drug can help identify differences in efficacy or side
effects across various population segments.
7. Dosing and Therapeutic Drug Monitoring (TDM)
Therapeutic Ranges and Dosing:
In therapeutic drug monitoring, the mean or median drug concentration is used to
determine the most suitable dosing regimen to achieve optimal therapeutic levels
while avoiding toxicity.
Drug Interactions:
When studying drug interactions, measures of central tendency can summarize the
effects of concomitant medications on the mean drug concentration or response in
patients.
8. Pharmaceutical Economics and Cost-Effectiveness Analysis
Cost per Outcome:
3. Measures of Dispersion
3.1. Range
The range represents the difference between the highest and lowest values in a
dataset. In pharmaceutical research, the range plays a crucial role in identifying
optimal conditions, dosage, and formulations for a drug, as well as evaluating
their safety, efficacy, and behaviour in the body. The range is calculated with the
following formula:
Range = Maximum value − Minimum value
Example: For the dataset 6, 15, 20, 8, 3:
Maximum value = 20
Minimum value = 3
Then,
Range = 20 − 3 = 17
During the initial phases of clinical trials, researchers assess a spectrum of drug
dosages to identify both the minimum effective dose and the maximum tolerated dose.
For example:
A study may involve testing different dosages to evaluate the potential for
adverse effects and determine a safe dose for human use.
Pharmaceutical companies frequently test the efficacy range of a drug across various
patient populations, disease severities, or in combination with other therapies. For
example:
The drug may be tested on patients with varying levels of disease severity (e.g.,
mild, moderate, and severe conditions) to determine its effectiveness across these
subgroups.
3.2. Variance
3.2.1. Estimation of Variance
It is calculated by averaging the squared differences between each data point and the
mean of the dataset.
For a population with values x1, x2, ..., xN and population mean μ, the variance σ² is:
\[ \sigma^2 = \frac{1}{N} \sum (x_i - \mu)^2 \]
Where N is the number of values in the population, xi is each individual value, and μ
is the population mean.
When dealing with a sample (a subset of the population), the variance formula is
modified to account for the possibility that the sample mean may not perfectly
represent the population mean. The sample variance (s²) is given by:
\[ s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 \]
Where n is the sample size and x̄ is the sample mean:
\[ \bar{x} = \frac{1}{n} \sum x_i \]
To compute the sample variance:
1. Calculate the sample mean x̄.
2. Compute the squared deviation from the mean for each data point: (xi − x̄)².
3. Sum the squared deviations: ∑(xi − x̄)².
4. Divide the sum by n − 1.
Example
Estimate the variance observed from the disintegration time data of 6 tablets: 2, 4, 6,
8, 10, and 12.
The mean is (2 + 4 + 6 + 8 + 10 + 12)/6 = 42/6 = 7.
Squared deviations: (2−7)² = 25, (4−7)² = 9, (6−7)² = 1, (8−7)² = 1, (10−7)² = 9, (12−7)² = 25
Sum: 25 + 9 + 1 + 1 + 9 + 25 = 70
Divide by n − 1 = 6 − 1 = 5:
s² = 70/5 = 14
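The same result can be checked with Python's statistics module:

```python
# Checking the sample variance (and standard deviation) in Python.
import statistics

times = [2, 4, 6, 8, 10, 12]
print(statistics.variance(times))  # 14 (sample variance, n - 1 denominator)
print(statistics.stdev(times))     # 3.7416... (sample standard deviation)
```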
Standard deviation is the square root of the variance and provides a measure of the
average distance between each data point and the mean. It is more interpretable than
variance because it is expressed in the same units as the data.
For a population with values x1, x2, ..., xN and population mean μ, the standard
deviation σ is:
\[ \sigma = \sqrt{\frac{1}{N} \sum (x_i - \mu)^2} \]
When dealing with a sample, the formula for the sample standard deviation (s) is
given by:
\[ s = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2} \]
Example: Estimate the sample standard deviation from the disintegration time data of 6
tablets: 2, 4, 6, 8, 10, and 12.
Squared deviations: (2−7)² = 25, (4−7)² = 9, (6−7)² = 1, (8−7)² = 1, (10−7)² = 9, (12−7)² = 25
Sum: 25 + 9 + 1 + 1 + 9 + 25 = 70
Divide by n − 1 = 6 − 1 = 5: s² = 70/5 = 14
s = √14 = 3.74
Alternatively, the sample standard deviation can be computed with the computational formula:
\[ s = \sqrt{\frac{1}{n-1} \left[ \sum x^2 - \frac{(\sum x)^2}{n} \right]} \]
Example: Estimate the sample standard deviation from the disintegration time data
of 6 tablets: 2, 4, 6, 8, 10 and 12.
x      x²
2      4
4      16
6      36
8      64
10     100
12     144
Total  42     364
\[ s = \sqrt{\frac{1}{5} \left[ 364 - \frac{(42)^2}{6} \right]} = \sqrt{\frac{1}{5} \left[ 364 - \frac{1764}{6} \right]} = \sqrt{\frac{1}{5} (364 - 294)} = \sqrt{\frac{70}{5}} = \sqrt{14} = 3.74 \]
When dealing with grouped data (data that has been organized into intervals or
classes), calculating the standard deviation requires a slightly different methodology
compared to raw data. The step-by-step process to estimate the standard deviation for
grouped data is as follows.
For each class interval, calculate the midpoint (xi) by averaging the lower and
upper bounds of the interval:
\[ x_i = \frac{\text{lower bound of the class} + \text{upper bound of the class}}{2} \]
Next, use the given frequency distribution to identify how many data points
belong to each class interval; let fi represent the frequency of the i-th class.
The mean (x̄) of the grouped data is calculated as a weighted average of the
midpoints, using the frequencies as weights:
\[ \bar{x} = \frac{\sum f_i x_i}{\sum f_i} \]
The variance is the frequency-weighted average of the squared deviations of the
midpoints from this mean:
\[ \sigma^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i} \]
Finally, take the square root of the variance to get the standard deviation:
\[ \sigma = \sqrt{\sigma^2} \]
For the worked example, calculate the mean:
\[ \bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{2450}{100} = 24.5 \]
Alternatively, the standard deviation can also be computed with the following
formula:
\[ s = \sqrt{\frac{1}{\sum f - 1} \left[ \sum f x_i^2 - \frac{(\sum f x_i)^2}{\sum f} \right]} \]
\[ s = \sqrt{\frac{1}{99} \left[ 77500 - \frac{(2450)^2}{100} \right]} = \sqrt{\frac{1}{99} (77500 - 60025)} = \sqrt{\frac{17475}{99}} = \sqrt{176.51} \approx 13.29 \]
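A quick check of this computation in Python, using the summary totals quoted above (the underlying frequency table is not reproduced here):

```python
# Checking the grouped standard deviation from the summary totals above.
from math import sqrt

sum_f = 100      # total frequency
sum_fx = 2450    # sum of f * x over all classes
sum_fx2 = 77500  # sum of f * x^2 over all classes

s = sqrt((sum_fx2 - sum_fx ** 2 / sum_f) / (sum_f - 1))
print(round(s, 2))  # 13.29
```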
results such as degradation rates or potency loss to assess the robustness of the
product.
3.5.8. Pharmacogenomics
Estimation of CV
\[ \mu = \frac{1}{N} \sum x_i \qquad \sigma = \sqrt{\frac{1}{N} \sum (x_i - \mu)^2} \qquad CV = \frac{\sigma}{\mu} \times 100 \]
Where:
o σ = standard deviation
o μ = mean of the dataset
3.5.2. Limitations of CV
Sensitivity to small mean values: When the mean of a dataset is very small,
the CV can become excessively large, even if the standard deviation is not
large, which might lead to misinterpretation of variability.
Not suitable for skewed distributions: CV assumes that the data follows a
normal distribution. For datasets that are highly skewed, the CV may not
provide an accurate representation of variability.
Influence of extreme values: Extreme outliers can have a significant impact on
the CV, leading to distorted results. In such cases, data cleaning or
transformation methods may be required to reduce the influence of these
outliers.
Formula: IQR=Q3−Q1
Where:
Q1 (25th percentile) is the value below which 25% of the data falls.
Q3 (75th percentile) is the value below which 75% of the data falls.
1. Sort the data (drug concentrations, ng/mL): 20, 28, 30, 32, 34, 36, 40, 42, 48, 50.
2. Find the Median (Q2): The median is the middle value of the sorted dataset. Since
there are 10 data points (an even number), the median is the average of the 5th and 6th
values: (34 + 36)/2 = 35.
3. Find Q1 (First Quartile): Q1 is the median of the lower half of the dataset
(20, 28, 30, 32, 34), which is 30.
4. Find Q3 (Third Quartile): Q3 is the median of the upper half of the dataset
(36, 40, 42, 48, 50), which is 42.
5. Calculate the IQR: Now that we have Q1 and Q3, we can calculate the
Interquartile Range (IQR):
IQR=Q3−Q1=42−30= 12
The IQR of 12 ng/mL tells us that the middle 50% of the drug concentrations
in this sample fall within a range of 12 ng/mL (from 30 ng/mL to 42 ng/mL).
A smaller IQR suggests that the data is more consistent, while a larger IQR
indicates more variability in the concentrations across patients.
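A short sketch reproducing the median-of-halves quartiles used above (note that library defaults, such as NumPy's percentile interpolation, can give slightly different quartiles):

```python
# Quartiles by the median-of-halves method used in the worked example.
import statistics

data = [20, 28, 30, 32, 34, 36, 40, 42, 48, 50]
lower, upper = data[:len(data) // 2], data[len(data) // 2:]
q1, q3 = statistics.median(lower), statistics.median(upper)
print(q1, q3, q3 - q1)  # 30 42 12
```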
3.5.6. Limitations
Does Not Reflect Entire Distribution: Although the IQR is useful for
understanding the spread of the central data, it doesn't account for the
behaviour of the entire dataset, especially the extremes (very high or low
values).
Requires Proper Data Handling: In cases where the data has many missing
values or is skewed, care must be taken when calculating IQR to ensure
accurate results.
The mean absolute deviation (MAD) is a statistical measure used to quantify the
dispersion or spread of a dataset. It represents the average of the absolute
deviations from the mean. This measure is particularly valuable for researchers or
analysts interested in the variability or consistency of data, because it is less
sensitive to extreme values (outliers) than measures such as the variance or
standard deviation.
\[ MAD = \frac{1}{n} \sum |x_i - \mu| \]
Where xi is each observed value, μ is the mean, and n is the number of observations.
Example: For five tablets with measured potencies of 99, 100, 101, 98, and 102 mg
(mean 100 mg), the absolute deviations are:
|99−100| = 1, |100−100| = 0, |101−100| = 1, |98−100| = 2, |102−100| = 2
MAD = (1 + 0 + 1 + 2 + 2)/5 = 6/5 = 1.2 mg
This MAD value of 1.2 mg indicates that each sample's potency deviates from the
mean by 1.2 mg on average. If this MAD value were very high, it could suggest
variability in the production process, requiring further investigation.
The Standard Error of the Mean (SEM) is a statistical measure that quantifies the
amount of variability or dispersion of sample means around the population mean. It
provides an estimate of how much the sample mean (average) of a dataset is likely to
differ from the true population mean.
\[ SEM = \frac{\sigma}{\sqrt{n}} \]
Where σ is the standard deviation (for a sample, the sample standard deviation s is
used) and n is the sample size. Two factors determine the SEM:
1. Sample Size: Larger sample sizes generally lead to a smaller SEM. This is
because increasing the number of observations in the sample reduces the error
in estimating the true population mean.
2. Standard Deviation: The SEM is directly affected by the variability in the data.
Higher variability (larger σ) leads to a larger SEM.
Interpretation: The SEM reflects how precise the sample mean is as an estimate of
the population mean. A smaller SEM means that the sample mean is a more
accurate reflection of the population mean.
Sample 1: 35 mg
Sample 2: 38 mg
Sample 3: 40 mg
Sample 4: 37 mg
Sample 5: 42 mg
Calculate the Standard Error of the Mean (SEM) for the sample.
The first step is to find the sample mean (x̄). This is done by adding all the
sample values and dividing by the number of samples:
x̄ = (35 + 38 + 40 + 37 + 42)/5 = 192/5 = 38.4
=192/5=38.4
Next, we need to calculate the sample standard deviation (SD). This step involves
finding how much each sample value deviates from the mean, squaring the deviations,
and then averaging them.
For each sample, subtract the mean (38.4 mg) from the sample value:
Sample 1: 35−38.4=−3.4
Sample 2: 38−38.4=−0.4
Sample 3: 40−38.4=1.6
Sample 4: 37−38.4=−1.4
Sample 5: 42−38.4=3.6
(−3.4)2=11.56
(−0.4)2=0.16
(1.6)2=2.56
(−1.4)2=1.96
(3.6)2=12.96
The variance is the average of the squared deviations. For a sample, we divide the
sum of squared deviations by n − 1, where n is the sample size (in this case, 5):
Variance = (11.56 + 0.16 + 2.56 + 1.96 + 12.96)/4 = 29.2/4 = 7.3
SD = √7.3 ≈ 2.70 mg
\[ SEM = \frac{SD}{\sqrt{n}} = \frac{2.70}{\sqrt{5}} = \frac{2.70}{2.236} = 1.21 \]
This means that the sample mean (38.4 mg) has an associated SEM of 1.21 mg,
indicating that the sample mean is likely to be within this range of the true population
mean based on this sample size and variability.
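The same SEM can be obtained directly with SciPy, which uses the sample standard deviation (ddof = 1) by default:

```python
# SEM of the five tablet potencies from the worked example.
from scipy import stats

potencies = [35, 38, 40, 37, 42]
print(round(stats.sem(potencies), 2))  # 1.21 (mg)
```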
In the pharmaceutical industry, the SEM is widely used in various stages of drug
development, quality control, and clinical trials. Here are some key applications:
Identifying outliers and variability: SEM can help identify outliers or unusual
data points in clinical safety trials. By assessing how much variation exists in
the reporting of adverse events, researchers can better determine whether the
events are truly rare or just a result of statistical noise.
Risk assessment: SEM can be part of risk assessment tools that help predict
the likelihood of adverse reactions in a population. A high SEM in adverse
event data suggests a high level of uncertainty about the generalizability of
safety results, prompting more rigorous monitoring.
4. Correlation
When two variables show no consistent relationship, they are said to have zero or no
correlation. In such cases, knowing one variable's value does not provide any insight
into the other. For example, there may be no correlation between the colour of a
pharmaceutical pill and the incidence of side effects reported by patients, or between
the month of the year and a drug's efficacy.
necessarily mean one is causing the other. In the context of the pharmaceutical
industry, causal correlation refers to a relationship where one variable directly impacts
another. For example, in a clinical trial, a pharmaceutical company may develop a
drug intended to lower blood pressure (BP). If patients who take the drug exhibit a
significant reduction in BP compared to those who receive a placebo, the medication
is considered the cause (independent variable), while the reduction in BP is the effect
(dependent variable).
\[ R^2 = \frac{\text{Explained variance}}{\text{Total variance}} \qquad R = \sqrt{R^2} \]
To evaluate the combined impact of these factors on the drug's ability to reduce BP,
the company could perform a multiple correlation analysis, typically through multiple
linear regression techniques.
Assume that the relationship between these variables is represented by a multiple
linear regression equation of the form:
BP reduction = β0 + β1(Dose) + β2(Age) + β3(Gender) + ε
Where the β coefficients quantify each factor's contribution and ε is the error term.
Interpretation of Results:
The multiple correlation coefficient (R) indicates the degree to which the
combination of dose, age, and gender explains the variation in BP reduction.
The R-squared (R²) value reveals the proportion of variance in BP reduction that is
explained by the independent variables (dosage, age, and gender).
Often referred to as the Pearson correlation coefficient (PCC), the Karl Pearson
correlation is a statistical measure that evaluates the strength and direction of the
linear association between two variables. It is denoted by r, with values
ranging between −1 and +1.
A value of 0 suggests the absence of any linear relationship between the two
variables.
Formula:
\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} \]
Where x̄ and ȳ are the means of the two variables.
Problem:
Consider the following example, where data are given on time and the percentage
of drug dissolved. We aim to find the PCC between time and percent drug dissolved.
Step-by-Step Calculation:
\[ r = \frac{394}{\sqrt{40} \times \sqrt{3914}} = \frac{394}{6.325 \times 62.56} = \frac{394}{395.69} = 0.9957 \]
The same result is obtained with the computational formula:
\[ r = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]\left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}} \]
\[ r = \frac{2254 - \frac{9300}{5}}{\sqrt{\left(220 - \frac{900}{5}\right)\left(23134 - \frac{96100}{5}\right)}} = \frac{2254 - 1860}{\sqrt{40}\,\sqrt{3914}} = \frac{394}{395.69} = 0.9957 \]
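A quick check of this arithmetic from the quoted summary totals (the raw data table is not reproduced here):

```python
# Recomputing r from the summary totals quoted above:
# n=5, sum_x=30, sum_y=310, sum_xy=2254, sum_x2=220, sum_y2=23134.
from math import sqrt

n, sx, sy, sxy, sx2, sy2 = 5, 30, 310, 2254, 220, 23134
r = (sxy - sx * sy / n) / sqrt((sx2 - sx ** 2 / n) * (sy2 - sy ** 2 / n))
print(round(r, 4))  # 0.9958 (~0.9957 with the rounding used above)
```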
To calculate the partial correlation between two variables (e.g., X and Y) while
accounting for the influence of a third variable Z, the formula is:
\[ r_{XY.Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}} \]
Where:
rXY represents the PCC between X and Y (the two variables of interest).
rXZ denotes the PCC between X and Z (the controlled variable).
rYZ signifies the PCC between Y and Z (the controlled variable).
rXY.Z is the partial correlation coefficient between X and Y, with the effect of
Z removed.
These correlations can be computed using the Pearson correlation formula. Inserting
the resulting values into the partial correlation formula gives, in this worked case:
rXY.Z = 1
Interpretation:
In the observed case, after accounting for the effect of tablet aging, it appears that as
binder concentration increases, the disintegration time of the tablet tends to
increase.
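A minimal helper implementing the partial correlation formula above; the three input coefficients in the usage line are illustrative values, not those of the worked example.

```python
# Partial correlation of X and Y controlling for Z, from the three
# pairwise Pearson coefficients.
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz):
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(round(partial_corr(r_xy=0.85, r_xz=0.40, r_yz=0.30), 3))  # 0.835
```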
Example: Imagine a pharmaceutical company has the following data on patient age,
drug dose, and BP reduction. The company seeks to determine the correlation between
age and drug dose while accounting for the influence of BP reduction.
rXY = 1 (this reflects a perfect positive linear relationship between age and
dosage in this instance).
The multiple correlation coefficient can be calculated with the following formula
(here for predicting variable Z from X and Y):
\[ R_{Z.XY} = \sqrt{\frac{r_{ZX}^2 + r_{ZY}^2 - 2\, r_{ZX}\, r_{ZY}\, r_{XY}}{1 - r_{XY}^2}} \]
In this (degenerate) worked case all pairwise correlations equal 1, and the
computation reduces to R = 1.
The multiple correlation coefficient (R) indicates a strong correlation between the
combination of age, dose, and BP reduction.
4.5. Correlation Measurement
Correlation can be quantified using three different methods: the scatter diagram,
Karl Pearson's coefficient of correlation, and Spearman's rank correlation
coefficient.
The scatter diagram is a straightforward and visually appealing method used to evaluate
the correlation between two variables by graphing their bivariate distribution. It helps
identify the nature of the relationship between the two variables and provides a clear
visual representation, offering the researcher or analyst insight into the association
between them. It is one of the most basic approaches for finding the relationship
between two variables, since it does not require any numerical calculations.
The two essential steps for creating a Scatter Diagram or Dot Plot:
1. Plot the values of the variables (say X and Y) along the X-axis and Y-axis
respectively.
2. Place dots on the graph corresponding to each pair of values.
Example:
Use a scatter diagram to represent the following values of X and Y, and then
analyse the type and degree of correlation.
X (Time, hr)   Y
1              9
2              17
4              35
6              56
8              74
10             98
[Scatter diagram: Y vs. X (Time, hr)]
The scatter diagram illustrates an upward trend in data points, moving from the lower
left-hand corner to the upper right-hand corner of the graph. This indicates a Positive
Correlation between the values of X and Y variables.
Spearman's rank correlation coefficient is computed from the rank differences:
\[ r_s = 1 - \frac{6 \sum D^2}{N^3 - N} \]
Where:
D = difference between the ranks of each paired observation
N = number of paired observations
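For illustration, Spearman's coefficient can be computed with SciPy; the two rank vectors below are hypothetical (e.g., dose-level rank vs. side-effect severity rank).

```python
# Spearman's rank correlation with SciPy (hypothetical ranked data).
from scipy import stats

dose_rank = [1, 2, 3, 4, 5, 6]
severity_rank = [2, 1, 4, 3, 6, 5]
rho, p_value = stats.spearmanr(dose_rank, severity_rank)
print(round(rho, 3))  # 0.829
```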
Clinical Trials:
Efficacy vs. Dose Response: Clinical trials often explore the correlation
between drug dose and therapeutic response to identify the optimal dose for
maximum benefit with minimal side effects.
Patient Outcomes: Correlating genetic or phenotypic data with clinical
outcomes is essential in personalized medicine, where treatments are tailored to
individual patient profiles, improving success rates and reducing adverse
effects.
Formulation Development:
5. Regression
Correlation and regression are both statistical methods used to examine the
relationship between two or more variables, but they serve different purposes and
provide different insights. They are summarised as follows
The least squares method is used to find the best-fitting line through a set of data
points. For a simple linear regression, which takes the form y = a + bx, where y is
the dependent variable, x is the independent variable, b is the slope of the line, and a is
the y-intercept, the least squares method uses specific formulas to calculate the
values of the slope (b) and the intercept (a) from the given data points:
1. Slope (b): \[ b = \frac{n \sum xy - (\sum x)(\sum y)}{n \sum x^2 - (\sum x)^2} \]
2. Intercept (a): \[ a = \frac{\sum y - b \sum x}{n} \]
Where n is the number of data points and the sums run over all observed (x, y) pairs.
Example: Fit the least squares line y = a + bx to the drug-release data tabulated
below (n = 5, ∑x = 30, ∑y = 265, ∑xy = 1944, ∑x² = 220).
Solution:
1. Slope:
\[ b = \frac{5(1944) - (30)(265)}{5(220) - (30)^2} = \frac{9720 - 7950}{1100 - 900} = \frac{1770}{200} = 8.85 \]
2. Intercept:
\[ a = \frac{\sum y - b \sum x}{n} = \frac{265 - 8.85(30)}{5} = \frac{265 - 265.5}{5} = \frac{-0.5}{5} = -0.1 \]
The fitted line is y = −0.1 + 8.85x.
Equivalently, the intercept can be computed from the means: a = ȳ − b x̄.
The steps to find the line of best fit by the least squares method are as follows:
Step 1: Denote the independent variable values as xi and the dependent ones as
yi.
Step 2: Calculate the means x̄ and ȳ and the deviations of each point from them.
Step 3: Presume the equation of the line of best fit as y = a + bx, where b is the
slope of the line and a represents the intercept of the line on the Y-axis.
Time (hr) x   Amount of drug released (mg) y   (x−x̄)   (y−ȳ)   (x−x̄)(y−ȳ)   (x−x̄)²
2             18                                −4       −35      140           16
4             34                                −2       −19      38            4
6             52                                0        −1       0             0
8             75                                2        22       44            4
10            86                                4        33       132           16
∑x = 30, x̄ = 6   ∑y = 265, ȳ = 53                                354           40
b = 354/40 = 8.85
a = ȳ − b x̄ = 53 − 8.85 × 6 = 53 − 53.1 = −0.1
Thus, we obtain the line of best fit as y = −0.1 + 8.85x, where the values of b and a are
calculated from the formulas defined above. This equation can be used to estimate
y for a given x; in this example, we can predict the amount of drug released at a
particular time.
For instance, the amount of drug released at the 5th hour is predicted by substituting
5 in place of x: y = −0.1 + 8.85(5) = 44.15 mg.
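The fit can be verified with NumPy's least squares polynomial fit:

```python
# Verifying the fitted line with NumPy's least squares polynomial fit.
import numpy as np

x = np.array([2, 4, 6, 8, 10])      # time (hr)
y = np.array([18, 34, 52, 75, 86])  # amount of drug released (mg)

b, a = np.polyfit(x, y, deg=1)      # slope, intercept
print(round(b, 2), round(a, 2))     # 8.85 -0.1
print(round(a + b * 5, 2))          # predicted release at hour 5: 44.15
```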
Example: Derive the regression line x = a + by for the same data.
Solution: For a simple linear regression of the form x = a + by, where x is the
dependent variable, y is the independent variable, b is the slope of the line, and
a is the x-intercept, the slope (b) and intercept (a) are calculated as:
1. Slope (b): \[ b = \frac{n \sum xy - (\sum x)(\sum y)}{n \sum y^2 - (\sum y)^2} \]
2. Intercept (a): \[ a = \frac{\sum x - b \sum y}{n} \]
With ∑y² = 17205:
\[ b = \frac{5(1944) - (30)(265)}{5(17205) - (265)^2} = \frac{9720 - 7950}{86025 - 70225} = \frac{1770}{15800} = 0.112 \]
\[ a = \frac{\sum x - b \sum y}{n} = \frac{30 - 0.112(265)}{5} \approx 0.063 \]
Equivalently, a = x̄ − b ȳ = 6 − 0.112 × 53 = 6 − 5.937 = 0.063.
The regression line is x = 0.063 + 0.112y.
This equation can be used to predict x for a given y; in this example, we can
predict the time required to achieve a required amount of drug release.
For instance, the time required to release 60 mg of drug is predicted by
substituting 60 in place of y: x = 0.063 + 0.112(60) = 6.78 hours.
Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε
Where:
Y = dependent (response) variable
X1, X2, ..., Xk = independent (predictor) variables
β0 = intercept
β1, β2, ..., βk = coefficients (represent the change in Y for a one-unit change in the
corresponding X variable)
ε = random error term
Example: Calculate the multiple regression equation for the following data.
First compute the deviation sums of squares and cross-products:
\[ \sum x_1^2 = \sum X_1^2 - \frac{(\sum X_1)^2}{n} = 55 - \frac{(15)^2}{5} = 55 - 45 = 10 \]
\[ \sum x_2^2 = \sum X_2^2 - \frac{(\sum X_2)^2}{n} = 90 - \frac{(20)^2}{5} = 90 - 80 = 10 \]
\[ \sum x_1 y = \sum X_1 y - \frac{\sum X_1 \sum y}{n} = 162 - \frac{15 \times 47}{5} = 162 - 141 = 21 \]
\[ \sum x_2 y = \sum X_2 y - \frac{\sum X_2 \sum y}{n} = 199 - \frac{20 \times 47}{5} = 199 - 188 = 11 \]
\[ \sum x_1 x_2 = \sum X_1 X_2 - \frac{\sum X_1 \sum X_2}{n} = 66 - \frac{15 \times 20}{5} = 66 - 60 = 6 \]
\[ b_1 = \frac{(\sum x_2^2)(\sum x_1 y) - (\sum x_1 x_2)(\sum x_2 y)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2} = \frac{(10)(21) - (6)(11)}{(10)(10) - (6)^2} = \frac{144}{64} = 2.25 \]
\[ b_2 = \frac{(\sum x_1^2)(\sum x_2 y) - (\sum x_1 x_2)(\sum x_1 y)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2} = \frac{(10)(11) - (6)(21)}{64} = \frac{-16}{64} = -0.25 \]
b0 = ȳ − b1 x̄1 − b2 x̄2 = 9.4 − (2.25)(3) − (−0.25)(4) = 9.4 − 6.75 + 1 = 3.65
The fitted equation is Y = 3.65 + 2.25X1 − 0.25X2. For example, at X1 = 3 and X2 = 5:
Y = 3.65 + 2.25(3) − 0.25(5) = 3.65 + 6.75 − 1.25 = 9.15
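The normal equations implied by these deviation sums can also be solved numerically, confirming the coefficients:

```python
# Solving the normal equations for b1 and b2 from the deviation sums above.
import numpy as np

A = np.array([[10.0, 6.0],     # [sum_x1_sq, sum_x1x2]
              [6.0, 10.0]])    # [sum_x1x2, sum_x2_sq]
rhs = np.array([21.0, 11.0])   # [sum_x1y, sum_x2y]

b1, b2 = np.linalg.solve(A, rhs)
b0 = 9.4 - b1 * 3 - b2 * 4     # b0 = y_bar - b1*x1_bar - b2*x2_bar
print(round(b1, 2), round(b2, 2), round(b0, 2))  # 2.25 -0.25 3.65
```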
The standard error of regression (also known as the standard error of the estimate) is a
measure of the accuracy of predictions made by a regression model. It quantifies the
typical distance between the observed values and the values predicted by the
regression model. A smaller standard error indicates that the data points are closer to
the regression line. A larger standard error suggests that the data points are more
spread out from the regression line. It helps to evaluate the precision of the regression
coefficients. It is used to construct confidence intervals for the regression parameters
and to perform hypothesis testing.
The standard error of the estimate is a way to measure the accuracy of the predictions
made by a regression model.
\[ \sigma_{est} = \sqrt{\frac{\sum (y - \hat{y})^2}{n}} \]
Where:
y: The observed value
ŷ: The predicted value
n: The total number of observations
Example: Calculate the standard error of regression for the drug-release data, using
the fitted line y = −0.1 + 8.85x.
Predicted values:
x = 2: ŷ = −0.1 + 17.7 = 17.6
x = 4: ŷ = −0.1 + 35.4 = 35.3
x = 6: ŷ = −0.1 + 53.1 = 53.0
x = 8: ŷ = −0.1 + 70.8 = 70.7
x = 10: ŷ = −0.1 + 88.5 = 88.4
\[ \sigma_{est} = \sqrt{\frac{\sum (y - \hat{y})^2}{n}} = \sqrt{\frac{27.1}{5}} = \sqrt{5.42} = 2.33 \]
For instance, the predicted release at hour 2 is ŷ = −0.1 + 8.85(2) = 17.6,
and we can obtain the 95% confidence interval for this estimate by using the
following formula:
= 17.6 − 1.96(2.33), 17.6 + 1.96(2.33)
= 17.6 − 4.57, 17.6 + 4.57
= 13.03, 22.17
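A short check of the standard error computation in Python:

```python
# Standard error of the estimate for y = -0.1 + 8.85x
# (n denominator, following the convention used above).
import numpy as np

x = np.array([2, 4, 6, 8, 10])
y = np.array([18, 34, 52, 75, 86])
y_hat = -0.1 + 8.85 * x

se = np.sqrt(np.sum((y - y_hat) ** 2) / len(y))
print(round(se, 2))  # 2.33
```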
Dose-Response Modelling:
Regression techniques are used to connect the relationship between drug
dosage and the corresponding biological response. A typical application
involves fitting a logistic or nonlinear regression model to predict the effect of
drug doses on patient outcomes.
Pharmacokinetics (PK) and Pharmacodynamics (PD) Analysis:
Regression models serve to examine the relationship between drug
concentration in the bloodstream and the pharmacological effects, helping to
optimize dosing regimens.
Clinical Trial Data Analysis:
Regression is employed to analyze the effectiveness of a treatment, adjusting
for variables such as age, gender, or baseline health conditions, ensuring that
the observed effects are attributed to the drug rather than external confounding
factors.
6. Probability
The general definition of probability for a random experiment or event is given by the
ratio of the number of favourable outcomes to the total number of possible outcomes,
assuming all outcomes are equally likely. Mathematically, the probability P(A) of an
event A is defined as:
\[ P(A) = \frac{\text{Number of favourable outcomes}}{\text{Total number of possible outcomes}} \]
1. Sample Space (S): The complete set of all possible outcomes for a random
experiment.
2. Event: A subset of the sample space, which can consist of one or more
outcomes.
3. Complementary Events: An event that signifies the non-occurrence of another
event, denoted Ac, with probability P (Ac) =1−P (A).
4. Independent Events: Two events are independent if the occurrence of one does
not affect the probability of the other.
5. Conditional Probability: The probability of an event occurring, given that
another event has already taken place.
Types of Probability:
The normal distribution (also known as the Gaussian distribution) is one of the most
important probability distributions in statistics and plays a crucial role in fields like
the pharmaceutical sciences, medicine, and many others. It is commonly used to
model continuous random variables that tend to cluster around a central mean value.
Some of the important properties of the normal distribution are listed below:
In a normal distribution, the mean, median, and mode are equal (i.e., Mean =
Median = Mode).
The total area under the curve is equal to 1.
The curve is symmetric about the centre.
Exactly half of the values lie to the right of the centre and exactly half to the
left.
The distribution is fully defined by its mean and standard deviation.
The curve has only one peak (i.e., it is unimodal).
The curve approaches the x-axis but never touches it, extending ever farther
from the mean.
The general probability density function (PDF) for the normal distribution is:
\[ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
Where μ is the mean and σ² is the variance of the distribution.
The data follow a normal distribution, and the probabilities can be computed with
the following formula:
Probability = number of favourable outcomes / total number of outcomes
The probability of getting a tablet having a weight 98 mg
= 10/100 = 0.1
The probability of getting a tablet having a weight 99 mg
=20/100=0.2
The probability of getting a tablet having a weight 100 mg
=40/100=0.4
The probability of getting a tablet having a weight 101 mg
=20/100=0.2
The probability of getting a tablet having a weight 102 mg
=10/100=0.1
The normal probability distribution graph is constructed in between tablet weight
(mg) in x axis and probability on y axis.
Tablet weight (mg)   Probability
98 0.1
99 0.2
100 0.4
101 0.2
102 0.1
[Normal probability distribution graph: Probability vs. Tablet Weight (mg)]
Example: Find the probability of getting a tablet weighing more than 100 mg.
P(X > 100) = 1 − P(X ≤ 100) = 1 − (0.1 + 0.2 + 0.4) = 1 − 0.7 = 0.3
We want to calculate the probability that a randomly selected tablet will have a
concentration of the active ingredient between 95 mg and 105 mg.
\[ Z = \frac{X - \mu}{\sigma} \]
Where X is the observed value, μ = 100 mg is the mean, and σ = 2 mg is the standard
deviation.
Z1 = (95 − 100)/2 = −5/2 = −2.5
Z2 = (105 − 100)/2 = 5/2 = 2.5
The probability that a randomly selected tablet will have a concentration of the active
ingredient between 95 mg and 105 mg is 0.9876 or 98.76%.
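The same probability follows directly from the standard normal CDF in SciPy:

```python
# P(95 <= X <= 105) for X ~ N(100, 2) via the standard normal CDF.
from scipy.stats import norm

p = norm.cdf(2.5) - norm.cdf(-2.5)  # z-scores computed above
print(round(p, 4))  # 0.9876
```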
\[ CI = \bar{X} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} \]
Where X̄ is the sample mean, σ the population standard deviation, n the sample size,
and Z_{α/2} the critical value of the standard normal distribution (1.96 for 95%
confidence).
The trial conducted on 100 patients, indicated that the average reduction in systolic
blood pressure for these patients is 15 mmHg and the standard deviation of the
reduction in systolic blood pressure for the population is known to be 10 mmHg.
Calculate a 95% confidence interval for the mean reduction in systolic blood pressure
based on this sample.
\[ SE = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{100}} = 1 \]
\[ ME = Z_{\alpha/2} \times SE = 1.96 \times 1 = 1.96 \]
\[ CI = \bar{x} \pm ME = 15 \pm 1.96 = [13.04, 16.96] \]
Batch Testing and Consistency: In the pharmaceutical industry, the quality of drug
products must be consistent. Normal distribution is commonly used to assess the
variations in parameters such as drug content, tablet weight, dissolution rates, and
other critical factors during manufacturing.
2. Clinical Trials
Clinical trial outcomes, including blood pressure readings, blood glucose levels,
serum drug concentrations, and other continuous variables, are often assumed to
follow a normal distribution. Statistical methods that rely on normality, such as
t-tests or ANOVA, are frequently employed to evaluate the efficacy of drugs and
their potential side effects.
3. Bioequivalence Testing:
5. Stability Testing:
The shelf life and stability of pharmaceutical products are crucial for ensuring
both their effectiveness and safety. Normal distribution can be used to model
degradation rates of active ingredients over time under different storage
conditions.
7. Population Pharmacokinetics
2. Two Possible Outcomes: Each trial must have exactly two possible outcomes,
commonly referred to as "success" and "failure". These outcomes are mutually
exclusive.
3. Independence:
The trials must be independent, meaning the outcome of any trial does not affect
the outcome of any other trial.
The probability of success, denoted by p, must remain the same for each trial. The
probability of failure is 1−p, and it also remains constant throughout all trials.
P(x; n, p) = nCx · p^x · (1 − p)^(n − x)
Where n is the number of trials, x is the number of successes, p is the probability of success in a single trial, and nCx = n!/(x!(n − x)!) is the binomial coefficient.
The mean of the binomial distribution is np, and the variance of the binomial distribution is np(1 − p).
A pharmaceutical company has developed a new drug, and clinical trials show that 80% of patients respond positively to the drug (i.e., the probability of success is 0.8). In a sample of 10 patients, we want to determine the probability of having a certain number of positive responses.
In this case, n = 10 and p = 0.8.
We can plot the binomial distribution of the number of successes (positive responses) for this scenario.
P(k) = 10Ck · (0.8)^k · (0.2)^(10 − k)
The binomial coefficients 10Ck = 10!/(k!(10 − k)!) are:
For k = 10: 10C10 = 10!/(10!(10 − 10)!) = 1
For k = 9: 10C9 = 10!/(9!(10 − 9)!) = 10
For k = 8: 10C8 = 10!/(8!(10 − 8)!) = 45
For k = 7: 10C7 = 10!/(7!(10 − 7)!) = 120
For k = 6: 10C6 = 10!/(6!(10 − 6)!) = 210
For k = 5: 10C5 = 10!/(5!(10 − 5)!) = 252
For k = 4: 10C4 = 10!/(4!(10 − 4)!) = 210
For k = 3: 10C3 = 10!/(3!(10 − 3)!) = 120
For k = 2: 10C2 = 10!/(2!(10 − 2)!) = 45
For k = 1: 10C1 = 10!/(1!(10 − 1)!) = 10
For k = 0: 10C0 = 10!/(0!(10 − 0)!) = 1
k (number of successes)   P(k)
0   0.0000001
1   0.0000041
2   0.0000737
3   0.000786
4   0.0055
5   0.026
6   0.088
7   0.2013
8   0.30199
9   0.2684
10   0.1074
[Figure: Binomial distribution of the number of successes (n = 10, p = 0.8) — probability (y-axis) against number of successes (x-axis).]
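The probabilities tabulated above can be generated, and the figure reproduced, with a short Python sketch (assuming scipy for the binomial PMF; a plotting library is optional):

from scipy.stats import binom

n, p = 10, 0.8                       # 10 patients, 80% response probability
for k in range(n + 1):
    # P(k) = 10Ck * p^k * (1 - p)^(10 - k)
    print(k, round(binom.pmf(k, n, p), 6))
# e.g. k = 8 gives 0.30199 and k = 10 gives 0.107374, matching the table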
Example: A drug is known to have a 70% success rate. Calculate the mean and standard deviation of the number of patients who respond positively to the drug out of a sample of 100 patients.
Solution:
μ = n·p
Where n = 100 (sample size) and p = 0.7 (probability of success).
μ = 100 × 0.7 = 70
The mean number of patients expected to respond positively to the drug is 70.
σ = √(n·p·q), where q = 1 − p = 0.3
σ = √(100 × 0.7 × 0.3) = √21 ≈ 4.58
The standard deviation of the number of patients who will respond positively to the drug is approximately 4.58.
Interpretation:
Mean (μ) = 70: On average, 70 out of 100 patients are expected to respond
positively to the drug in this clinical trial.
Standard Deviation (σ) ≈ 4.58: There is some variability in the number of
patients responding, and the typical deviation from the mean (70) is
approximately 4.58 patients.
Example 1: In a clinical trial, 10 patients receive a new drug with a 90% probability of responding positively. What is the probability that exactly 5 patients respond positively?
P(x; n, p) = nCx · p^x · (1 − p)^(n − x)
Where nCx = n!/(x!(n − x)!) is the binomial coefficient.
10C5 = 10!/(5!(10 − 5)!) = (10×9×8×7×6×5×4×3×2×1)/(5×4×3×2×1 × 5×4×3×2×1) = 252
P(5) = 252 × (0.9)^5 × (0.1)^5 = 252 × 0.59049 × 0.00001 ≈ 0.0015
Example 2: What is the probability of having no more than 2 defective pills out of 10 pills tested, if the defect probability is 1%?
Here n = 10 and p = 0.01, and we need P(X ≤ 2) = P(0) + P(1) + P(2).
For x = 0: 10C0 = 10!/(0!(10 − 0)!) = 1
P(0) = 1 × 1 × (0.99)^10 ≈ 0.9
For x = 1: 10C1 = 10
P(1) = 10 × 0.01 × (0.99)^9 = 0.1 × 0.91 ≈ 0.09
For x = 2: 10C2 = 45
P(2) = 45 × (0.01)^2 × (0.99)^8 = 0.0045 × 0.92 ≈ 0.004
P(X ≤ 2) = 0.9 + 0.09 + 0.004 = 0.994
The probability of having no more than 2 defective pills out of 10 pills tested is 0.994.
Example 3: What is the probability of finding at least 2 defective pills among the 10 pills tested?
P(X ≥ 2) = 1 − P(X ≤ 1)
Compute P(X ≤ 1) by summing the probabilities for x = 0 and x = 1:
P(X ≤ 1) = 0.9 + 0.09 = 0.99
The probability of at least 2 defective pills out of 10 pills tested = 1 − P(X ≤ 1) = 1 − 0.99 = 0.01.
Example 4: What is the probability of having at least 2 defective pills out of 10 pills tested, if the defect probability is 10%?
The value of P(X ≤ 1) from the cumulative binomial probabilities table for n = 10, x = 1 and p = 0.1 is 0.736 (Table 6.2).
P(X ≥ 2) = 1 − 0.736
= 0.264
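Instead of a table lookup, the cumulative binomial probability can also be computed directly; a minimal sketch in Python (assuming scipy):

from scipy.stats import binom

n, p = 10, 0.10                      # 10 pills, 10% defect probability
p_at_most_1 = binom.cdf(1, n, p)     # P(X <= 1) ~ 0.736, as in Table 6.2
p_at_least_2 = 1 - p_at_most_1       # P(X >= 2) ~ 0.264
print(round(p_at_most_1, 3), round(p_at_least_2, 3))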
1. Clinical Trials:
o Assessing Treatment Success: The distribution can be used to model
the number of patients who respond positively to a treatment out of a
fixed sample size.
o Adverse Events Analysis: Estimating the probability of patients
experiencing side effects.
The Poisson distribution is a discrete probability distribution that models the number
of events occurring within a fixed interval of time or space, given certain conditions.
For a random variable X to follow a Poisson distribution, the following conditions
must be met:
1. Independence:
The events must be independent of each other. The occurrence of one event does not affect the probability of another event occurring.
2. Fixed Interval:
The events are counted within a fixed interval of time, space, or other dimensions.
This interval is typically denoted as a time period, area, or volume in which the
events are expected to happen.
3. Constant Average Rate:
The average number of events that occur in the fixed interval is constant. This average rate is denoted by λ, which represents the mean rate of occurrence of events. It is assumed to be the same throughout the interval.
4. No Simultaneous Events:
The events cannot occur simultaneously; that is, no more than one event can happen at an exact point in time or location.
The mean (μ) of a Poisson distribution is equal to its rate parameter λ:
μ = λ
The standard deviation (σ) of a Poisson distribution is the square root of its mean:
σ = √λ
Both the mean and the standard deviation of a Poisson distribution are determined by the rate λ: the standard deviation is the square root of the mean. Consequently, both the mean and the variability (SD) of the distribution increase as the average rate of events increases.
Example: Calculate the mean and standard deviation for the number of severe
adverse events in a sample of 100 patients. Based on preclinical studies or
historical data, it is known that, on average, 2 patients out of every 100 experience
a severe adverse event within a certain time frame.
Solution: From the problem statement, the average number of severe adverse
events per 100 patients is 2. Therefore, for a sample of 100 patients, we set λ=2.
The mean of the Poisson distribution is equal to λ. So, the mean number of severe
adverse events in the sample of 100 patients is:
μ=λ=2
This means that, on average, 2 patients in the trial will experience a severe adverse
event.
The standard deviation of a Poisson distribution is given by the square root of the rate parameter λ:
σ = √λ = √2 ≈ 1.414
Example: A pharmaceutical company is conducting a clinical trial for a new drug. Based on previous studies and reports, the company knows that, on average, 3 patients per 1000 experience a particular severe adverse event while using the drug.
(a): Estimate the number of expected severe adverse events in a sample of 5000
patients.
From previous data, the company knows that 3 patients out of 1000 experience
severe adverse events. So, for a sample of 5000 patients, the expected number of
severe adverse events can be calculated as:
λ = (3/1000) × 5000 = 15
Thus, for 5000 patients, the rate parameter λ=15. This means, on average, we expect
15 severe adverse events in this sample of 5000 patients.
(b): Calculate the probability of observing exactly 10 severe adverse events.
The Poisson probability mass function (PMF) is used to calculate the probability of observing exactly 10 severe adverse events. The Poisson PMF is given by the formula:
P(k) = (λ^k · e^(−λ)) / k!
Where:
P(k) is the probability of observing exactly k events (in this case, k=10),
λ is the expected number of events (in this case, λ=15),
e is Euler's number (approximately 2.718),
k! is the factorial of k.
P(10) = (15^10 · e^(−15)) / 10!
15^10 = 576,650,390,625
e^(−15) = 3.059 × 10^(−7)
10! = 10×9×8×7×6×5×4×3×2×1 = 3,628,800
P(10) = (576,650,390,625 × 3.059 × 10^(−7)) / 3,628,800 ≈ 0.0486
The probability of observing exactly 10 severe adverse events in the sample of 5000 patients is approximately 0.0486, or 4.86%.
The remaining probabilities are computed in the same way:
P(9) = (15^9 · e^(−15)) / 9! = (38,443,359,375 × 3.059 × 10^(−7)) / 362,880 ≈ 0.0324
P(8) = (15^8 · e^(−15)) / 8! = (2,562,890,625 × 3.059 × 10^(−7)) / 40,320 ≈ 0.0194
P(7) = (15^7 · e^(−15)) / 7! = (170,859,375 × 3.059 × 10^(−7)) / 5,040 ≈ 0.0104
P(6) = (15^6 · e^(−15)) / 6! = (11,390,625 × 3.059 × 10^(−7)) / 720 ≈ 0.00484
P(5) = (15^5 · e^(−15)) / 5! = (759,375 × 3.059 × 10^(−7)) / 120 ≈ 0.00194
P(4) = (15^4 · e^(−15)) / 4! = (50,625 × 3.059 × 10^(−7)) / 24 ≈ 0.000645
P(3) = (15^3 · e^(−15)) / 3! = (3,375 × 3.059 × 10^(−7)) / 6 ≈ 0.000172
P(2) = (15^2 · e^(−15)) / 2! = (225 × 3.059 × 10^(−7)) / 2 ≈ 0.0000344
P(1) = (15^1 · e^(−15)) / 1! = 15 × 3.059 × 10^(−7) ≈ 0.0000046
P(0) = (15^0 · e^(−15)) / 0! = 3.059 × 10^(−7) ≈ 0.0000003
(c) Calculate the probability of observing more than 10 severe adverse events.
To find the probability of observing more than 10 adverse events, we need the cumulative probability for all values greater than 10. This is equivalent to:
P(X > 10) = 1 − P(X ≤ 10) = 1 − [P(0) + P(1) + … + P(10)]
Summing the probabilities computed above gives P(X ≤ 10) ≈ 0.1185, so
P(X > 10) = 1 − 0.1185 = 0.8815.
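The whole example can be verified with scipy's Poisson functions (a sketch; pmf gives P(X = k), cdf gives P(X ≤ k), and sf gives the "survival" probability P(X > k)):

from scipy.stats import poisson

lam = 15                                # expected adverse events in 5000 patients
print(round(poisson.pmf(10, lam), 4))   # P(X = 10)  ~ 0.0486
print(round(poisson.cdf(10, lam), 4))   # P(X <= 10) ~ 0.1185
print(round(poisson.sf(10, lam), 4))    # P(X > 10)  ~ 0.8815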
Example: If the average number of adverse reactions reported at a hospital per day is 3 (i.e., λ = 3), construct the Poisson distribution curve for the probability of 0, 1, 2, 3, … adverse reactions being reported at the hospital on the next day.
For x = 0: P(0) = (3^0 · e^(−3)) / 0! = 0.05
For x = 1: P(1) = (3^1 · e^(−3)) / 1! = 0.15
For x = 2: P(2) = (3^2 · e^(−3)) / 2! = 0.225
For x = 3: P(3) = (3^3 · e^(−3)) / 3! = 0.225
For x = 4: P(4) = (3^4 · e^(−3)) / 4! = 0.169
For x = 5: P(5) = (3^5 · e^(−3)) / 5! = 0.101
[Figure: Poisson distribution curve for λ = 3 — probability (y-axis) against number of adverse reactions, 0 to 5 (x-axis).]
Table 6.1: Standard normal cumulative probabilities, P(Z ≤ z)
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
-3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
-3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
-3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
-3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
-2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
-2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
-2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
-2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
-2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
-2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
-2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
-2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
-2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
-2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
-1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
-1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
-1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
-1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
-1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
-1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
-1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
-0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
-0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
-0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
-0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
-0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
-0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
-0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
-0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
-0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
Table 6.2: Cumulative binomial probabilities
P[X ≤ c] = Σ (x = 0 to c) nCx · p^x · (1 − p)^(n − x)
n c p = 0.05 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 0.95
n=1 0 0.950 0.900 0.800 0.700 0.600 0.500 0.400 0.300 0.200 0.100 0.050
1 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n=2 0 0.903 0.810 0.640 0.490 0.360 0.250 0.160 0.090 0.040 0.010 0.003
1 0.998 0.990 0.960 0.910 0.840 0.750 0.640 0.510 0.360 0.190 0.098
2 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n=3 0 0.857 0.729 0.512 0.343 0.216 0.125 0.064 0.027 0.008 0.001 0.000
1 0.993 0.972 0.896 0.784 0.648 0.500 0.352 0.216 0.104 0.028 0.007
2 1.000 0.999 0.992 0.973 0.936 0.875 0.784 0.657 0.488 0.271 0.143
3 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n=4 0 0.815 0.656 0.410 0.240 0.130 0.063 0.026 0.008 0.002 0.000 0.000
1 0.986 0.948 0.819 0.652 0.475 0.313 0.179 0.084 0.027 0.004 0.000
2 1.000 0.996 0.973 0.916 0.821 0.688 0.525 0.348 0.181 0.052 0.014
3 1.000 1.000 0.998 0.992 0.974 0.938 0.870 0.760 0.590 0.344 0.185
4 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n=5 0 0.774 0.590 0.328 0.168 0.078 0.031 0.010 0.002 0.000 0.000 0.000
1 0.977 0.919 0.737 0.528 0.337 0.188 0.087 0.031 0.007 0.000 0.000
2 0.999 0.991 0.942 0.837 0.683 0.500 0.317 0.163 0.058 0.009 0.001
3 1.000 1.000 0.993 0.969 0.913 0.813 0.663 0.472 0.263 0.081 0.023
4 1.000 1.000 1.000 0.998 0.990 0.969 0.922 0.832 0.672 0.410 0.226
5 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n=6 0 0.735 0.531 0.262 0.118 0.047 0.016 0.004 0.001 0.000 0.000 0.000
1 0.967 0.886 0.655 0.420 0.233 0.109 0.041 0.011 0.002 0.000 0.000
2 0.998 0.984 0.901 0.744 0.544 0.344 0.179 0.070 0.017 0.001 0.000
3 1.000 0.999 0.983 0.930 0.821 0.656 0.456 0.256 0.099 0.016 0.002
4 1.000 1.000 0.998 0.989 0.959 0.891 0.767 0.580 0.345 0.114 0.033
5 1.000 1.000 1.000 0.999 0.996 0.984 0.953 0.882 0.738 0.469 0.265
6 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n=7 0 0.698 0.478 0.210 0.082 0.028 0.008 0.002 0.000 0.000 0.000 0.000
1 0.956 0.850 0.577 0.329 0.159 0.063 0.019 0.004 0.000 0.000 0.000
2 0.996 0.974 0.852 0.647 0.420 0.227 0.096 0.029 0.005 0.000 0.000
3 1.000 0.997 0.967 0.874 0.710 0.500 0.290 0.126 0.033 0.003 0.000
4 1.000 1.000 0.995 0.971 0.904 0.773 0.580 0.353 0.148 0.026 0.004
5 1.000 1.000 1.000 0.996 0.981 0.938 0.841 0.671 0.423 0.150 0.044
6 1.000 1.000 1.000 1.000 0.998 0.992 0.972 0.918 0.790 0.522 0.302
7 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n=8 0 0.663 0.430 0.168 0.058 0.017 0.004 0.001 0.000 0.000 0.000 0.000
1 0.943 0.813 0.503 0.255 0.106 0.035 0.009 0.001 0.000 0.000 0.000
2 0.994 0.962 0.797 0.552 0.315 0.145 0.050 0.011 0.001 0.000 0.000
3 1.000 0.995 0.944 0.806 0.594 0.363 0.174 0.058 0.010 0.000 0.000
4 1.000 1.000 0.990 0.942 0.826 0.637 0.406 0.194 0.056 0.005 0.000
5 1.000 1.000 0.999 0.989 0.950 0.855 0.685 0.448 0.203 0.038 0.006
6 1.000 1.000 1.000 0.999 0.991 0.965 0.894 0.745 0.497 0.187 0.057
7 1.000 1.000 1.000 1.000 0.999 0.996 0.983 0.942 0.832 0.570 0.337
8 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n=9 0 0.630 0.387 0.134 0.040 0.010 0.002 0.000 0.000 0.000 0.000 0.000
1 0.929 0.775 0.436 0.196 0.071 0.020 0.004 0.000 0.000 0.000 0.000
2 0.992 0.947 0.738 0.463 0.232 0.090 0.025 0.004 0.000 0.000 0.000
3 0.999 0.992 0.914 0.730 0.483 0.254 0.099 0.025 0.003 0.000 0.000
4 1.000 0.999 0.980 0.901 0.733 0.500 0.267 0.099 0.020 0.001 0.000
5 1.000 1.000 0.997 0.975 0.901 0.746 0.517 0.270 0.086 0.008 0.001
6 1.000 1.000 1.000 0.996 0.975 0.910 0.768 0.537 0.262 0.053 0.008
7 1.000 1.000 1.000 1.000 0.996 0.980 0.929 0.804 0.564 0.225 0.071
8 1.000 1.000 1.000 1.000 1.000 0.998 0.990 0.960 0.866 0.613 0.370
9 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 10 0 0.599 0.349 0.107 0.028 0.006 0.001 0.000 0.000 0.000 0.000 0.000
1 0.914 0.736 0.376 0.149 0.046 0.011 0.002 0.000 0.000 0.000 0.000
2 0.988 0.930 0.678 0.383 0.167 0.055 0.012 0.002 0.000 0.000 0.000
3 0.999 0.987 0.879 0.650 0.382 0.172 0.055 0.011 0.001 0.000 0.000
4 1.000 0.998 0.967 0.850 0.633 0.377 0.166 0.047 0.006 0.000 0.000
5 1.000 1.000 0.994 0.953 0.834 0.623 0.367 0.150 0.033 0.002 0.000
6 1.000 1.000 0.999 0.989 0.945 0.828 0.618 0.350 0.121 0.013 0.001
7 1.000 1.000 1.000 0.998 0.988 0.945 0.833 0.617 0.322 0.070 0.012
8 1.000 1.000 1.000 1.000 0.998 0.989 0.954 0.851 0.624 0.264 0.086
9 1.000 1.000 1.000 1.000 1.000 0.999 0.994 0.972 0.893 0.651 0.401
10 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 11 0 0.569 0.314 0.086 0.020 0.004 0.000 0.000 0.000 0.000 0.000 0.000
1 0.898 0.697 0.322 0.113 0.030 0.006 0.001 0.000 0.000 0.000 0.000
2 0.985 0.910 0.617 0.313 0.119 0.033 0.006 0.001 0.000 0.000 0.000
3 0.998 0.981 0.839 0.570 0.296 0.113 0.029 0.004 0.000 0.000 0.000
4 1.000 0.997 0.950 0.790 0.533 0.274 0.099 0.022 0.002 0.000 0.000
5 1.000 1.000 0.988 0.922 0.753 0.500 0.247 0.078 0.012 0.000 0.000
6 1.000 1.000 0.998 0.978 0.901 0.726 0.467 0.210 0.050 0.003 0.000
7 1.000 1.000 1.000 0.996 0.971 0.887 0.704 0.430 0.161 0.019 0.002
8 1.000 1.000 1.000 0.999 0.994 0.967 0.881 0.687 0.383 0.090 0.015
9 1.000 1.000 1.000 1.000 0.999 0.994 0.970 0.887 0.678 0.303 0.102
10 1.000 1.000 1.000 1.000 1.000 1.000 0.996 0.980 0.914 0.686 0.431
11 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 12 0 0.540 0.282 0.069 0.014 0.002 0.000 0.000 0.000 0.000 0.000 0.000
1 0.882 0.659 0.275 0.085 0.020 0.003 0.000 0.000 0.000 0.000 0.000
2 0.980 0.889 0.558 0.253 0.083 0.019 0.003 0.000 0.000 0.000 0.000
3 0.998 0.974 0.795 0.493 0.225 0.073 0.015 0.002 0.000 0.000 0.000
4 1.000 0.996 0.927 0.724 0.438 0.194 0.057 0.009 0.001 0.000 0.000
5 1.000 0.999 0.981 0.882 0.665 0.387 0.158 0.039 0.004 0.000 0.000
6 1.000 1.000 0.996 0.961 0.842 0.613 0.335 0.118 0.019 0.001 0.000
7 1.000 1.000 0.999 0.991 0.943 0.806 0.562 0.276 0.073 0.004 0.000
8 1.000 1.000 1.000 0.998 0.985 0.927 0.775 0.507 0.205 0.026 0.002
9 1.000 1.000 1.000 1.000 0.997 0.981 0.917 0.747 0.442 0.111 0.020
10 1.000 1.000 1.000 1.000 1.000 0.997 0.980 0.915 0.725 0.341 0.118
11 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.986 0.931 0.718 0.460
12 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 13 0 0.513 0.254 0.055 0.010 0.001 0.000 0.000 0.000 0.000 0.000 0.000
1 0.865 0.621 0.234 0.064 0.013 0.002 0.000 0.000 0.000 0.000 0.000
2 0.975 0.866 0.502 0.202 0.058 0.011 0.001 0.000 0.000 0.000 0.000
3 0.997 0.966 0.747 0.421 0.169 0.046 0.008 0.001 0.000 0.000 0.000
4 1.000 0.994 0.901 0.654 0.353 0.133 0.032 0.004 0.000 0.000 0.000
5 1.000 0.999 0.970 0.835 0.574 0.291 0.098 0.018 0.001 0.000 0.000
6 1.000 1.000 0.993 0.938 0.771 0.500 0.229 0.062 0.007 0.000 0.000
7 1.000 1.000 0.999 0.982 0.902 0.709 0.426 0.165 0.030 0.001 0.000
8 1.000 1.000 1.000 0.996 0.968 0.867 0.647 0.346 0.099 0.006 0.000
9 1.000 1.000 1.000 0.999 0.992 0.954 0.831 0.579 0.253 0.034 0.003
10 1.000 1.000 1.000 1.000 0.999 0.989 0.942 0.798 0.498 0.134 0.025
11 1.000 1.000 1.000 1.000 1.000 0.998 0.987 0.936 0.766 0.379 0.135
12 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.990 0.945 0.746 0.487
13 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 14 0 0.488 0.229 0.044 0.007 0.001 0.000 0.000 0.000 0.000 0.000 0.000
1 0.847 0.585 0.198 0.047 0.008 0.001 0.000 0.000 0.000 0.000 0.000
2 0.970 0.842 0.448 0.161 0.040 0.006 0.001 0.000 0.000 0.000 0.000
3 0.996 0.956 0.698 0.355 0.124 0.029 0.004 0.000 0.000 0.000 0.000
4 1.000 0.991 0.870 0.584 0.279 0.090 0.018 0.002 0.000 0.000 0.000
5 1.000 0.999 0.956 0.781 0.486 0.212 0.058 0.008 0.000 0.000 0.000
6 1.000 1.000 0.988 0.907 0.692 0.395 0.150 0.031 0.002 0.000 0.000
7 1.000 1.000 0.998 0.969 0.850 0.605 0.308 0.093 0.012 0.000 0.000
8 1.000 1.000 1.000 0.992 0.942 0.788 0.514 0.219 0.044 0.001 0.000
9 1.000 1.000 1.000 0.998 0.982 0.910 0.721 0.416 0.130 0.009 0.000
10 1.000 1.000 1.000 1.000 0.996 0.971 0.876 0.645 0.302 0.044 0.004
11 1.000 1.000 1.000 1.000 0.999 0.994 0.960 0.839 0.552 0.158 0.030
12 1.000 1.000 1.000 1.000 1.000 0.999 0.992 0.953 0.802 0.415 0.153
13 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.993 0.956 0.771 0.512
14 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 15 0 0.463 0.206 0.035 0.005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.829 0.549 0.167 0.035 0.005 0.000 0.000 0.000 0.000 0.000 0.000
2 0.964 0.816 0.398 0.127 0.027 0.004 0.000 0.000 0.000 0.000 0.000
3 0.995 0.944 0.648 0.297 0.091 0.018 0.002 0.000 0.000 0.000 0.000
4 0.999 0.987 0.836 0.515 0.217 0.059 0.009 0.001 0.000 0.000 0.000
5 1.000 0.998 0.939 0.722 0.403 0.151 0.034 0.004 0.000 0.000 0.000
6 1.000 1.000 0.982 0.869 0.610 0.304 0.095 0.015 0.001 0.000 0.000
7 1.000 1.000 0.996 0.950 0.787 0.500 0.213 0.050 0.004 0.000 0.000
8 1.000 1.000 0.999 0.985 0.905 0.696 0.390 0.131 0.018 0.000 0.000
9 1.000 1.000 1.000 0.996 0.966 0.849 0.597 0.278 0.061 0.002 0.000
10 1.000 1.000 1.000 0.999 0.991 0.941 0.783 0.485 0.164 0.013 0.001
11 1.000 1.000 1.000 1.000 0.998 0.982 0.909 0.703 0.352 0.056 0.005
12 1.000 1.000 1.000 1.000 1.000 0.996 0.973 0.873 0.602 0.184 0.036
13 1.000 1.000 1.000 1.000 1.000 1.000 0.995 0.965 0.833 0.451 0.171
14 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.995 0.965 0.794 0.537
15 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 16 0 0.440 0.185 0.028 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.811 0.515 0.141 0.026 0.003 0.000 0.000 0.000 0.000 0.000 0.000
2 0.957 0.789 0.352 0.099 0.018 0.002 0.000 0.000 0.000 0.000 0.000
3 0.993 0.932 0.598 0.246 0.065 0.011 0.001 0.000 0.000 0.000 0.000
4 0.999 0.983 0.798 0.450 0.167 0.038 0.005 0.000 0.000 0.000 0.000
5 1.000 0.997 0.918 0.660 0.329 0.105 0.019 0.002 0.000 0.000 0.000
6 1.000 0.999 0.973 0.825 0.527 0.227 0.058 0.007 0.000 0.000 0.000
7 1.000 1.000 0.993 0.926 0.716 0.402 0.142 0.026 0.001 0.000 0.000
8 1.000 1.000 0.999 0.974 0.858 0.598 0.284 0.074 0.007 0.000 0.000
9 1.000 1.000 1.000 0.993 0.942 0.773 0.473 0.175 0.027 0.001 0.000
10 1.000 1.000 1.000 0.998 0.981 0.895 0.671 0.340 0.082 0.003 0.000
11 1.000 1.000 1.000 1.000 0.995 0.962 0.833 0.550 0.202 0.017 0.001
12 1.000 1.000 1.000 1.000 0.999 0.989 0.935 0.754 0.402 0.068 0.007
13 1.000 1.000 1.000 1.000 1.000 0.998 0.982 0.901 0.648 0.211 0.043
14 1.000 1.000 1.000 1.000 1.000 1.000 0.997 0.974 0.859 0.485 0.189
15 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.997 0.972 0.815 0.560
16 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 17 0 0.418 0.167 0.023 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.792 0.482 0.118 0.019 0.002 0.000 0.000 0.000 0.000 0.000 0.000
2 0.950 0.762 0.310 0.077 0.012 0.001 0.000 0.000 0.000 0.000 0.000
3 0.991 0.917 0.549 0.202 0.046 0.006 0.000 0.000 0.000 0.000 0.000
4 0.999 0.978 0.758 0.389 0.126 0.025 0.003 0.000 0.000 0.000 0.000
5 1.000 0.995 0.894 0.597 0.264 0.072 0.011 0.001 0.000 0.000 0.000
6 1.000 0.999 0.962 0.775 0.448 0.166 0.035 0.003 0.000 0.000 0.000
7 1.000 1.000 0.989 0.895 0.641 0.315 0.092 0.013 0.000 0.000 0.000
8 1.000 1.000 0.997 0.960 0.801 0.500 0.199 0.040 0.003 0.000 0.000
9 1.000 1.000 1.000 0.987 0.908 0.685 0.359 0.105 0.011 0.000 0.000
10 1.000 1.000 1.000 0.997 0.965 0.834 0.552 0.225 0.038 0.001 0.000
11 1.000 1.000 1.000 0.999 0.989 0.928 0.736 0.403 0.106 0.005 0.000
12 1.000 1.000 1.000 1.000 0.997 0.975 0.874 0.611 0.242 0.022 0.001
13 1.000 1.000 1.000 1.000 1.000 0.994 0.954 0.798 0.451 0.083 0.009
14 1.000 1.000 1.000 1.000 1.000 0.999 0.988 0.923 0.690 0.238 0.050
15 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.981 0.882 0.518 0.208
16 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.977 0.833 0.582
17 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 18 0 0.397 0.150 0.018 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.774 0.450 0.099 0.014 0.001 0.000 0.000 0.000 0.000 0.000 0.000
2 0.942 0.734 0.271 0.060 0.008 0.001 0.000 0.000 0.000 0.000 0.000
3 0.989 0.902 0.501 0.165 0.033 0.004 0.000 0.000 0.000 0.000 0.000
4 0.998 0.972 0.716 0.333 0.094 0.015 0.001 0.000 0.000 0.000 0.000
5 1.000 0.994 0.867 0.534 0.209 0.048 0.006 0.000 0.000 0.000 0.000
6 1.000 0.999 0.949 0.722 0.374 0.119 0.020 0.001 0.000 0.000 0.000
7 1.000 1.000 0.984 0.859 0.563 0.240 0.058 0.006 0.000 0.000 0.000
8 1.000 1.000 0.996 0.940 0.737 0.407 0.135 0.021 0.001 0.000 0.000
9 1.000 1.000 0.999 0.979 0.865 0.593 0.263 0.060 0.004 0.000 0.000
10 1.000 1.000 1.000 0.994 0.942 0.760 0.437 0.141 0.016 0.000 0.000
11 1.000 1.000 1.000 0.999 0.980 0.881 0.626 0.278 0.051 0.001 0.000
12 1.000 1.000 1.000 1.000 0.994 0.952 0.791 0.466 0.133 0.006 0.000
13 1.000 1.000 1.000 1.000 0.999 0.985 0.906 0.667 0.284 0.028 0.002
14 1.000 1.000 1.000 1.000 1.000 0.996 0.967 0.835 0.499 0.098 0.011
15 1.000 1.000 1.000 1.000 1.000 0.999 0.992 0.940 0.729 0.266 0.058
16 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.986 0.901 0.550 0.226
17 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.982 0.850 0.603
18 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 19 0 0.377 0.135 0.014 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.755 0.420 0.083 0.010 0.001 0.000 0.000 0.000 0.000 0.000 0.000
2 0.933 0.705 0.237 0.046 0.005 0.000 0.000 0.000 0.000 0.000 0.000
3 0.987 0.885 0.455 0.133 0.023 0.002 0.000 0.000 0.000 0.000 0.000
4 0.998 0.965 0.673 0.282 0.070 0.010 0.001 0.000 0.000 0.000 0.000
5 1.000 0.991 0.837 0.474 0.163 0.032 0.003 0.000 0.000 0.000 0.000
6 1.000 0.998 0.932 0.666 0.308 0.084 0.012 0.001 0.000 0.000 0.000
7 1.000 1.000 0.977 0.818 0.488 0.180 0.035 0.003 0.000 0.000 0.000
8 1.000 1.000 0.993 0.916 0.667 0.324 0.088 0.011 0.000 0.000 0.000
9 1.000 1.000 0.998 0.967 0.814 0.500 0.186 0.033 0.002 0.000 0.000
10 1.000 1.000 1.000 0.989 0.912 0.676 0.333 0.084 0.007 0.000 0.000
11 1.000 1.000 1.000 0.997 0.965 0.820 0.512 0.182 0.023 0.000 0.000
12 1.000 1.000 1.000 0.999 0.988 0.916 0.692 0.334 0.068 0.002 0.000
13 1.000 1.000 1.000 1.000 0.997 0.968 0.837 0.526 0.163 0.009 0.000
14 1.000 1.000 1.000 1.000 0.999 0.990 0.930 0.718 0.327 0.035 0.002
15 1.000 1.000 1.000 1.000 1.000 0.998 0.977 0.867 0.545 0.115 0.013
16 1.000 1.000 1.000 1.000 1.000 1.000 0.995 0.954 0.763 0.295 0.067
17 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.990 0.917 0.580 0.245
18 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.986 0.865 0.623
19 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 20 0 0.358 0.122 0.012 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.736 0.392 0.069 0.008 0.001 0.000 0.000 0.000 0.000 0.000 0.000
2 0.925 0.677 0.206 0.035 0.004 0.000 0.000 0.000 0.000 0.000 0.000
3 0.984 0.867 0.411 0.107 0.016 0.001 0.000 0.000 0.000 0.000 0.000
4 0.997 0.957 0.630 0.238 0.051 0.006 0.000 0.000 0.000 0.000 0.000
5 1.000 0.989 0.804 0.416 0.126 0.021 0.002 0.000 0.000 0.000 0.000
6 1.000 0.998 0.913 0.608 0.250 0.058 0.006 0.000 0.000 0.000 0.000
7 1.000 1.000 0.968 0.772 0.416 0.132 0.021 0.001 0.000 0.000 0.000
8 1.000 1.000 0.990 0.887 0.596 0.252 0.057 0.005 0.000 0.000 0.000
9 1.000 1.000 0.997 0.952 0.755 0.412 0.128 0.017 0.001 0.000 0.000
10 1.000 1.000 0.999 0.983 0.872 0.588 0.245 0.048 0.003 0.000 0.000
11 1.000 1.000 1.000 0.995 0.943 0.748 0.404 0.113 0.010 0.000 0.000
12 1.000 1.000 1.000 0.999 0.979 0.868 0.584 0.228 0.032 0.000 0.000
13 1.000 1.000 1.000 1.000 0.994 0.942 0.750 0.392 0.087 0.002 0.000
14 1.000 1.000 1.000 1.000 0.998 0.979 0.874 0.584 0.196 0.011 0.000
15 1.000 1.000 1.000 1.000 1.000 0.994 0.949 0.762 0.370 0.043 0.003
16 1.000 1.000 1.000 1.000 1.000 0.999 0.984 0.893 0.589 0.133 0.016
17 1.000 1.000 1.000 1.000 1.000 1.000 0.996 0.965 0.794 0.323 0.075
18 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.992 0.931 0.608 0.264
19 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.988 0.878 0.642
20 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
n = 25 0 0.277 0.072 0.004 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.642 0.271 0.027 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000
2 0.873 0.537 0.098 0.009 0.000 0.000 0.000 0.000 0.000 0.000 0.000
3 0.966 0.764 0.234 0.033 0.002 0.000 0.000 0.000 0.000 0.000 0.000
4 0.993 0.902 0.421 0.090 0.009 0.000 0.000 0.000 0.000 0.000 0.000
5 0.999 0.967 0.617 0.193 0.029 0.002 0.000 0.000 0.000 0.000 0.000
6 1.000 0.991 0.780 0.341 0.074 0.007 0.000 0.000 0.000 0.000 0.000
7 1.000 0.998 0.891 0.512 0.154 0.022 0.001 0.000 0.000 0.000 0.000
8 1.000 1.000 0.953 0.677 0.274 0.054 0.004 0.000 0.000 0.000 0.000
9 1.000 1.000 0.983 0.811 0.425 0.115 0.013 0.000 0.000 0.000 0.000
10 1.000 1.000 0.994 0.902 0.586 0.212 0.034 0.002 0.000 0.000 0.000
11 1.000 1.000 0.998 0.956 0.732 0.345 0.078 0.006 0.000 0.000 0.000
12 1.000 1.000 1.000 0.983 0.846 0.500 0.154 0.017 0.000 0.000 0.000
13 1.000 1.000 1.000 0.994 0.922 0.655 0.268 0.044 0.002 0.000 0.000
14 1.000 1.000 1.000 0.998 0.966 0.788 0.414 0.098 0.006 0.000 0.000
15 1.000 1.000 1.000 1.000 0.987 0.885 0.575 0.189 0.017 0.000 0.000
16 1.000 1.000 1.000 1.000 0.996 0.946 0.726 0.323 0.047 0.000 0.000
17 1.000 1.000 1.000 1.000 0.999 0.978 0.846 0.488 0.109 0.002 0.000
18 1.000 1.000 1.000 1.000 1.000 0.993 0.926 0.659 0.220 0.009 0.000
19 1.000 1.000 1.000 1.000 1.000 0.998 0.971 0.807 0.383 0.033 0.001
20 1.000 1.000 1.000 1.000 1.000 1.000 0.991 0.910 0.579 0.098 0.007
21 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.967 0.766 0.236 0.034
22 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.991 0.902 0.463 0.127
23 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.973 0.729 0.358
24 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.996 0.928 0.723
25 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
The table below gives the probability that a Poisson random variable X with mean λ is less than or equal to x. That is, the table gives
P(X ≤ x) = Σ (r = 0 to x) λ^r · e^(−λ) / r!
λ= 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.2 1.4 1.6 1.8
x= 0 0.9048 0.8187 0.7408 0.6703 0.6065 0.5488 0.4966 0.4493 0.4066 0.3679 0.3012 0.2466 0.2019 0.1653
1 0.9953 0.9825 0.9631 0.9384 0.9098 0.8781 0.8442 0.8088 0.7725 0.7358 0.6626 0.5918 0.5249 0.4628
2 0.9998 0.9989 0.9964 0.9921 0.9856 0.9769 0.9659 0.9526 0.9371 0.9197 0.8795 0.8335 0.7834 0.7306
3 1.0000 0.9999 0.9997 0.9992 0.9982 0.9966 0.9942 0.9909 0.9865 0.9810 0.9662 0.9463 0.9212 0.8913
4 1.0000 1.0000 1.0000 0.9999 0.9998 0.9996 0.9992 0.9986 0.9977 0.9963 0.9923 0.9857 0.9763 0.9636
5 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9998 0.9997 0.9994 0.9985 0.9968 0.9940 0.9896
6 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9997 0.9994 0.9987 0.9974
7 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9997 0.9994
8 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999
9 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
λ= 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.5 5.0 5.5
x=0 0.1353 0.1108 0.0907 0.0743 0.0608 0.0498 0.0408 0.0334 0.0273 0.0224 0.0183 0.0111 0.0067 0.0041
1 0.4060 0.3546 0.3084 0.2674 0.2311 0.1991 0.1712 0.1468 0.1257 0.1074 0.0916 0.0611 0.0404 0.0266
2 0.6767 0.6227 0.5697 0.5184 0.4695 0.4232 0.3799 0.3397 0.3027 0.2689 0.2381 0.1736 0.1247 0.0884
3 0.8571 0.8194 0.7787 0.7360 0.6919 0.6472 0.6025 0.5584 0.5152 0.4735 0.4335 0.3423 0.2650 0.2017
4 0.9473 0.9275 0.9041 0.8774 0.8477 0.8153 0.7806 0.7442 0.7064 0.6678 0.6288 0.5321 0.4405 0.3575
5 0.9834 0.9751 0.9643 0.9510 0.9349 0.9161 0.8946 0.8705 0.8441 0.8156 0.7851 0.7029 0.6160 0.5289
6 0.9955 0.9925 0.9884 0.9828 0.9756 0.9665 0.9554 0.9421 0.9267 0.9091 0.8893 0.8311 0.7622 0.6860
7 0.9989 0.9980 0.9967 0.9947 0.9919 0.9881 0.9832 0.9769 0.9692 0.9599 0.9489 0.9134 0.8666 0.8095
8 0.9998 0.9995 0.9991 0.9985 0.9976 0.9962 0.9943 0.9917 0.9883 0.9840 0.9786 0.9597 0.9319 0.8944
9 1.0000 0.9999 0.9998 0.9996 0.9993 0.9989 0.9982 0.9973 0.9960 0.9942 0.9919 0.9829 0.9682 0.9462
10 1.0000 1.0000 1.0000 0.9999 0.9998 0.9997 0.9995 0.9992 0.9987 0.9981 0.9972 0.9933 0.9863 0.9747
11 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9999 0.9998 0.9996 0.9994 0.9991 0.9976 0.9945 0.9890
12 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9999 0.9998 0.9997 0.9992 0.9980 0.9955
13 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9997 0.9993 0.9983
14 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9998 0.9994
15 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9998
16 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999
17 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
λ= 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 11.0 12.0 14.0 15.0
x=0 0.0025 0.0015 0.0009 0.0006 0.0003 0.0002 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
1 0.0174 0.0113 0.0073 0.0047 0.0030 0.0019 0.0012 0.0008 0.0005 0.0002 0.0001 0.0000 0.0000
2 0.0620 0.0430 0.0296 0.0203 0.0138 0.0093 0.0062 0.0042 0.0028 0.0012 0.0005 0.0001 0.0000
3 0.1512 0.1118 0.0818 0.0591 0.0424 0.0301 0.0212 0.0149 0.0103 0.0049 0.0023 0.0005 0.0002
4 0.2851 0.2237 0.1730 0.1321 0.0996 0.0744 0.0550 0.0403 0.0293 0.0151 0.0076 0.0018 0.0009
5 0.4457 0.3690 0.3007 0.2414 0.1912 0.1496 0.1157 0.0885 0.0671 0.0375 0.0203 0.0055 0.0028
6 0.6063 0.5265 0.4497 0.3782 0.3134 0.2562 0.2068 0.1649 0.1301 0.0786 0.0458 0.0142 0.0076
7 0.7440 0.6728 0.5987 0.5246 0.4530 0.3856 0.3239 0.2687 0.2202 0.1432 0.0895 0.0316 0.0180
8 0.8472 0.7916 0.7291 0.6620 0.5925 0.5231 0.4557 0.3918 0.3328 0.2320 0.1550 0.0621 0.0374
9 0.9161 0.8774 0.8305 0.7764 0.7166 0.6530 0.5874 0.5218 0.4579 0.3405 0.2424 0.1094 0.0699
10 0.9574 0.9332 0.9015 0.8622 0.8159 0.7634 0.7060 0.6453 0.5830 0.4599 0.3472 0.1757 0.1185
11 0.9799 0.9661 0.9467 0.9208 0.8881 0.8487 0.8030 0.7520 0.6968 0.5793 0.4616 0.2600 0.1848
12 0.9912 0.9840 0.9730 0.9573 0.9362 0.9091 0.8758 0.8364 0.7916 0.6887 0.5760 0.3585 0.2676
13 0.9964 0.9929 0.9872 0.9784 0.9658 0.9486 0.9261 0.8981 0.8645 0.7813 0.6815 0.4644 0.3632
14 0.9986 0.9970 0.9943 0.9897 0.9827 0.9726 0.9585 0.9400 0.9165 0.8540 0.7720 0.5704 0.4657
15 0.9995 0.9988 0.9976 0.9954 0.9918 0.9862 0.9780 0.9665 0.9513 0.9074 0.8444 0.6694 0.5681
16 0.9998 0.9996 0.9990 0.9980 0.9963 0.9934 0.9889 0.9823 0.9730 0.9441 0.8987 0.7559 0.6641
17 0.9999 0.9998 0.9996 0.9992 0.9984 0.9970 0.9947 0.9911 0.9857 0.9678 0.9370 0.8272 0.7489
18 1.0000 0.9999 0.9999 0.9997 0.9993 0.9987 0.9976 0.9957 0.9928 0.9823 0.9626 0.8826 0.8195
19 1.0000 1.0000 1.0000 0.9999 0.9997 0.9995 0.9989 0.9980 0.9965 0.9907 0.9787 0.9235 0.8752
20 1.0000 1.0000 1.0000 1.0000 0.9999 0.9998 0.9996 0.9991 0.9984 0.9953 0.9884 0.9521 0.9170
21 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9998 0.9996 0.9993 0.9977 0.9939 0.9712 0.9469
22 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9999 0.9997 0.9990 0.9970 0.9833 0.9673
23 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9999 0.9995 0.9985 0.9907 0.9805
24 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9998 0.9993 0.9950 0.9888
25 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9997 0.9974 0.9938
26 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9987 0.9967
27 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9994 0.9983
28 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9997 0.9991
29 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9996
30 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 0.9998
31 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9999
32 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
7. Sampling
A sample is meant to represent the population. The idea is that by studying a
properly selected sample, researchers can make valid inferences about the larger
population. Statistical methods are used to calculate how well the sample represents
the population and to estimate the margin of error and confidence levels.
The population refers to the entire set of individuals, items, or data points that are of
interest in a particular study. A population includes every possible member or unit
that fits the criteria being studied. Suppose a pharmaceutical company develops a new drug intended to treat diabetes in adults aged 60-70 years. The population would include all adults with diabetes in that age range who are eligible for the drug.
Sampling refers to the process of selecting a subset from a larger population or batch
in order to make inferences about the entire population, batch, or process. Effective
sampling is a cornerstone of pharmaceutical research, as it ensures that results are
reliable, reproducible, and meaningful. In quality control, a sample could refer to a
subset of 6 tablets selected from a batch of 100,000 tablets for testing disintegration
time.
The size of the sample is an essential factor in determining the reliability and validity of conclusions drawn from the data. A large sample size refers to a study that involves a large number of units from the population being studied; in most studies, "large" means hundreds to thousands of units, depending on the nature of the population. In a clinical trial evaluating a new drug, a large sample size is chosen to ensure the study has enough power to detect any differences between the drug and a placebo. A small sample size refers to studies that involve fewer units, typically ranging from fewer than 30 to a few hundred, depending on the type of study. Small sample sizes are common in preformulation studies, early-phase clinical trials, pilot studies, or initial research where the main goal is to obtain preliminary data.
Both large and small sample sizes have their place in pharmaceutical research. Large
sample sizes are crucial for ensuring robust, reliable, and generalizable results,
especially in pivotal clinical trials, while small sample sizes are typically used in
early-stage studies for safety, tolerability, or feasibility assessments. An appropriate
sample size is selected based on the research objectives, study phase, budget, and
ethical considerations.
Various probability and non-probability sampling techniques are summarized in the following flow chart.
[Flow chart: Sampling techniques — probability and non-probability methods.]
Example: A pharmaceutical company wants to test a new medication for diabetes. The
population for the study is a group of patients with diagnosed diabetes, aged 40-60.
Using SRS, the researchers would randomly select a group of participants from a list
of all patients who meet the exclusion and inclusion criteria. Patients could be
randomly chosen by assigning numbers to each eligible patient and selecting a sample
using random number generators or a lottery system.
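As an illustration, simple random selection is easy to script; the sketch below uses Python's standard library and a hypothetical list of eligible patients (the list name and size are illustrative, not taken from the example):

import random

eligible = [f"patient_{i}" for i in range(1, 501)]   # hypothetical sampling frame
random.seed(42)                                      # fixed seed, for a reproducible illustration
sample = random.sample(eligible, k=50)               # each patient equally likely to be chosen
print(sample[:5])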
7.4.1.2 Systematic Sampling: In systematic sampling, the sampling interval k is calculated as
k = N / n
Where N is the population size, n is the desired sample size, and k is the sampling interval.
Select a random number between 1 and k. This will be the starting point in the
population list. If the random number generated is r, then start by selecting the
element in the r-th position.
After selecting the first individual (starting point), every k-th individual is selected.
The subsequent elements are:
r, r+k,r+2k,r+3k,…
Select a random number between 1 and 10. After randomly selecting a starting point
(e.g., patient #5 on the list), the researchers would select every 10th patient on the list
to be included in the study (e.g., #5, #15, #25, etc.).
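A short Python sketch of this systematic selection (the values N = 100 and n = 10 follow the example above):

import random

N, n = 100, 10                              # population size and desired sample size
k = N // n                                  # sampling interval k = 10
r = random.randint(1, k)                    # random starting point between 1 and k
positions = [r + i * k for i in range(n)]   # r, r+k, r+2k, ...
print(positions)                            # e.g. [5, 15, 25, ..., 95] when r = 5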
7.4.1.3 Stratified Sampling: Stratified sampling divides the population into subgroups (strata) based on shared characteristics and draws a sample from each stratum, so that all subgroups are properly represented in the final sample. Stratified sampling improves precision by ensuring that all relevant subgroups are represented in the sample, which is especially important when those subgroups may respond differently to treatment. This method requires knowledge of the strata and may involve additional planning and coordination to implement.
7.4.1.4 Cluster Sampling: Cluster Sampling involves dividing the population into
clusters (typically based on geographic location or other groupings), and then
randomly selecting a subset of clusters. Within those selected clusters, all individuals
or a random sample of individuals are used for further study. This method is cost-
effective for large-scale studies, particularly when the population is dispersed across a
large area. It can result in less precise estimates, especially if the selected clusters
differ significantly from each other.
Example: Consider a study of cancer patients treated at multiple hospitals across the country. The researchers first select a random sample of hospitals (clusters). Then, within each hospital, they randomly select a subset of cancer patients to participate in the study.
Example: In a drug development study for rare diseases, researchers might only
recruit patients who have been diagnosed with that rare disease, as these are the most
relevant to the study.
7.4.2.3: Quota Sampling: Quota sampling involves dividing the population into subgroups (or quotas) based on certain characteristics (e.g., age, gender, disease stage). Researchers then sample participants from these quotas non-randomly until the desired number of participants for each subgroup is reached. The advantages include that it ensures specific subgroups are represented in the sample, that it is faster than probability sampling methods when quotas are clearly defined, and that it allows researchers to ensure certain characteristics (e.g., age, gender) are proportionally represented. The limitations are that selection within each subgroup is not random, so the results may be biased and not generalizable, and some subgroups may still be under-represented or overlooked.
7.4.2.4: Snowball Sampling: Snowball sampling is used when the population is hard
to reach or lacks a clear listing. It begins with a small number of participants who
meet the study criteria, and then those participants refer others who meet the criteria.
This referral process "snowballs," with each new participant providing contacts for
additional participants. This technique is useful for studying populations that are
difficult to access or identify, such as people with rare conditions or marginalized
groups. It may help find hard-to-reach participants who might not otherwise be
included in traditional studies. The sample may become homogenous, as participants
tend to refer others with similar characteristics. It may not be representative of the
general population, leading to poor generalizability. The process can perpetuate
existing social networks or groups, which may not be ideal for diverse perspectives.
Example: In researching a rare condition, researchers may start with a few patients
who have been diagnosed with the disease and then ask them to refer other patients
they know who also have the condition. This is especially useful for researching rare
diseases or drug side effects that may not be easily detectable in the general
population.
willingness. It is quick and cost-effective and may result in high participation rates if the study is marketed well or provides incentives. However, this technique has limited generalizability to the broader population due to self-selection bias.
o Final Product Testing: The final drug product is sampled to check its
stability, dosage uniformity, and other critical parameters, ensuring the
drug meets regulatory standards.
o Stability Studies: Samples from different batches are stored under
various conditions to assess stability of the drug over time, guiding
expiry date labelling.
2. Clinical Trial Sampling:
o Patient Recruitment: Ensures that a diverse set of patients are included,
which is important for understanding how the drug will work across
different populations.
o Blinding and Randomization: Random sampling is crucial in ensuring
that blinding (masking treatment allocation) and randomization is
carried out to avoid bias.
o Placebo-Controlled Trials: Samples are often divided into experimental
(drug) and control (placebo) groups for comparison.
3. Stability and Shelf-Life Testing:
o Sampling is performed at different time points (e.g., 1 month, 6
months, 12 months) to ensure the drug retains its potency and is safe
for consumption throughout its shelf life.
4. Regulatory Compliance:
o Sampling in pharmaceutical research is highly regulated by regulatory agencies. Sampling plans must often adhere to Good Manufacturing Practices (GMP) and Good Clinical Practices (GCP) to ensure the sampling methods are scientifically sound and reproducible.
Alpha (α): The significance level typically set at 0.05 for a 95% confidence level. It
represents the probability of committing a Type I error (i.e., rejecting the null
hypothesis when it is true).
Beta (β): The probability of committing a Type II error (i.e., failing to reject the null
hypothesis when it is false). The power of the study is calculated as 1−β, and
researchers typically aim for a power of 80% or 90%.
Effect Size (d or δ): The effect size is the magnitude of the difference one expects to observe between the groups. A larger effect size requires a smaller sample to detect, whereas a smaller effect size requires a larger sample.
Standard Deviation (σ): A higher variability requires a larger sample size to detect a
given effect.
Population Size (N): The total number of individuals in the population from which
the sample will be drawn.
When estimating a single mean (e.g., the average effect of a drug on a continuous
variable), the sample size formula is simpler.
Formula:
The formula for sample size estimation to estimate the population mean with a
specified confidence level and margin of error is:
n = (Z(α/2) · σ / E)²
Where:
n = sample size.
Zα/2 = Z-score for the desired confidence level.
σ = Standard deviation of the population.
E = Margin of error (how precise you want the estimate to be).
Example: Estimate the sample size required to estimate the mean blood pressure reduction after administering a new drug, with a 95% confidence level, a margin of error of 2 mmHg, and an assumed standard deviation of 8 mmHg.
n = (1.96 × 8 / 2)² = (7.84)² ≈ 61.5, rounded up to 62.
Approximately 62 participants are required.
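The same estimate can be produced with a small helper function, shown here as a sketch in Python (assuming scipy for the normal critical value):

import math
from scipy.stats import norm

def sample_size_for_mean(sigma, margin, confidence=0.95):
    # n = (z_{alpha/2} * sigma / E)^2, rounded up to the next whole participant
    z = norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

print(sample_size_for_mean(sigma=8, margin=2))   # 62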
Commonly used software tools for sample size and power calculations include G*Power and PASS (Power Analysis and Sample Size).
Sampling errors occur when the sample selected does not perfectly represent the
entire population, leading to bias or inaccuracies in estimating parameters such as the
effect of a drug, the prevalence of a condition, or the safety profile of a treatment.
Understanding and minimizing sampling errors is crucial for ensuring that research
results are scientifically sound and applicable to the broader population.
Sampling errors generally fall into two broad categories: random errors and
systematic errors.
7.6.1. Random Sampling Error: Random sampling error occurs due to the inherent
variability in selecting a sample from a population. Even with proper random
sampling techniques, the sample may differ from the population by chance alone.
These errors typically reduce the precision of the estimates but do not introduce
systematic bias.
Example: In a clinical trial testing a new medication, random sampling error might
result in a slightly higher or lower proportion of patients experiencing side effects
in the trial compared to the general population, purely due to random variation in
the sample.
7.6.2. Systematic Sampling Error (Bias): Systematic sampling error occurs when
the method of selecting the sample introduces bias, causing the sample to
consistently overestimate or underestimate the true population characteristics.
This type of error is not due to random chance but due to flaws in the sampling
design or methodology. Systematic errors are more dangerous than random errors
because they bias the results in a consistent direction, which could lead to false
conclusions about drug safety or efficacy.
1. Selection Bias:
This occurs when the sample is not representative of the population due to how
participants are selected. For example, if only patients from one type of healthcare
facility are recruited, the sample may not represent the broader population.
Example: A clinical trial that only recruits patients from large urban centers may not reflect the experiences of patients in rural or underserved areas.
2. Non-Response Bias:
3. Exclusion Bias:
Example: If pregnant women are excluded from a drug trial due to safety concerns,
the results may not be applicable to pregnant individuals.
4. Recall Bias:
In retrospective studies, where participants are asked to recall past events (e.g.,
side effects of a drug), their memories may be inaccurate or selective, leading to
systematic errors.
To reduce sampling errors and ensure valid and reliable results in pharmaceutical
research, the following steps can be taken:
1. Careful Study Design: Ensure the study design is robust, using appropriate
randomization, sampling methods, and inclusion/exclusion criteria.
2. Use of Proper Sampling Techniques:
o Random sampling: Ensures that every individual in the population has
an equal chance of being selected, reducing bias.
o Stratified sampling: Ensures that subgroups are adequately represented
in the sample.
3. Increase Sample Size: Larger samples reduce random error and improve the
precision of estimates, making the study more powerful.
4. Ensure a Comprehensive Sampling Frame: Ensure that the list or frame from
which participants are selected accurately represents the target population.
5. Minimize Non-Response: Encourage participation to reduce the risk of non-
response bias, which can distort findings.
6. Regular Monitoring and Auditing: Continuous oversight during the study can
help detect and address potential sources of error early.
8. Hypothesis Testing
The null hypothesis statement for the antihypertensive effect observed with a drug and a placebo is expressed as follows: the mean reduction in blood pressure for patients using the drug (μ drug) is equal to the mean reduction in blood pressure for patients using the placebo (μ placebo).
8.1.2. Alternative Hypothesis (H1 or Ha): This is the hypothesis that suggests a potential effect, difference, or relationship.
This means the average decrease in blood pressure for patients using the drug (μ drug) is not equal to the average decrease in blood pressure for patients using the placebo (μ placebo).
1. Test Statistic: This refers to a standardized value derived from sample data during a hypothesis test. It helps determine whether to reject the null hypothesis. Examples of test statistics include the Z-score, T-score, and chi-squared value, among others.
2. Significance Level (α): This is the probability threshold used to assess whether
the test outcome is statistically significant.
The significance level represents the likelihood of rejecting the null hypothesis when it is actually true, also known as a Type I error. In other words, it reflects the risk of mistakenly concluding that there is an effect or difference when, in fact, there is none.
Common Values of α: The most common value used for α in pharmaceutical and clinical research is 0.05, which means there is a 5% chance of committing a Type I error. The results are considered statistically significant if the p-value is less than 0.05, suggesting a less than 5% probability that the observed result is due to random chance. In pharmaceutical hypothesis testing, the significance level is a critical parameter for deciding whether the evidence from a clinical trial or experiment is strong enough to reject the null hypothesis and claim that a medication has a meaningful effect.
Decreasing α lowers the likelihood of Type I errors but raises the risk of Type II errors (failing to identify a true effect).
Increasing α makes it easier to identify a true effect (increasing power) but also heightens the likelihood of a Type I error.
A Type I Error would occur if, based on the sample data, the company rejected the null hypothesis (i.e., concluded the new drug is more effective than the placebo) when, in fact, the drug has no effect. The company would then wrongly believe that the drug is more effective than the placebo, even though that is not the case. In the pharmaceutical field, both Type I and Type II errors can lead to severe consequences, such as:
5. Type II Error (False Negative): This occurs when the null hypothesis is not
rejected when it is actually false. The probability of making a Type II error is
denoted by β.
1. Small sample size: Smaller sample sizes lead to more variability and less power to detect a true effect, raising the likelihood of a Type II error.
2. High variability in the data: If the observed data vary widely between participants, it becomes harder to identify a meaningful difference.
3. Small effect size: When the true difference between the drug and placebo is minimal, it is harder for the test to detect, especially with a small sample size.
Increase the sample size: Larger sample sizes reduce the standard error, improving the ability to identify small effects.
Increase the significance level (α): Raising α (e.g., from 0.01 to 0.05) makes it easier to reject the null hypothesis, though it also increases the risk of Type I errors.
Use more precise measurements: Reducing variability in the data (e.g.,
through better measurement techniques) enhances the test’s power.
Conduct a power analysis before the experiment to ensure that the study is designed with enough power to detect the expected effect.
In pharmaceutical hypothesis testing, power is a critical concept that helps assess the study's ability to identify a genuine effect when it is present. The power of a test is the probability that it will correctly reject the null hypothesis when a specific alternative hypothesis is true. In simpler terms, it is the likelihood of identifying a real treatment effect if it indeed exists.
Power = 1 − β
Where:
β (Beta) is the likelihood of making a Type II error, which occurs when the test fails to reject the null hypothesis even though the alternative hypothesis is true (i.e., failing to identify a genuine effect).
The power of a hypothesis test is influenced by several factors, all of which must be considered during the design of pharmaceutical studies:
Larger sample sizes typically enhance the power of a test. This is because an increased sample size reduces the standard error, making it easier to identify a true effect. The more data collected, the more precise the estimate of the treatment effect, and the greater the likelihood of rejecting the null hypothesis when it is false.
Effect size refers to the magnitude of the difference between the null hypothesis value and the true value of the parameter under the alternative hypothesis (e.g., the actual treatment effect). Larger effects are easier to detect and thus increase the power of the test. If a drug produces a large therapeutic effect, the power to detect this effect will be higher than if the effect is small.
2. Variance (σ²):
3. Study Design:
Example:
If a pharmaceutical company wants to test whether a new drug lowers blood pressure
more effectively than a placebo, they may use a power analysis to estimate the
number of patients required to detect a statistically significant difference with high
power, given expected effect sizes, variability, and α level.
Effect Size: The expected difference in the reduction of heart attack risk
between the drug and placebo is 10%.
Variance: The variability in heart attack risk reduction within each group is
assumed to be 5%.
Using these parameters, the company would calculate the minimum sample size needed to achieve 80% power to identify the 10% difference in heart attack risk reduction. If the required sample size is too large for practical reasons (e.g., cost, time), the company may adjust other parameters (e.g., effect size, α) to ensure that the study has sufficient power.
Formula for calculating sample size for a t-test with two independent samples:
n = 2 · (Z(α/2) + Zβ)² · σ² / Δ²
Where n is the sample size per group, Z(α/2) is the critical value for the significance level, Zβ is the critical value corresponding to the desired power, σ² is the variance, and Δ is the minimum difference to be detected.
With α = 0.05 and 80% power, Z(α/2) = 1.96 and Zβ = 0.84, so Z(α/2) + Zβ = 2.8. Using σ² = 0.05 and Δ² = 0.01 from the example above:
n = 2 × (2.8)² × 0.05 / 0.01 = 2 × 7.84 × 5 = 78.4 ≈ 79 patients per group.
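Power and sample-size calculations of this kind are usually done in software. A sketch using the statsmodels Python package (an assumption; dedicated tools such as G*Power produce comparable results), with the example's effect size expressed as Cohen's d = Δ/σ = 0.1/√0.05 ≈ 0.447:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the sample size per group given effect size, alpha and power
n_per_group = analysis.solve_power(effect_size=0.447, alpha=0.05, power=0.80)
print(round(n_per_group))   # ~80 per group, close to the hand calculation above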
Software Tools:
There are several tools and software packages that can help with these calculations,
including:
8.3 Types of Tests:
1. One-tailed Test:
The alternative hypothesis specifies a direction (greater than or less than). It examines
the potential for an effect in a single direction.
improvement, you would not reject the null hypothesis, as you're only testing for
greater effectiveness, not equivalence or lesser effectiveness.One-tailed tests are used
when you are confident in the direction of the effect (e.g., a new drug should be
better, not worse). One-tailed tests typically require smaller sample sizes to achieve
the same power because they focus on only one direction of the distribution,
increasing the likelihood of detecting an effect in that direction.
Example: A new vaccine is tested against a placebo to determine whether it is more
effective in generating immunity.
Null Hypothesis (H₀): The new vaccine is no more effective than the placebo in
generating immunity (i.e., the immune response is the same as or worse than with the
placebo).
Alternative Hypothesis (H₁): The new vaccine is more effective than the placebo in
generating immunity.
2. Two-tailed Test:
The alternative hypothesis does not specify a direction (just a difference); the test
allows for an effect in either direction.
In pharmaceutical research, a two-tailed test is used when the hypothesis does not
predict the direction of the effect. It is appropriate when researchers are interested in
detecting differences in both directions, such as whether a new drug is either more or
less effective than the current treatment, or whether its side effects are either worse
or better than the alternative. A two-tailed test is typically used when the expectation
is not directional or when both outcomes are of interest. Two-tailed tests allow for an
unbiased evaluation of the data, as they are used when there is no preconceived idea
about whether the new treatment will perform better or worse than the existing
treatment.
Example: A new vaccine is tested against placebo to find out if it provides immunity
against a particular virus.
Null Hypothesis (H₀): The new vaccine is as effective as the placebo in providing
immunity.
Alternative Hypothesis (H₁): The new vaccine differs from the placebo in providing
immunity (it may be more or less effective).
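As a hedged sketch of how the two alternatives differ in practice, the following Python snippet runs the same comparison one-tailed and two-tailed with scipy; the antibody-titre values are invented purely for illustration.
```python
# Hypothetical sketch: antibody titres for vaccine vs placebo (values invented).
from scipy import stats

vaccine = [12.1, 13.4, 11.8, 14.2, 12.9, 13.7]
placebo = [10.9, 11.2, 12.0, 10.5, 11.6, 11.1]

# Two-tailed: H1 is that the means differ in either direction.
t_two, p_two = stats.ttest_ind(vaccine, placebo, alternative='two-sided')

# One-tailed: H1 is that the vaccine mean is greater than the placebo mean.
t_one, p_one = stats.ttest_ind(vaccine, placebo, alternative='greater')

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```
When the observed effect lies in the hypothesized direction, the one-tailed p-value is half the two-tailed one, which is why one-tailed tests reach significance with smaller samples.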
9. Parametric Tests
Statistical tests are used to analyze data and make inferences about populations based
on samples. Parametric and nonparametric tests represent two different approaches to
such analysis: parametric tests make specific assumptions about the underlying
population distribution, while nonparametric tests do not rely on these assumptions.
Understanding the differences between the two types of tests is crucial for appropriate
data analysis and accurate conclusions.
Parametric tests are statistical tests that make certain assumptions about the
underlying population from which the data are drawn. They are widely used in
statistical analysis due to their effectiveness and ability to provide rich, interpretable
results. Parametric tests are preferred when the data are well represented by the mean.
The most common parametric tests are the t-test and Analysis of Variance (ANOVA).
These tests are mainly used for quantitative data consisting of continuous variables.
Parametric tests generally have higher statistical power when the assumptions are met.
They can provide more precise estimates and narrower confidence intervals, making
them more sensitive to detecting small effects. It’s essential to check whether the data
meet the following necessary assumptions before applying these tests, as violations
can lead to incorrect conclusions.
1. Normality: the data in each group should be approximately normally
distributed.
2. Homogeneity of variance: the groups being compared should have
approximately equal variances.
3. Interval or ratio scale data: parametric tests generally require the data to be
measured on at least an interval scale, where both the order and the exact
difference between values are meaningful.
4. Independence: the observations should be independent of each other, meaning
the data from one subject or group should not influence another.
5. No extreme outliers: the sample data should not contain extreme outliers.
Advantages:
More powerful than non-parametric tests when the assumptions are met, meaning
they have a greater ability to detect a true effect.
Provide more detailed information, such as confidence intervals and effect sizes.
Disadvantages:
Sensitive to violations of assumptions: if the data are not normal or the variances are
not equal, the test results can be misleading.
Require larger sample sizes to reliably detect differences when assumptions are
violated.
Choose between a z-test and a t-test by looking at the sample size and whether the
population variance is known, as shown in Fig 9.1: when the population variance is
known (and the sample is large), the Z-test is used; otherwise, the t-test is used.
Fig 9.1: Decision chart for choosing between the Z-test and the t-test.
t-Test
The t-test is based on Student's t-distribution, introduced by William Sealy Gosset
under the pen name "Student". It is an inferential statistical test used to determine
whether there is a significant difference between the means of two groups, and it is
typically applied when the data are approximately normally distributed and the
population variance is unknown. In hypothesis testing, it assesses whether the
observed difference between the group means is statistically significant or merely due
to random variation.
For large values of ν (i.e., increased sample size n), the t-distribution tends to the
standard normal distribution; the shape of the t-distribution therefore differs for
different values of ν.
The t-distribution is lower than the normal distribution at the center and heavier
in the tails. The peak height y is greatest at μ = 0.
The variance is equal to ν/(ν − 2) for ν > 2, is infinite for 1 < ν ≤ 2, and is
otherwise undefined.
t-distribution applications
1. Testing a hypothesis about a single mean: a one-sample t-test examines
whether a sample mean differs from a hypothesized population mean when the
population variance is unknown.
2. Testing the hypothesis of the difference between two means: t-tests can be
employed to examine whether there is a significant difference between the
means of two independent samples, either assuming equal variances or
allowing unequal variances. In scenarios where the samples are not
independent, such as paired or dependent samples, a paired t-test is used.
Assumptions in the t-test: the data are continuous, the observations are independent,
the data in each group are approximately normally distributed, and (for the two-sample
test with pooled variance) the group variances are approximately equal.
Types of t-tests
There are different types of t-tests depending on the research question and the design
of the study:
1. One-sample t-test
This test compares the mean of a sample to a known value (often the population
mean).
Null hypothesis (H₀): The sample mean is equal to the population mean.
Alternative hypothesis (H₁): The sample mean is different from the population
mean.
2. Two-sample (independent) t-test
This test compares the means of two independent groups.
Null hypothesis (H₀): The means of the two groups are equal.
Alternative hypothesis (H₁): The means of the two groups are not equal.
3. Paired t-test
This test is used when the two groups being compared are related, such as
measurements taken from the same subjects at different times (before and after
treatment).
For the one-sample t-test, the test statistic is:
t = (X̄ − μ)/(s/√n)
Where:
Xˉ = Sample mean (e.g., the mean dosage of active ingredient in the tablets).
μ = Population mean (e.g., the target dosage or potency).
s = Sample standard deviation (e.g., variability in the dosage among the
tablets).
n = Sample size (e.g., the number of tablets tested).
Using a t-distribution table, find the critical t-value based on the selected
significance level (α) and degrees of freedom (df).
If the absolute value of the calculated t-statistic is greater than the critical
value, reject the null hypothesis.
If the absolute value of the t-statistic is smaller than the critical value, fail to
reject the null hypothesis.
Example: A sample of 20 tablets has a mean weight of 98 mg with a standard
deviation of 2 mg; the target dosage is 100 mg.
t = │X̄ − μ│/(s/√n) = │98 − 100│/(2/√20) = 2/(2/4.472) = 2/0.447 = 4.47
For α = 0.05 and df = 19, the critical t-value (for a two-tailed test) is approximately
2.093 (from the t-distribution table).
Step 6: Conclusion
Since the calculated t-statistic (4.47) is greater than the critical value (2.093), we
reject the null hypothesis. This indicates that the mean weight of the sample is
significantly different from the target dosage of 100 mg.
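For readers who prefer to verify such calculations in software, this short Python sketch reproduces the example from its summary statistics alone (mean 98, s = 2, n = 20, target 100), using scipy's t-distribution.
```python
# Reproducing the worked example from its summary statistics alone.
import math
from scipy import stats

xbar, mu, s, n = 98, 100, 2, 20
t_stat = (xbar - mu) / (s / math.sqrt(n))        # -4.47
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-tailed p-value
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")    # |t| > 2.093 -> reject H0
```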
Two-sample (independent) t-test
This test is used to compare the means of two independent groups to see whether
there is evidence that the associated population means differ significantly. In
pharmaceutical research, comparing the effects of two different treatments, drugs, or
therapies often involves determining whether there is a significant difference between
the means of two independent groups.
The formula for the t-statistic for two independent samples (pooled variance) is:
t = (X̄₁ − X̄₂)/(s_p √(1/n₁ + 1/n₂))
Where X̄₁ and X̄₂ are the sample means, n₁ and n₂ are the sample sizes, and s_p is
the pooled standard deviation:
s_p = √(((n₁ − 1)s₁² + (n₂ − 1)s₂²)/(n₁ + n₂ − 2))
The degrees of freedom (df) for the two-sample t-test can be calculated as:
df = n₁ + n₂ − 2
Example: Body fat percentage data observed from two genders (n₁ = 10 and n₂ = 13)
are furnished below. Test the difference at the 0.05 level of significance.
The null hypothesis is that the underlying population means are the same,
H₀: μ₁ = μ₂.
The alternative hypothesis is that the underlying population means are not the same,
H₁: μ₁ ≠ μ₂.
s_p = √((225 + 432)/21) = √(657/21) = √31.2857 = 5.59
t = 7/(5.59 × √(1/10 + 1/13)) = 7/(5.59 × √0.177) = 7/(5.59 × 0.42) = 7/2.35
  = 2.978
df=n1+n2−2
df=10+13−2
=21
The t value with α = 0.05 and 21 degrees of freedom is 2.080 (Table 9.1).
We compare the value of our statistic (2.978) with the critical t value. Since
2.978 > 2.080, we reject the null hypothesis that the mean body fat of men and
women is equal, and conclude that there is evidence that mean body fat differs
between men and women in the population.
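The same test can be run in scipy directly from summary statistics. Since the text gives only the difference in means (7) and the sums of squares (225 and 432, i.e., s₁ = 5 and s₂ = 6), the individual means of 25 and 18 below are assumed purely for illustration; any pair of means differing by 7 gives the same t.
```python
# Means of 25 and 18 are assumed (only their difference, 7, is given);
# s1 = 5 and s2 = 6 follow from the sums of squares 225 and 432.
from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=25, std1=5, nobs1=10,   # assumed mean; s and n from the example
    mean2=18, std2=6, nobs2=13,
    equal_var=True)               # classical pooled-variance t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # t ~ 2.98 > 2.080 -> reject H0
```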
The assumption of equal variances in the two groups being compared may not hold.
When this assumption is violated, Welch's t-test is often the preferred method for
comparing the means of two independent groups. Welch's t-test is a variation of the
standard independent-samples t-test that does not assume equal variances. The test
statistic and its degrees of freedom are computed with the following formulae:
t = (X̄₁ − X̄₂)/√(s₁²/n₁ + s₂²/n₂)
df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1)]
Example: To compare the effect of Drug A and Drug B on blood pressure reduction,
where the variances in response to the drugs may differ, suppose Drug A gives a
mean reduction of 10 (s₁ = 5, n₁ = 30) and Drug B a mean reduction of 8 (s₂ = 8,
n₂ = 30).
The null hypothesis is that the underlying population means are the same,
H₀: μ₁ = μ₂.
The alternative hypothesis is that the underlying population means are not the same,
H₁: μ₁ ≠ μ₂.
t = (X̄₁ − X̄₂)/√(s₁²/n₁ + s₂²/n₂)
  = (10 − 8)/√(5²/30 + 8²/30)
  = 2/√(25/30 + 64/30)
  = 2/√(0.833 + 2.133)
  = 2/√2.966
  = 2/1.72
  = 1.16
df = (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1)]
   = (25/30 + 64/30)² / [(25/30)²/29 + (64/30)²/29]
   = 8.797/0.1798
   = 48.93 ≈ 49
The critical t-value for a significance level of 0.05 and 49 degrees of freedom,
obtained from a t-distribution calculator, is 2.0096.
We compare our t-statistic (1.16) with this critical value. Since 1.16 < 2.0096, we
fail to reject the null hypothesis and conclude that there is no significant evidence of
a difference in blood pressure reduction between Drug A and Drug B.
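A quick software check of this example, using the stated summary statistics and scipy's equal_var=False option, which performs Welch's test:
```python
# Welch's test from the stated summary statistics (equal_var=False).
from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=10, std1=5, nobs1=30,   # Drug A
    mean2=8, std2=8, nobs2=30,    # Drug B
    equal_var=False)              # do not assume equal variances (Welch)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # t ~ 1.16 < 2.0096 -> retain H0
```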
Paired t-test
The paired t-test is a statistical method used to compare the means of two related (or
dependent) groups. It is commonly used when the same subjects are measured under
two different conditions, or when measurements are taken at two different times. This
test is typically used to assess whether there is a significant difference between two
sets of paired data.
Before and After Studies: For example, measuring the blood pressure of
patients before and after treatment with a drug.
Matched Pairs: When individuals are matched based on certain characteristics
and then compared on some outcome measure, like comparing the efficacy of
two treatments on patients with similar baseline conditions.
Repeated Measures: For example, testing a group of patients’ blood sugar
levels at two different time points.
The degrees of freedom for the paired t-test are given by:
df=n−1
Where n is the number of paired observations (i.e., the number of subjects or paired
measurements).
Example: The blood glucose levels observed from six randomly selected patients
before and after administration of a drug is furnished below. Test the significance of
difference in blood glucose levels due to drug
Subject Blood glucose level (mg/dl)
Before treatment After treatment
1 170 150
2 175 165
3 160 150
4 150 135
5 180 170
6 185 160
Null hypothesis (H₀): There is no difference in the means of the blood glucose
levels. Differences in mean=0.
Alternative hypothesis (H₁): There is a significant difference in the means of
the blood glucose levels, μdiff≠0 (two-tailed)
t = D̄/(s_D/√n) = 15/(6.32/√6) = 15/(6.32/2.449) = 15/2.58 = 5.81
Where D̄ is the mean of the paired differences (here 15) and s_D is their standard
deviation (6.32).
The critical t-value for a significance level of 0.05 and 5 degrees of freedom using a t-
distribution table is 2.571.
We compare the value of our statistic (5.81) with the critical t value. Since
5.81 > 2.571, we reject the null hypothesis of no difference and conclude that there
is evidence that blood glucose levels changed significantly after drug administration.
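The paired example can be verified directly from the table's values with scipy:
```python
# Paired t-test on the glucose values from the table above.
from scipy import stats

before = [170, 175, 160, 150, 180, 185]
after = [150, 165, 150, 135, 170, 160]

t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # t ~ 5.81 > 2.571 -> reject H0
```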
9.2 ANOVA
ANOVA (Analysis of Variance) compares the means of three or more groups and
rests on the following assumptions:
Independence of observations.
Each group should follow a normal distribution.
The variance within each group should be approximately equal.
The test statistic follows the F-distribution. Critical region: the area under the
F-curve to the right of the critical F-value is the rejection region, where the null
hypothesis would be rejected.
Types of ANOVA
One-Way ANOVA
One-way ANOVA is used to compare the means of three or more groups
based on a single independent variable. It shows if there is a significant
difference among the group means.
Formula: F = MS_between/MS_within, the ratio of the mean square between
groups to the mean square within groups.
Advantages: a single test compares several group means at once, controlling
the overall Type I error rate better than multiple t-tests would.
Limitations: a significant F indicates only that at least one group mean differs;
post-hoc tests are needed to identify which groups differ.
One-Way ANOVA
A one-way ANOVA (Analysis of Variance) is used to compare the means of three or
more groups to determine if there is a statistically significant difference between
them.
Example: A clinical trial is conducted with three groups of patients receiving different
dosages of the antihypertensive drug (50, 100, and 150 mg). Blood pressure is
measured before and after taking the drug for a specific period. The change in blood
pressure (reduction) for each patient is recorded. Does the dosage of a certain
pharmaceutical drug affect the reduction in blood pressure in patients?
Solution: The data contains one dependent variable (blood pressure reduction) and
one independent variable (dose) with three levels. So the data can be treated
statistically with one way ANOVA.
ΣX² = 100 + 144 + 196 + 169 + 196 + 225 + 324 + 289 + 256 + 324 + 289 + 361 +
484 + 441 + 529 + 576 + 625 + 529 = 6057
Correction factor = (ΣX)²/N = (321)²/18 = 5724.5
Total sum of squares = 6057 − 5724.5 = 332.5
Between-groups sum of squares = (78²/6 + 105²/6 + 138²/6) − 5724.5
= (6084 + 11025 + 19044)/6 − 5724.5 = 6025.5 − 5724.5 = 301
Within-groups sum of squares = 332.5 − 301 = 31.5
ANOVA Table
Source           SS      df   MS      F
Between groups   301     2    150.5   71.67
Within groups    31.5    15   2.1
Total            332.5   17
Critical value: the table value for 2 and 15 degrees of freedom at the 0.05 level of
significance is 3.68 (Table 10.2).
Inference: the calculated F value (71.67) is greater than the table value, and hence
the null hypothesis is rejected; the dosage significantly affects the reduction in blood
pressure.
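As a software check, the raw reductions below are inferred from the squared values listed above (group sums 78, 105, and 138); their assignment to the 50, 100, and 150 mg doses is assumed.
```python
# Raw reductions inferred from the squared values above (sums 78, 105, 138);
# their assignment to the 50, 100 and 150 mg doses is assumed.
from scipy import stats

dose_50 = [10, 12, 14, 13, 14, 15]
dose_100 = [18, 17, 16, 18, 17, 19]
dose_150 = [22, 21, 23, 24, 25, 23]

f_stat, p_value = stats.f_oneway(dose_50, dose_100, dose_150)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")  # F ~ 71.7 > 3.68 -> reject H0
```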
Two-Way ANOVA
Two-way ANOVA examines the effect of two independent variables on a dependent
variable. For example, suppose a researcher wants to explore how pH and surfactant
affect drug solubility; experiments are conducted at different levels of pH and
surfactant, and the drug solubility is recorded.
ΣX² = 98² + 95² + 104² + 93² + … + 94²
    = 9604 + 9025 + 10816 + 8649 + 10404 + 11449 + 11236 + 9604 + 8464 +
      10404 + 11664 + 8836 = 120155
Correction factor = (ΣX)²/N = (1199)²/12 = 119800.08
Total sum of squares = 120155 − 119800.08 = 354.92
Sum of squares of columns = (390²/4 + 413²/4 + 396²/4) − 119800.08
= (152100 + 170569 + 156816)/4 − 119800.08 = 71.17
Sum of squares of rows = (292²/3 + 304²/3 + 318²/3 + 285²/3) − 119800.08
= (85264 + 92416 + 101124 + 81225)/3 − 119800.08 = 120009.67 − 119800.08
= 209.59
Sum of squares of error = Total SS − SS columns − SS rows
= 354.92 − 71.17 − 209.59 = 74.16
ANOVA Table
Source    SS       df   MS      F
Columns   71.17    2    35.59   2.88
Rows      209.59   3    69.86   5.65
Error     74.16    6    12.36
Total     354.92   11
The critical F value for 3 and 6 degrees of freedom at the 0.05 level of significance
is 4.76 (rows); for 2 and 6 degrees of freedom it is 5.14 (columns). The row effect
(F = 5.65 > 4.76) is therefore significant, while the column effect (F = 2.88 < 5.14)
is not.
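A sketch of the same analysis with statsmodels. The 4 × 3 grid below is reconstructed from the row sums (292, 304, 318, 285) and column sums (390, 413, 396); treating rows as pH levels and columns as surfactant levels is an assumption, since the original data table is not reproduced here.
```python
# 4 x 3 grid reconstructed from the row sums (292, 304, 318, 285) and
# column sums (390, 413, 396); the pH/surfactant labelling is assumed.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'solubility': [98, 102, 92, 95, 107, 102, 104, 106, 108, 93, 98, 94],
    'ph': ['p1'] * 3 + ['p2'] * 3 + ['p3'] * 3 + ['p4'] * 3,  # rows
    'surfactant': ['s1', 's2', 's3'] * 4,                     # columns
})

model = ols('solubility ~ C(ph) + C(surfactant)', data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
# Rows (pH): SS ~ 209.6 on 3 df; columns (surfactant): SS ~ 71.2 on 2 df;
# error SS ~ 74.2 on 6 df, matching the hand calculation above.
```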
10. Non-Parametric Tests
A parametric statistical test is a type of test that relies on specific assumptions about
the parameters of the population from which the sample is drawn. These assumptions
include normality of the underlying distribution, homogeneity of variances, and
measurement on an interval or ratio scale.
In contrast, a non-parametric statistical test does not make assumptions about the
parameters of the population from which the sample is derived.
It does not require the stringent measurement levels necessary for parametric
tests.
Many non-parametric tests are suitable for ordinal data, and some can be used
with nominal data.
These tests provide P-values but do not report confidence intervals.
Non-parametric tests are more flexible and robust since they make fewer
assumptions, though they tend to have less power when the data would be
more appropriately analyzed using parametric tests.
It is not required for the sample to follow a Gaussian distribution.
Most statistical tests are based on the assumption that the sample follows a known
distribution, such as normal or binomial. If the sample does not conform to these
distributions, the results may deviate from the true values. In such cases, particularly
with non-Gaussian distributions, non-parametric methods are employed. The
commonly used methods for testing the means of quantitative variables generally
assume that the data come from a normal distribution. However, in practice, it is rare
for a distribution to be perfectly normal. Fortunately, tests like the one-sample and
two-sample t-tests, and analysis of variance (ANOVA), are quite resilient to moderate
violations of normality, particularly when the sample sizes are sufficiently large.
Parametric tests such as the Z, t, and F tests focus on drawing inferences about the
population mean(s), while non-parametric tests are typically used to make inferences
about the population median(s).
The statistics used in non-parametric tests often focus on basic elements of the sample
data, such as the signs of measurements, order relationships, or category frequencies.
These tests are not affected by changes in the scale, whether the scale is stretched or
compressed, and as a result, the null distribution for a non-parametric statistic can be
determined without considering the shape of the underlying population distribution.
This makes non-parametric methods ideal when the distribution of the parent
population is uncertain. In fields like social and behavioural sciences, where
observations may be difficult or impossible to quantify on numerical scales, non-
parametric tests are particularly suitable.
These methods typically involve ranking the data, which reduces the precision of the
information (since we convert raw data into relative rankings). Most non-parametric
tests require the data to be ranked on an ordinal scale, where the smallest observation
is assigned a rank of 1, the second smallest a rank of 2, and so on until the largest
observation receives the rank n. By doing this, we may obscure true differences in the
data, making it harder to detect significant differences. In essence, non-parametric
tests require larger differences to be considered statistically significant.
Non-parametric tests generally assume equal variance within groups and can be
applied to nominal variables. These tests are particularly useful when dealing with
small sample sizes or when the assumptions of normality and homoscedasticity (equal
variances) cannot be met. Unlike parametric tests, which are not influenced by sample
size, non-parametric tests vary in their applicability based on sample size. Though
non-parametric tests are easier to compute, they are typically less powerful, meaning
there is a greater risk of committing a Type II error (failing to detect a true effect).
Situations like the following call for non-parametric methods:
1. The data do not appear to follow a normal distribution; for example, a larger
number of participants reported an improvement in taste.
2. A clinical study was conducted to assess the prevalence of a microbial
infection, where various cultures were collected for microbial detection.
When dealing with an ordinal, interval, or continuous variable, the values are ranked
from lowest to highest. Non-parametric tests primarily focus on the ranks of the data
rather than the raw values. In cases where there are tied ranks (i.e., identical
responses), each tied observation is assigned the average rank. When performing non-
parametric tests, it is important to check that the sum of the ranks is equal to n(n+1)/2,
where n represents the number of observations. An example of how ranks are
assigned to data is illustrated in the following example.
Ranking Data
Data 22 28 26 25 27
Rank 1 5 3 2 4
In the case of ties, the average of the rank values is assigned to each tied observation,
as follows.
Ranking Data
Data 25  28  30  24  26  25  26  28  29  26
Rank 2.5 7.5 10  1   5   2.5 5   7.5 9   5
In this example there were two 25s (ranks 2 and 3) with an average rank of 2.5,
three 26s (ranks 4, 5, and 6) with an average rank of 5, and two 28s (ranks 7 and 8)
with an average rank of 7.5.
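The same average-rank assignment is what scipy's rankdata performs by default:
```python
# rankdata assigns average ranks to ties, exactly as in the table above.
from scipy.stats import rankdata

data = [25, 28, 30, 24, 26, 25, 26, 28, 29, 26]
print(rankdata(data))
# [ 2.5  7.5 10.   1.   5.   2.5  5.   7.5  9.   5. ]
```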
When comparing sets of data from different groups or different treatment levels
(levels of the independent variable), the ranking involves all of the observations,
regardless of the group in which each observation occurs:

Group 1          Group 2          Group 3
Value  Rank      Value  Rank      Value  Rank
16     5         20     14        19     11.5
18     9         19     11.5      24     19
17     6.5       15     3.5       23     18
12     1         17     6.5       20     14
14     2         18     9         18     9
15     3.5       22     17        20     14
                 21     16        25     20

The accuracy of the ranking process can be verified in two ways. First, the highest
rank assigned should match the total number of observations, denoted N. Second, the
sum of all the ranks should equal N(N + 1)/2, where N represents the total number of
observations.
Ranking based on observed values: once the data are arranged, ranks are assigned
based on the numerical values. For instance, an animal with greater oedema
(measured by diameter) will receive a higher rank, while an animal with no oedema
will receive the lowest rank. These ranks are then used for further analysis in
non-parametric tests.
Advantages of non-parametric tests:
1. Simpler calculations
2. Quick and easy to execute
3. Easy to interpret
4. More efficient than parametric tests when the data is not normally
distributed
5. Applicable to all scales of measurement
6. Fewer assumptions required
Disadvantages of non-parametric tests:
1. Less power compared to parametric tests. Power refers to the ability of the
test to detect differences when they exist.
2. Does not utilize the full information provided by the sample.
3. Interpretation can be more challenging.
4. Requires larger sample sizes to achieve comparable results with parametric
tests when both are applicable.
5. May result in wasted information if parametric tests could have been used.
6. Can be difficult to compute by hand for large sample sizes.
7. Statistical tables for non-parametric tests are not always readily available.
Commonly used non-parametric tests include:
1. Sign Test
2. Wilcoxon Signed Rank Test
3. Mann-Whitney Test
4. Kruskal-Wallis Test
5. Friedman Test
6. Multiple Comparison Test
7. Quade Test
8. Rank Correlation
9. Run Test of Randomness
10. Median Test
11. Kolmogorov-Smirnov Test
Mann-Whitney U-Test:
Procedure:
1. Combine the observations from both groups and rank them from the smallest
to the largest, ignoring group labels.
2. Assign equal ranks to identical observations.
3. Reassign the observations back to their respective treatment groups.
4. Compute the U statistic for each group:
U₁ = n₁n₂ + n₁(n₁ + 1)/2 − R₁
U₂ = n₁n₂ + n₂(n₂ + 1)/2 − R₂
Where R₁ and R₂ are the sums of the ranks assigned to groups 1 and 2.
Example: The number of migraine episodes recorded in six patients receiving a
placebo and six patients receiving a new drug is given below.
Placebo  7 8 9 10 12 8
New drug 1 3 2 2  4  1
Solution: In the given example, the outcome is a count, the sample size is small (n1 =
n2 = 6), and the data does not follow a normal distribution, as the number of migraine
episodes observed in the placebo group is significantly higher than that in the drug-
treated group. The data is collected from two independent groups (placebo vs. drug
treatment), making the Mann-Whitney U-Test the appropriate statistical method to
analyze the data.
Null Hypothesis: The median number of migraine episodes is the same in both
groups. H0: The median number of migraine episodes in the placebo group is equal to
the median number of episodes in the drug-treated group.
Alternative Hypothesis: The median number of migraine episodes is not the same
between the two groups. H1: The median number of migraine episodes in the placebo
group is not equal to the median number in the drug-treated group.
The ranks should be assigned from the smallest to largest by combining the data from
both groups, as shown in the following example.
Calculation of the U statistic: The test statistic for the Mann-Whitney test is denoted
U, and the smaller of the two values is used.
U₁ = 6 × 6 + 6(6 + 1)/2 − 57 = 36 + 21 − 57 = 0
U₂ = 6 × 6 + 6(6 + 1)/2 − 21 = 36 + 21 − 21 = 36
Decision Rule: The critical value for U can be obtained from the critical value table
for U, given the sample sizes (n1 = 6 and n2 = 6) and a two-tailed test with a
significance level of 0.05. According to the table (10.1), the critical value is 5.
Conclusion: Since the observed U value is smaller than the critical value, the null
hypothesis is rejected. This indicates that the median number of migraine episodes
observed between the two groups is not the same.
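The migraine example re-run in scipy; mannwhitneyu reports the U statistic for its first argument:
```python
# mannwhitneyu reports the U statistic for its first argument.
from scipy.stats import mannwhitneyu

placebo = [7, 8, 9, 10, 12, 8]
new_drug = [1, 3, 2, 2, 4, 1]

u_stat, p_value = mannwhitneyu(new_drug, placebo, alternative='two-sided')
print(f"U = {u_stat}, p = {p_value:.4f}")  # U = 0 <= critical value 5 -> reject H0
```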
For sample sizes equal to or greater than 20, significance is determined using the
normal approximation:
Z = │T − n₁(n₁ + n₂ + 1)/2│ / √(n₁n₂(n₁ + n₂ + 1)/12)
Where T is the sum of the ranks in the first group, and n₁ and n₂ are the two sample
sizes.
Example: The systolic blood pressure measurements from two groups, each
consisting of 20 subjects from rural and urban areas, are provided. The question is
whether it can be claimed that there is a significant difference in systolic blood
pressure between individuals from rural and urban areas.
8 152 152
9 149 142
10 138 153
11 136 145
12 118 135
13 147 125
14 137 110
15 139 132
16 144 138
17 137 109
18 162 120
19 128 162
20 109 158
Solution: The data represent two independent groups, the sample size in each group
is 20, and the distribution is not known; hence the data can be analyzed with the
Mann-Whitney U-Test for large samples.
Null hypothesis:H0: The median systolic blood pressure observed from rural people is
similar to the median systolic blood pressure observed from the urban people.
Alternative hypothesis: H1: The median systolic blood pressure observed from rural
people is different from the median systolic blood pressure observed from the urban
people.
Calculation of the Z statistic:
Z = │398.5 − 20(20 + 20 + 1)/2│ / √(20 × 20(20 + 20 + 1)/12)
  = │398.5 − 410│/√1366.67 = 11.5/36.97 = 0.311
Conclusion: The calculated Z value (0.311) is smaller than the critical Z value
(1.96 at the 0.05 level), so the null hypothesis is not rejected. Therefore, we conclude
that the median systolic blood pressure of individuals from rural areas is similar to
that of individuals from urban areas.
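A minimal helper implementing this normal approximation, using the example's figures (T = 398.5, n₁ = n₂ = 20):
```python
# Normal approximation for large samples; T is the rank sum of group 1.
import math

def mann_whitney_z(T, n1, n2):
    """Z = |T - n1(n1 + n2 + 1)/2| / sqrt(n1*n2*(n1 + n2 + 1)/12)."""
    mean_T = n1 * (n1 + n2 + 1) / 2
    sd_T = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return abs(T - mean_T) / sd_T

print(f"Z = {mann_whitney_z(398.5, 20, 20):.3f}")  # ~0.311 < 1.96 -> retain H0
```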
The Kruskal-Wallis test is primarily used when there is one nominal variable and one
measurement variable, which would typically be analyzed using one-way ANOVA,
but the measurement variable does not meet the normality assumption required for
one-way ANOVA. As a non-parametric test, the Kruskal-Wallis test does not assume
that the data follows a distribution described by just two parameters—mean and
standard deviation. Instead, it operates on ranked data. To perform this test, the
observed values are converted into ranks, with the smallest value receiving a rank of
1, the next smallest a rank of 2, and so on. This process of ranking the data results in a
loss of the original information, which may reduce the power of the test compared to
one-way ANOVA. One-way ANOVA assumes equal variation within groups
(homoscedasticity), whereas the Kruskal-Wallis test does not assume normality, but it
does require that the distributions across groups are similar. Groups with differing
standard deviations would have different distributions, which is why Kruskal-Wallis
is not suitable in such cases.
The Kruskal-Wallis test is ideal when the data involves one nominal variable and one
ranked variable. When this condition is met, a one-way ANOVA cannot be applied,
and the Kruskal-Wallis test is the appropriate alternative.
The Kruskal-Wallis (H) test is used to assess whether independent samples come from
the same population or not. In other words, it tests if the samples come from
populations with identical distributions. Although the Kruskal-Wallis test is
sometimes described as testing if the mean ranks of the groups are the same, it is more
accurate to state that it tests whether the medians of the groups are equal, provided the
shape of the distributions is identical across groups. Unlike one-way ANOVA, which
tests if samples are drawn from a Gaussian distribution, the Kruskal-Wallis test is
employed when samples are not normally distributed or when the data consists of
ranks. It is particularly useful for comparing more than two groups. The H value is
calculated using the following formula
H = χ²₍k−1₎ = [12/(N(N + 1))] Σ(Rᵢ²/nᵢ) − 3(N + 1)
Where N is the total number of observations, Rᵢ is the sum of the ranks in group i,
and nᵢ is the number of observations in that group.
The computed value is then compared with the critical value from the Kruskal-Wallis
test table (Table 10.8). If the calculated value is smaller than the table value, the null
hypothesis is accepted.
When each sample has at least five observations, the sampling distribution closely
approximates a chi-square distribution with k − 1 degrees of freedom.
In one-way ANOVA, the goal is to test whether the population means are equal.
However, in the Kruskal-Wallis test, we examine whether the population medians
are equal. The method involves ranking all the observations across all groups,
followed by applying a one-way ANOVA to these ranks instead of the original data
values. This test is typically used when there are more than two treatments. While it is
preferable for the sample sizes to be equal, it is not mandatory. The Kruskal-Wallis
test is essentially an extension of the rank sum test for more than two treatments. If
the average of at least two treatments differs, significant differences are detected. The
procedure involves combining all the data and ranking each observation, then
summing the ranks for each group. This statistic is approximately distributed as a
chi-square distribution with k − 1 degrees of freedom. The chi-square approximation
holds when the sample size in each group is greater than five.
Among various non-parametric tests, the Kruskal-Wallis one-way ANOVA by ranks
is considered the best method for analyzing data from more than two groups.
Example:
Tablets of drug X were produced using three distinct techniques: wet granulation, dry
granulation, and direct compression. The disintegration times observed for tablets
made using these methods are presented below.
Direct compression: 8, 7, 6, 7, 9, 10
Based on the results, can we infer that the median disintegration time for tablets
produced using three different techniques is the same?
Solution:
Null hypothesis (H0): The median disintegration time for the samples is
identical across all groups.
Alternative hypothesis (H1): The median disintegration time for the samples
differs across the groups.
Level of significance (α): 0.05
Decision Rule:
The null hypothesis should be rejected if the calculated value of H is greater than
5.991, which is the critical value for a 0.05 significance level with 2 degrees of
freedom.
The ranks allotted to the tablets prepared with the direct compression technique,
based on observed disintegration time: R₁ = 4, 2.5, 1, 2.5, 5, 6.5; Sum = 21.5.
The ranks allotted to the tablets prepared with the wet granulation technique:
R₂ = 8, 9.5, 11.5, 9.5, 11.5, 6.5; Sum = 56.5.
The ranks allotted to the tablets prepared with the dry granulation technique:
R₃ = 14.5, 13, 16, 14.5, 17, 18; Sum = 93.
H = [12/(18 × 19)] × (21.5²/6 + 56.5²/6 + 93²/6) − 3(19)
  = (12/342) × (12303.5/6) − 57
  = 71.95 − 57
  = 14.95
Decision
H = 14.95 > 5.991, so the null hypothesis is rejected: the median disintegration
times of tablets produced by the three techniques are not the same.
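The H statistic can be recomputed from the rank sums with a few lines of Python implementing the formula above:
```python
# H recomputed from the rank sums 21.5, 56.5 and 93 (n = 6 per group).
def kruskal_wallis_h(rank_sums, group_sizes):
    """H = 12/(N(N + 1)) * sum(R_i**2 / n_i) - 3(N + 1)."""
    N = sum(group_sizes)
    term = sum(R ** 2 / n for R, n in zip(rank_sums, group_sizes))
    return 12 / (N * (N + 1)) * term - 3 * (N + 1)

print(f"H = {kruskal_wallis_h([21.5, 56.5, 93], [6, 6, 6]):.2f}")
# H ~ 14.95 > 5.991 -> reject H0
```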
Example
The blood glucose levels observed three hours after administration in three groups
(a control group and two different formulations of an antidiabetic drug) are given
below. Determine the significance of the difference between formulations with the
Kruskal-Wallis test at the 0.05 level of significance.
Null hypothesis: H0: The sample median blood glucose level is identical
Alternative hypothesis: H1: The sample median blood glucose level is not identical
Criterion
Reject the null hypothesis if H > 5.991, which is the value of 0.05 for 2 degrees of
freedom.
H = [12/(18 × 19)] × (R₁²/6 + R₂²/6 + R₃²/6) − 3(19)
  = 71.18 − 57
  = 14.17
For the chi-square test with 2 degrees of freedom, the value must be at least 5.99 to
be significant at the 5% level; since 14.17 > 5.99, the null hypothesis is rejected and
the median blood glucose levels are not identical across the groups.
If statistically significant differences are detected, it is necessary to identify which
treatments are contributing to the differences. To perform pairwise comparisons, the
number of observations in each treatment group must be equal. The difference in the
sum of ranks between two different groups should be calculated. For example, you
would calculate the difference between the control group and one of the formulations,
or alternatively, between two formulations. If the observed differences in ranks
exceed the critical value for the Chi-square distribution, then significant differences
are present between the control and the other two formulations. In some cases, the
difference between the control and formulation 1 may be significant, while in other
cases, the difference between the control and formulation 2 may be significant.
Another possibility is that the difference between the two formulations is significant.
Correction:
When ties are present, a correction can be applied to the computed value. This
adjustment increases the statistic, and may reveal statistically significant differences
that were not apparent in the uncorrected test:
H corrected = H / [1 − Σ(tᵢ³ − tᵢ)/(N³ − N)]
Where tᵢ is the number of tied observations in each group of ties and N is the total
number of observations. For the example above, the corrected value is 14.26.
Exercise
The market values of the equity shares of five pharmaceutical companies in various
months are given below. Using the Kruskal-Wallis test, check whether the average
market value of the shares differs among the five pharmaceutical companies.
A B C D E
85 72 141 73 96
Essentially, the Friedman test is an extension of the sign test and is typically applied
when there are multiple treatments. If only two treatments are involved, the Friedman
test and the sign test yield identical results. This test is particularly useful when the
assumptions required for parametric analysis of variance are not met, or when the
scale of measurement is weak, such as in cases where the data is ordinal. It is also
used when the data can be arranged in a two-way ANOVA design, making it
appropriate for situations where ranking is possible and data is derived from more
than two groups.
For the Friedman test to be applicable, the data should be arranged in blocks (e.g.,
subjects), with each of the c treatments observed once in each of the r blocks, and
the observations should be measurable on at least an ordinal scale.
In this test, each treatment is ranked within its block (column), and the test statistic
is calculated using the following formula.
FM = χ²₍c−1₎ = [12/(rc(c + 1))] ΣRᵢ² − 3r(c + 1)
d.f. = c − 1
Where:
r = number of rows (blocks)
c = number of columns (treatments)
Rᵢ = sum of the ranks for treatment i
If the sample size is sufficiently large, the chi-square distribution with c − 1 degrees
of freedom can be used to approximate the test of significance.
Ranks should be assigned within each block (such as subject, formulation, method,
etc.). In the case of tied ranks, the average rank should be given to the tied
observations.
For larger sample sizes, if the number of treatments (k) exceeds 5, or if the sample
size (n) is greater than 13, the chi-square critical value table should be used to assess
significance.
Example: The disintegration times from six different formulations, each containing
distinct disintegrants, are provided below. The objective is to determine whether the
variation in disintegration times due to the different disintegrants is statistically
significant.
In the above problem, the lowest disintegration time for formulation 1 is observed
with disintegrant A (8 minutes), so it is assigned rank 1. The next lowest value, from
the formulation containing disintegrant B (12), is assigned rank 2, and the
formulation containing disintegrant C (15) is assigned rank 3. The same procedure is
followed for the other formulations. The assigned ranks are shown in the following
table.
χ2 tabled value with level of Probability 0.05 and having d.f = C–1 = 2 is 5.99
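A hedged sketch of the Friedman test in scipy: the example's raw disintegration-time table is not reproduced in the text, so the values below are invented purely for illustration; each argument holds one disintegrant's times across the six formulations (blocks).
```python
# Values invented for illustration; each argument is one disintegrant's
# times across the six formulations (blocks).
from scipy.stats import friedmanchisquare

disintegrant_A = [8, 9, 7, 10, 8, 9]
disintegrant_B = [12, 13, 11, 14, 12, 13]
disintegrant_C = [15, 16, 14, 17, 15, 16]

stat, p_value = friedmanchisquare(disintegrant_A, disintegrant_B, disintegrant_C)
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")  # compare with 5.99 (df = 2)
```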
Exercise.
The following are the runs scored by 11 batsmen of various cricket teams in three
seasons. Test the null hypothesis that the batsmen constituting the population from
which the sample was drawn perform equally well in all three seasons, against the
alternative hypothesis that they perform better in at least one season.
Batsman no A B C
To obtain an approximate F distribution with c − 1 and (c − 1)(r − 1) degrees of
freedom, which gives a better comparison than the chi-square approximation for the
Friedman test, the following modification is applied. The statistic T₂ is calculated as:
T₂ = (r − 1)[B₂ − rc(c + 1)²/4] / (A₂ − B₂)
A₂ = Σxᵢ², where the xᵢ are the individual ranks. If there are no ties,
A₂ = rc(c + 1)(2c + 1)/6 = 6 × 3 × 4 × 7/6 = 84.
B₂ = (1/r) ΣCᵢ² = (1/6)[(6)² + (12)² + (18)²] = (1/6)(504) = 84
T₂ = (6 − 1)[84 − 6 × 3 × (3 + 1)²/4] / (84 − 84) = 5(84 − 72)/0 = ∞
For pairwise comparisons, treatments i and j differ significantly if
│Cᵢ − Cⱼ│ > t √(2r(A₂ − B₂)/((r − 1)(c − 1)))
where t is the tabled t value with (r − 1)(c − 1) d.f. at the specified level (usually
0.05) and Cᵢ is the rank sum for treatment i.
For the above problem, the tabled t value for (6 − 1)(3 − 1) = 10 degrees of freedom
at the 5% level of significance (two-tailed) is 2.23. Because A₂ − B₂ = 0, the critical
difference is 0, so any difference in rank sums greater than 0 is significant at the 0.05
level. The rank sums observed for disintegrants A & B, A & C, and B & C therefore
differ significantly, indicating that the differences in disintegration time among
tablets containing the three disintegrants are significant.
Both the Kruskal-Wallis and Friedman tests are non-parametric methods used to
compare three or more independent groups (non-Gaussian distribution). When
analyzing such data, a post-test is often required for more precise comparisons
between the groups. Dunn’s test is one such post-test that performs pairwise
comparisons across groups. It evaluates the difference in the sum of ranks between
two columns against the expected average difference, and computes a 't' value for
each pair of columns. These 't' values help determine whether the differences are
statistically significant.
In Dunn’s test, the mean rank differences between the groups are compared pair by
pair. For each comparison, a standard deviation is calculated. The absolute difference
in ranks for each pair is then divided by its standard deviation to produce a 't' statistic.
This statistic is compared against the standard normal z distribution. If it exceeds the
critical z value at the chosen significance level, the null hypothesis, which posits no
difference between the ranks, is rejected. Multiple comparisons are conducted only if
a significant overall result is found in the main test at the usual α (0.05) level. This
approach helps mitigate errors that might arise from conducting multiple comparisons
in non-parametric tests.
Tᵢⱼ = Dᵢⱼ/σᵢⱼ = (R̄ᵢ − R̄ⱼ) / √([N(N + 1)/12](1/nᵢ + 1/nⱼ))
Where R̄ᵢ and R̄ⱼ are the mean ranks of the two groups being compared, nᵢ and nⱼ
are their sample sizes, and N is the total number of observations.
Example: Tablets of drug X were prepared by three different techniques: wet
granulation, dry granulation, and direct compression. The disintegration times
observed from the tablets prepared by these methods are given below.
Direct compression: 8, 7, 6, 7, 9, 10
Based on the previous analysis, we can determine whether there are significant
differences in the disintegration times of tablets produced using different techniques.
If significant differences are found, it is important to identify which specific technique
is primarily responsible for these variations.
Solution:
This problem has been addressed previously, and the results demonstrated that the
disintegration times of tablets prepared using different techniques show statistically
significant differences (as determined using the Kruskal-Wallis one-way ANOVA).
The observed differences can be attributed to the following factors:
1. The disintegration times of tablets made with direct compression and wet
granulation methods are significantly different.
2. The disintegration times of tablets made with direct compression and dry
granulation methods are significantly different.
3. The disintegration times of tablets made with wet granulation and dry
granulation methods are significantly different.
To pinpoint the specific reasons behind these differences, the data should be further
analyzed using Dunn’s multiple comparison test.
Dunn's test is applied to each pair of groups in turn to identify the contributing
factors.
The ranks allotted to the tablets prepared with direct compression technique based on
observed disintegration time R1 = 4, 2.5, 1, 2.5,5, 6.5.
Mean rank=21.5/6=3.58
The ranks allotted to the tablets prepared with wet granulation technique based on
observed disintegration time R2 = 8, 9.5, 11.5, 9.5, 11.5, 6.5.
Mean rank=56.5/6=9.42
T₁₂ = (R̄₂ − R̄₁) / √([N(N + 1)/12](1/n₁ + 1/n₂))
    = (9.42 − 3.58) / √((18 × 19/12)(1/6 + 1/6)) = 5.84/3.08 = 1.90
The calculated value is greater than the table Z value, hence the null hypothesis is
rejected and it is concluded that the disintegration times observed from tablets
prepared with the direct compression and wet granulation techniques differ
significantly.
The ranks allotted to the tablets prepared with direct compression technique based on
observed disintegration time R1 = 4, 2.5, 1, 2.5,5, 6.5.
Mean rank=21.5/6=3.58
The ranks allotted to the tablets prepared with dry granulation technique based on
observed disintegration time R3 = 14.5, 13, 16, 14.5, 17, 18
Sum = 93
Mean rank=93/6=15.5
T₁₃ = (15.5 − 3.58)/3.08 = 3.87, which is greater than the table Z value; hence the
null hypothesis is rejected and it is concluded that the disintegration times observed
from tablets prepared with the direct compression and dry granulation techniques
differ significantly.
The ranks allotted to the tablets prepared with wet granulation technique based on
observed disintegration time R2 = 8, 9.5, 11.5, 9.5, 11.5, 6.5.
Mean rank=56.5/6=9.42
The ranks allotted to the tablets prepared with dry granulation technique based on observed
disintegration time R3 = 14.5, 13, 16, 14.5, 17, 18
Sum = 93
Mean rank=93/6=15.5
T₂₃ = (15.5 − 9.42)/3.08 = 1.97, which is greater than the table Z value; hence the
null hypothesis is rejected and it is concluded that the disintegration times observed
from tablets prepared with the wet granulation and dry granulation techniques differ
significantly.
Finally, it is concluded that the three techniques differ in their performance, and each
pairwise comparison is statistically significant.
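For reference, the pairwise Dunn statistics can be recomputed from the mean ranks above (N = 18, n = 6 per group) with a small helper; each resulting value is compared against the chosen z critical value.
```python
# Pairwise Dunn statistics from the mean ranks (N = 18, n = 6 per group).
import math

def dunn_t(mean_rank_i, mean_rank_j, n_i, n_j, N):
    """|difference in mean ranks| / SE, compared with a z critical value."""
    se = math.sqrt((N * (N + 1) / 12) * (1 / n_i + 1 / n_j))
    return abs(mean_rank_i - mean_rank_j) / se

print(f"DC vs wet granulation: {dunn_t(3.58, 9.42, 6, 6, 18):.2f}")   # ~1.90
print(f"DC vs dry granulation: {dunn_t(3.58, 15.5, 6, 6, 18):.2f}")   # ~3.87
print(f"wet vs dry granulation: {dunn_t(9.42, 15.5, 6, 6, 18):.2f}")  # ~1.97
```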
Table 10.1: Critical values for the Mann-Whitney U-Test
Table 10.2
Prof. (Dr.) T.E. Gopala Krishna Murthy is serving as a Professor and Principal at Bapatla
College of Pharmacy, Bapatla, Andhra Pradesh, India. He has 30 years of experience in
teaching, research, and administration. He completed his B.Pharm from Gulbarga University,
M.Pharm from Birla Institute of Technology, and Ph.D. from J.N.T. University. He has
published 271 research and review papers in international and national journals, of which 116
papers are published in Scopus-indexed journals. He serves as a reviewer and editorial board
member for reputed journals. He has delivered numerous guest lectures at various national-
level events. Prof. Murthy has published 6 books in pharmacy with reputed publishers. He
holds 9 granted patents and 6 published patents. He has received 2 research grants from
AICTE. He has guided 25 Ph.D. and 105 M.Pharm students. He has received the Meritorious
Teacher Award from JNTUK three times, the Best Principal Award from JNTUK, the Best
Researcher Award from JNTUK, and the Best Teacher Award from APTI AP State Branch. He
has also received fellowships from the Association of Biotechnology & Pharmacy and the AP
Academy of Sciences, and is a recipient of the Sir C. V. Raman award from the Science City
of Andhra Pradesh. He is currently acting as a CEC Member of the Indian Pharmaceutical
Association, Mumbai, and as the Regional Coordinator for the AP Academy of Sciences,
Guntur Region. He also visited the Tashkent Institute of Pharmaceutical Sciences, Uzbekistan,
to deliver guest lectures. He is serving as a member of the Board of Studies for Pharmacy
Courses at JNTUK.
He is a life member of professional bodies such as IPA and APTI.
Mrs. Ch. Sushma is currently serving as an Assistant Professor at Bapatla College of Pharmacy
with 17 years of teaching experience. She holds a postgraduate degree in Pharmaceutical
Analysis and Quality Assurance from Jawaharlal Nehru Technological University, Kakinada.
Mrs. Sushma has guided 5 M.Pharm and 10 B.Pharm students.