Statistics and Data Analysis
in
Proficiency Testing
Michael Thompson
School of Biological and Chemical Sciences
Birkbeck College (University of London)
Malet Street
London WC1E 7HX
[email protected]
Organisation of a proficiency test
“Harmonised Protocol”. Pure Appl Chem. 2006, 78, 145-196.
Where do we use statistics in
proficiency testing?
• Finding a consensus and its uncertainty to
use as an assigned value
• Assessing participants’ results
• Assessing the efficacy of the PT scheme
• Testing for sufficient homogeneity and
stability of the distributed test material
• Others
Criteria for an ideal scoring
method
• Adds value to raw results.
• Easily understandable, based on the
properties of the normal distribution.
• Has no arbitrary scaling transformation.
• Is transferable between different
concentrations, analytes, matrices, and
measurement principles.
How can we construct a score?
• An obvious idea is to utilise the properties
of the normal distribution to interpret the
results of a proficiency test.
BUT…
We do not make
any assumptions
about the actual
data.
Example dataset A
• Determination of protein nitrogen in a meat
product.
A weak scoring method
$$z = \frac{x - \bar{x}}{s}, \qquad \bar{x} = 2.126, \quad s = 0.077$$
• On average, slightly more than 95% of laboratories
receive a z-score within the range ±2.
Robust mean and standard
deviation: $\hat{\mu}_{rob}$, $\hat{\sigma}_{rob}$
• Robust statistics is applicable to datasets that
look like normally distributed samples
contaminated with outliers and stragglers (i.e.,
unimodal and roughly symmetric).
• The method downweights the otherwise large
influence of outliers and stragglers on the
estimates.
• It models the central ‘reliable’ part of the dataset.
Can I use robust estimates?
[Figure: skewed, bimodal, and heavy-tailed distributions plotted along the measurement axis.]
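Huber’s H15
$$\mathbf{x}^T = (x_1, x_2, \ldots, x_n)$$
Set $1 \le k \le 2$, $p = 0$, $\hat{\mu}_0 = \text{median}$, $\hat{\sigma}_0 = 1.5 \times \text{MAD}$
$$\tilde{x}_i = \begin{cases}
x_i & \text{if } \hat{\mu}_p - k\hat{\sigma}_p < x_i < \hat{\mu}_p + k\hat{\sigma}_p \\
\hat{\mu}_p - k\hat{\sigma}_p & \text{if } x_i < \hat{\mu}_p - k\hat{\sigma}_p \\
\hat{\mu}_p + k\hat{\sigma}_p & \text{if } x_i > \hat{\mu}_p + k\hat{\sigma}_p
\end{cases}$$
$$\hat{\mu}_{p+1} = \text{mean}(\tilde{x}_i)$$
$$\hat{\sigma}^2_{p+1} = f(k)\,\text{var}(\tilde{x}_i)$$
If not converged, $p = p + 1$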
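The iteration can be sketched in a few lines of Python. This is a minimal illustration, not the AMC reference implementation. It assumes k = 1.5, takes f(k) to be the reciprocal of the variance of a standard normal variable winsorised at ±k (about 1/0.778 for k = 1.5), and uses an arbitrary convergence tolerance. The function name `huber_h15` and the example data are hypothetical.

```python
import math
import statistics


def huber_h15(x, k=1.5, tol=1e-6, max_iter=200):
    """Return (robust mean, robust sd) by iterated winsorisation."""
    mu = statistics.median(x)
    mad = statistics.median([abs(xi - mu) for xi in x])
    sigma = 1.5 * mad  # starting value as given on the slide

    # Assumed form of f(k): 1 / variance of a standard normal
    # variable winsorised at +/-k.
    cdf = 0.5 * (1.0 + math.erf(k / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * k * k) / math.sqrt(2.0 * math.pi)
    beta = (2.0 * cdf - 1.0) - 2.0 * k * pdf + 2.0 * k * k * (1.0 - cdf)

    for _ in range(max_iter):
        if sigma == 0.0:  # more than half the data identical: nothing to do
            break
        lo, hi = mu - k * sigma, mu + k * sigma
        # Pull values outside [lo, hi] back to the nearest limit.
        w = [min(max(xi, lo), hi) for xi in x]
        new_mu = statistics.mean(w)
        new_sigma = math.sqrt(statistics.variance(w) / beta)
        converged = (abs(new_mu - mu) < tol * sigma
                     and abs(new_sigma - sigma) < tol * sigma)
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu, sigma


# Hypothetical results: nine consistent values and one gross outlier.
data = [2.10, 2.12, 2.13, 2.11, 2.14, 2.12, 2.13, 2.15, 2.09, 3.50]
mu_rob, sigma_rob = huber_h15(data)
print(f"mean = {statistics.mean(data):.3f}, sd = {statistics.stdev(data):.3f}")
print(f"robust mean = {mu_rob:.3f}, robust sd = {sigma_rob:.3f}")
```

The outlier is not deleted. It is moved to the edge of the acceptance interval at each pass, so it still contributes, but with much reduced influence.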
References: robust statistics
• Analytical Methods Committee,
Analyst, 1989, 114, 1489
• AMC Technical Brief No 6, 2001
(download from www/rsc.org/amc)
• P J Rousseeuw, J. Chemomet, 1991, 5, 1.
Is that enough?
$$z = \frac{x - \hat{\mu}_{rob}}{\hat{\sigma}_{rob}}, \qquad \hat{\mu}_{rob} = 2.128, \quad \hat{\sigma}_{rob} = 0.048$$
• On average, slightly less than 95% of
laboratories receive a z-score between ±2.
What more do we need?
• We need a method that evaluates the data
in relation to its intended use, rather than
merely describing it.
• This adds value to the data rather than
simply summarising it.
• The method is based on fitness for
purpose.
Fitness for purpose
• Fitness for purpose occurs when the uncertainty
of the result uf gives best value for money.
• If the uncertainty is smaller than uf , the analysis
may be too expensive.
• If the uncertainty is larger than uf , the cost and
the probability of a mistaken decision will rise.
Fitness for purpose
• The value of uf can sometimes be estimated
objectively by decision theoretic methods, but is
most often simply agreed between the
laboratory and the customer by professional
judgement.
• In the proficiency test context, uf should be
determined by the scheme provider.
Reference: T Fearn, S A Fisher, M Thompson,
and S L R Ellison, Analyst, 2002, 127, 818-824.
A score that meets all of the
criteria
• If we now define a z-score thus:
$$z = \frac{x - \hat{\mu}_{rob}}{\sigma_p}, \quad \text{where } \sigma_p = u_f$$
we have a z-score that is both robustified against
extreme values and tells us something about fitness
for purpose.
• In an exactly compliant laboratory, scores of 2<|z|<3
will be encountered occasionally, and scores of |z|>3
rarely. Better performers will receive fewer of these
extreme z-scores.
Example data A again
• Suppose that the fitness for purpose criterion set
for the analysis is an RSD of 1%. This gives us:
$$\sigma_p = 0.01 \times 2.1 = 0.021$$
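A short Python sketch of the fitness-for-purpose score for dataset A. The assigned value 2.128 (the robust mean) and the 1% RSD criterion come from the slides. The three laboratory results, the function names, and the band labels are hypothetical and used only for illustration.

```python
def z_score(x, assigned_value, sigma_p):
    """z = (x - assigned value) / sigma_p, with sigma_p fixed in advance."""
    return (x - assigned_value) / sigma_p


def band(z):
    """Place a score in one of the |z| ranges discussed on the slides."""
    a = abs(z)
    if a <= 2:
        return "|z| <= 2"
    if a <= 3:
        return "2 < |z| <= 3"
    return "|z| > 3"


assigned_value = 2.128   # robust mean of dataset A
sigma_p = 0.01 * 2.1     # fitness-for-purpose criterion: RSD of 1%

# Hypothetical participant results, not taken from dataset A.
results = {"lab 1": 2.13, "lab 2": 2.18, "lab 3": 2.05}
for lab, x in results.items():
    z = z_score(x, assigned_value, sigma_p)
    print(f"{lab}: x = {x:.3f}, z = {z:+.2f}, {band(z)}")
```

Because $\sigma_p$ is set by the scheme provider rather than estimated from the participants' results, a given deviation from the assigned value earns the same score in every round.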
Finding a consensus from
participants’ results
• The consensus is not theoretically the best
option for the assigned value but is usually
the only practicable value.
• The consensus is not necessarily identical
with the true value. PT providers have to
be alert to this possibility.
What is a ‘consensus’?
• Mean? - easy to calculate, but affected by
outliers and asymmetry.
• Robust mean? - fairly easy to calculate, handles
outliers but affected by asymmetry.
• Median? - easy to calculate, more robust for
asymmetric distributions, but larger standard
error than robust mean.
• Mode? - intuitively good, difficult to define,
difficult to calculate.
The robust mean as consensus
• The robust mean provides a useful consensus
in the great majority of instances, where the
underlying distribution is roughly symmetric
and there are 0-10% outliers.
• The uncertainty of this consensus can be
safely taken as
$$u(x_a) = \hat{\sigma}_{rob} / \sqrt{n}$$
When can I use robust estimates?
[Figure: skewed, bimodal, and heavy-tailed distributions plotted along the measurement axis.]
Skewed distributions
• Skews can arise when the participants’
results come from two or more
inconsistent methods.
• They can also arise as an artefact at low
concentrations of analyte as a result of
data recording practice.
• Rarely, skews can arise when the
distribution is truly lognormal.
Possible use of a trimmed data
set?
Can I use the mode?
How many modes? Where are they?
The normal kernel density for
identifying a mode
$$y = \frac{1}{nh} \sum_{i=1}^{n} \Phi\left(\frac{x - x_i}{h}\right)$$
where $\Phi$ is the standard normal density,
$$\Phi(a) = \frac{\exp(-a^2/2)}{\sqrt{2\pi}}$$
AMC Technical Brief No. 4
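A minimal Python sketch of the kernel density and a simple grid search for its modes. The bandwidth h has to be chosen by the user. The value used here, the grid resolution, the function names, and the example data are all assumptions made for illustration.

```python
import math


def normal_kernel_density(x, data, h):
    """Kernel density y(x) = (1 / nh) * sum of Phi((x - x_i) / h)."""
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return total / (len(data) * h * math.sqrt(2.0 * math.pi))


def find_modes(data, h, grid_points=400):
    """Return local maxima of the density as (location, height), highest first."""
    lo, hi = min(data) - 3 * h, max(data) + 3 * h
    step = (hi - lo) / (grid_points - 1)
    grid = [lo + i * step for i in range(grid_points)]
    dens = [normal_kernel_density(g, data, h) for g in grid]
    modes = [(grid[i], dens[i]) for i in range(1, grid_points - 1)
             if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]]
    return sorted(modes, key=lambda m: m[1], reverse=True)


# Hypothetical results: a main group near 1.0 and a smaller group near 3.0.
data = [0.9, 0.95, 1.0, 1.0, 1.05, 1.1, 2.9, 3.0, 3.1]
for location, height in find_modes(data, h=0.2):
    print(f"mode at {location:.2f}, density {height:.3f}")
```

A larger h smooths the density and merges nearby modes, while a smaller h produces spurious ones, so the number of modes reported depends on the bandwidth.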
A normal kernel
A kernel density
Another kernel density
Graphical representation of sample data
Kernel density of the aflatoxin data
Uncertainty of the mode
• The uncertainty of the consensus can be
estimated as the standard error of the
mode by applying the bootstrap to the
procedure.
• The bootstrap is a general procedure
based on resampling for estimating
standard errors of complex statistics.
• Reference: Bump-hunting for the proficiency
tester – searching for multimodality. P J
Lowthian and M Thompson, Analyst, 2002, 127,
1359-1364.
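The bootstrap procedure can be sketched as follows: resample the results with replacement, find the main mode of the kernel density of each resample, and take the standard deviation of those modes as the standard error. This is an illustration under stated assumptions, not the procedure of the cited paper. The bandwidth, the number of resamples, the grid search, the function names, and the simulated data are all hypothetical.

```python
import math
import random
import statistics


def kde_mode(data, h, grid_points=100):
    """Location of the highest point of a normal kernel density (grid search)."""
    lo, hi = min(data), max(data)
    if hi == lo:
        return lo
    step = (hi - lo) / (grid_points - 1)
    best_x, best_d = lo, -1.0
    for i in range(grid_points):
        x = lo + i * step
        # The constant 1 / (n h sqrt(2 pi)) does not affect the location.
        d = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
        if d > best_d:
            best_x, best_d = x, d
    return best_x


def bootstrap_se_of_mode(data, h, n_boot=200, seed=0):
    """Standard deviation of the mode over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    modes = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in range(n)]
        modes.append(kde_mode(resample, h))
    return statistics.stdev(modes)


# Simulated stand-in for a set of participants' results.
gen = random.Random(1)
data = [gen.gauss(10.0, 1.0) for _ in range(40)]
h = 0.5  # assumed bandwidth
print(f"mode = {kde_mode(data, h):.2f}")
print(f"bootstrap standard error = {bootstrap_se_of_mode(data, h):.2f}")
```

The same bandwidth is applied to every resample, so the estimate reflects sampling variation only and not uncertainty in the choice of h.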
The normal mixture model
$$f(y) = \sum_{j=1}^{m} p_j f_j(y), \qquad \sum_{j=1}^{m} p_j = 1$$
$$f_j(y) = \frac{\exp\left(-(y - \mu_j)^2 / 2\sigma^2\right)}{\sigma\sqrt{2\pi}}$$
AMC Technical Brief No 23, and AMC Software.
Thompson, Acc Qual Assur, 2006, 10, 501-505.
Mixture models found by the maximum
likelihood method (the EM algorithm)
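• The M-step
$$\hat{p}_j = \sum_{i=1}^{n} \hat{P}(j \mid y_i) \Big/ n$$
$$\hat{\mu}_j = \sum_{i=1}^{n} y_i \hat{P}(j \mid y_i) \Big/ \sum_{i=1}^{n} \hat{P}(j \mid y_i)$$
$$\hat{\sigma}^2 = \sum_{j=1}^{m} \sum_{i=1}^{n} (y_i - \hat{\mu}_j)^2 \hat{P}(j \mid y_i) \Big/ \sum_{j=1}^{m} \sum_{i=1}^{n} \hat{P}(j \mid y_i)$$
• The E-step
$$\hat{P}(j \mid y_i) = \hat{p}_j f_j(y_i) \Big/ \sum_{j=1}^{m} \hat{p}_j f_j(y_i)$$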
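The two steps alternate until the estimates settle. The Python sketch below assumes a single standard deviation shared by all components and runs a fixed number of iterations instead of testing for convergence. The starting means must be supplied by the user. The function names, the starting values, and the data are hypothetical, and this is not the AMC software.

```python
import math


def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def fit_mixture(y, mu_init, n_iter=200):
    """EM fit of an m-component normal mixture with a common sigma."""
    n, m = len(y), len(mu_init)
    p = [1.0 / m] * m                      # equal starting proportions
    mu = list(mu_init)
    mean_y = sum(y) / n
    sigma = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    for _ in range(n_iter):
        # E-step: probability that result i belongs to component j.
        post = []
        for yi in y:
            w = [p[j] * normal_pdf(yi, mu[j], sigma) for j in range(m)]
            total = sum(w)
            post.append([wj / total for wj in w])
        # M-step: update proportions, means and the common variance.
        for j in range(m):
            nj = sum(post[i][j] for i in range(n))
            p[j] = nj / n
            mu[j] = sum(y[i] * post[i][j] for i in range(n)) / nj
        ss = sum((y[i] - mu[j]) ** 2 * post[i][j]
                 for i in range(n) for j in range(m))
        sigma = math.sqrt(ss / n)          # the posteriors sum to n overall
    return p, mu, sigma


# Hypothetical bimodal results and rough starting means.
y = [0.9, 0.95, 1.0, 1.0, 1.05, 1.1, 2.9, 3.0, 3.1]
p, mu, sigma = fit_mixture(y, mu_init=[0.5, 3.5])
print("proportions:", [round(v, 3) for v in p])
print("means:", [round(v, 3) for v in mu])
print("common sd:", round(sigma, 3))
```

EM converges to a local maximum of the likelihood, so different starting means can give different fits. The modes of a kernel density are one plausible source of starting values.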
Kernel density and fit of 2-component
normal mixture model
Kernel density and variance-inflated
mixture model
Useful References
• Mixture models
M Thompson. Accred Qual Assur. 2006, 10, 501-505.
AMC Technical Brief No. 23, 2006. www/rsc.org/amc
• Kernel densities
B W Silverman, Density estimation for statistics and data
analysis. Chapman and Hall, London, 1986.
AMC Technical Brief, no. 4, 2001 www/rsc.org/amc
• The bootstrap
B Efron and R J Tibshirani, An introduction to the
bootstrap. Chapman and Hall, London, 1993
AMC Technical Brief, No. 8, 2001 www/rsc.org/amc
Conclusions—scoring
• Use z-scores based on fitness for
purpose.
• Estimate the consensus as the robust
mean and its uncertainty as $\hat{\sigma}_{rob}/\sqrt{n}$
if the dataset is roughly symmetric.
• If the dataset is skewed and plausibly
composite, use kernel density methods
or mixture models.
Homogeneity testing
• Comminute and mix bulk material.
• Split into distribution units.
• Select m>10 distribution units at random.
• Homogenise each one.
• Analyse 2 test portions from each in
random order, with high precision, and
conduct one-way analysis of variance on
results.
Design for homogeneity testing
$$s_{an} = \sqrt{MSW}, \qquad s_{sam} = \sqrt{\frac{MSB - MSW}{2}}$$
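For duplicate results on m distribution units, the mean squares and the two standard deviations can be computed directly. The Python sketch below is an illustration. It sets a negative variance estimate to zero, which is a common convention assumed here and not something stated on the slide. The function name and the data are hypothetical.

```python
import math
import statistics


def homogeneity_anova(duplicates):
    """One-way ANOVA for m units analysed in duplicate.

    duplicates: list of (a, b) result pairs, one pair per distribution unit.
    """
    m = len(duplicates)
    unit_means = [(a + b) / 2.0 for a, b in duplicates]
    # Within-unit mean square: m degrees of freedom for m pairs.
    msw = sum((a - b) ** 2 for a, b in duplicates) / (2.0 * m)
    # Between-unit mean square: 2 * variance of the unit means (m - 1 df).
    msb = 2.0 * statistics.variance(unit_means)
    # Assumed convention: a negative estimate is replaced by zero.
    s_sam_sq = max((msb - msw) / 2.0, 0.0)
    return {
        "msw": msw,
        "msb": msb,
        "s_an": math.sqrt(msw),
        "s_sam_sq": s_sam_sq,
        "s_sam": math.sqrt(s_sam_sq),
    }


# Hypothetical duplicate results; a real test would use at least 10 units.
pairs = [(10.0, 10.2), (10.4, 10.6), (9.8, 10.0)]
for name, value in homogeneity_anova(pairs).items():
    print(f"{name} = {value:.4f}")
```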
Problems with simple ANOVA
based on testing
$H_0: \sigma_{sam} = 0$
• Analytical precision too low—method
cannot detect consequential degree of
heterogeneity.
• Analytical precision too high—method
finds significant degree of heterogeneity
that may not be consequential.
(Everything is heterogeneous!)
“Sufficient homogeneity”:
original definition
• Material passes homogeneity test if
$s_{sam} < \sigma_L = 0.3\sigma_p$
• Problems are:
– ssam may not be well estimated;
– too big a probability of rejecting
satisfactory test material.
Fearn test
• Test $H_0: \sigma^2_{sam} \le \sigma^2_L$ by rejecting when
$$s^2_{sam} > \sigma^2_L \frac{\chi^2_{m-1}}{m-1} + s^2_{an} \frac{F_{m-1,m} - 1}{2}$$
Ref: T Fearn and M Thompson, Analyst, 2001, 126, 1414-1417.
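A Python sketch of the decision rule, assuming a 95% confidence level and $\sigma_L = 0.3\sigma_p$ as in the original definition. The critical values are passed in as arguments so that the example needs no statistics library. Those shown are the tabulated upper 5% points for m = 10, and they could equally be obtained from `scipy.stats.chi2.ppf` and `scipy.stats.f.ppf`. The function name and the input values are hypothetical.

```python
def fearn_test(s_sam_sq, s_an_sq, sigma_p, m, chi2_crit, f_crit):
    """Return (critical value, reject) for the sufficient-homogeneity test.

    chi2_crit: upper critical value of chi-squared with m - 1 df
    f_crit:    upper critical value of F with (m - 1, m) df
    """
    sigma_l_sq = (0.3 * sigma_p) ** 2       # allowable sampling variance
    f1 = chi2_crit / (m - 1)
    f2 = (f_crit - 1.0) / 2.0
    critical = f1 * sigma_l_sq + f2 * s_an_sq
    return critical, s_sam_sq > critical


# m = 10 units; 95% points: chi-squared(9) = 16.919, F(9, 10) = 3.020.
# The variances and sigma_p below are hypothetical.
critical, reject = fearn_test(
    s_sam_sq=0.05, s_an_sq=0.01, sigma_p=1.0,
    m=10, chi2_crit=16.919, f_crit=3.020,
)
print(f"critical value = {critical:.4f}")
print("material fails the test" if reject else "material passes the test")
```

The second term widens the limit when the analytical variance is large, which reduces the chance of rejecting satisfactory material because of a poorly estimated $s_{sam}$.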
Problems with homogeneity
data
• Problems with data are common:
e.g., no proper randomisation, insufficient
precision, biases, trends, steps,
insufficient significant figures recorded,
outliers.
• Laboratories need detailed instructions.
• Data need careful scrutiny before
statistics.
• HP1 is incorrect in saying that all outlying
data should be retained.
General references
• The Harmonised Protocol (revised)
M Thompson, S L R Ellison and R Wood
Pure Appl. Chem., 2006, 78, 145-196.
• R E Lawn, M Thompson and R F Walker,
Proficiency testing in analytical chemistry. The
Royal Society of Chemistry, Cambridge, 1997.
• ISO Guide 43. International Standards
Organisation, Geneva, 1997.