A Short History of Chemometrics: A Personal View: Richard G. Brereton


Perspective

Received: 06 February 2014, Revised: 10 April 2014, Accepted: 16 April 2014, Published online in Wiley Online Library: 19 May 2014

(wileyonlinelibrary.com) DOI: 10.1002/cem.2633

A short history of chemometrics: a personal view
Richard G. Brereton*

This article traces chemometrics back to its origins in scientific computing in the 1960s. Its development is compared with that of other computational disciplines such as bioinformatics. The change in geographical origins of papers published in the core chemometrics literature is discussed. It is concluded that the level of core activities in this area has hardly changed over several decades, whilst there has been a significant expansion in non-expert users of packages over this period. It is estimated that around 2% of people encountering chemometrics in their research can be considered real experts. The problems of non-experts using chemometrics methods with limited knowledge of the statistical fundamentals are explored. The contrasting development of chemometrics compared with, for example, computational chemistry and bioinformatics, is interpreted in terms of the changing financial pressures on research over its key developmental phase, as illustrated by the change in academic finance in the UK over the past 50 years.
Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: historical review; origins of chemometrics; skill gap; financial development; publication analysis

1. EARLY HISTORY

1.1. Origins of computational metric and informatics disciplines

The origins of chemometrics can be traced back to the 1960s, when scientific computing became generally accessible. Pioneers would use large mainframes, often programming in FORTRAN or ALGOL, inputting and storing programs using punchcards and paper tape, often time slicing, with a line printer output arriving the next day. During this era, computers were not viewed as essential desktop tools, but as large and remote calculating machines for specialist programmers. The first recognisable and usable programmable modern computers had their origins in the 1940s, spurred on by the war effort, especially code breaking, although prototype computational machines had by then been under development for more than a century.

As the 1960s developed, mainstream scientists could have access to computers, and it was no longer necessary to be an engineer or mathematician to use these machines. This allowed several disciplines to emerge, some of which have developed as major, well-funded areas, often spawning departments and institutes, and others that have remained minor curiosities.

Computer-aided structure elucidation [1,2], primarily via the Dendral project, was a major development that is often credited with the birth of artificial intelligence. Computer-aided synthesis [3,4] also had its origins in that decade. Both developments were highly funded and considered important, but neither has developed into a major field, and both are primarily historic curiosities.

Quantum chemistry as a theoretical discipline had its origins almost a century earlier but was inhibited by the need to perform large hand calculations (a PhD in the 1930s might involve calculating a few integrals). It was massively stimulated in the 1960s by access to computers: being considered very much core science, the ideas took off rapidly in that period. The discipline was and still is often called theoretical or computational chemistry and continues as a major discipline linked to molecular modelling to this day [5]. Large databases of FORTRAN code deposited in the 1960s are still accessed today. The Gaussian software package was first reported by Pople and co-workers in 1970 [6], leading to a Nobel Prize, and is still maintained and used now. Quantitative structure–activity relationships [7,8] were recognised as a specific discipline a little later, in the 1970s, but as is common, this area had started to develop many years before: classical structure–activity relationships, such as between reactivity and substituents on benzene rings, were known for decades, but the major catalyst was the wide availability of computer power.

Bioinformatics and chemoinformatics also had their origins in this era. Subsequently, bioinformatics became important because of the human genome project, which rapidly increased its visibility, with concomitant funding, centres, courses and departments. Many medical projects involve some genetic profiling and hence the need for data mining and bioinformatics. Chemoinformatics has also taken off. It differs from chemometrics, as it involves database mining and structural representation, which form a large area given the huge commercial databases of compounds and their properties. There is some relationship between the development of chemoinformatics and bioinformatics. Both disciplines were catalysed in the 1980s and 1990s when large amounts of genetic and chemical information started to become publicly available.

* Correspondence to: Richard G. Brereton, School of Chemistry, University of Bristol, Cantocks Close, Bristol BS8 1TS, UK. E-mail: [email protected]

J. Chemometrics 2014, 28: 749–760 Copyright © 2014 John Wiley & Sons, Ltd.

1.2. Birth of chemometrics

So what of chemometrics? So too did early computer-oriented scientists become active in what we now regard as chemometrics, in the 1960s and early 1970s. Malinowski [9], Jurs [10,11] and Massart [12] were publishing work that we would now recognise as chemometrics during this period. Meanwhile, Wold invented the word chemometrics (or in Swedish kemometri) for a grant application in late 1971 and then joined with Kowalski in 1974 to create the International Chemometrics Society [13]. The first paper with the word chemometrics in it was published by Wold in 1972 [14]; remarkably, considering the historic importance of the first mention of chemometrics in the literature, there are only seven recorded citations to this paper in Web of Science at the time of writing.

Wold and Kowalski are credited with creating chemometrics, but in practice, although certainly amongst the important pioneers, they named an existing discipline that had already been seeded in the mid-1960s. If a new species is found in the jungle, the explorer is given great credit for finding and naming this species, but the species existed (or evolved) before the explorer came onto the scene. Hence, chemometrics can be considered as an area that took off with the advent of scientific computing, especially with the development of computerised laboratory-based instrumentation, but how does it differ from the other disciplines that were developing at the time? Some people do not distinguish chemometrics from statistics in analytical chemistry [15], and indeed, some of the formal definitions of chemometrics would make no distinction; however, statistical methods in analytical chemistry have been around for more than a century and can be regarded more as a parent of the subject. What distinguishes chemometrics is primarily the use of computationally intense approaches, most of which are multivariate, although there is certainly a relationship between the two disciplines. For example, the t-statistic commonly employed in traditional analytical chemistry to compare the means of two distributions can easily be extended to multivariate situations where the F-statistic would more appropriately be employed. It is considered by many that chemometrics should form an element of the education of most analytical chemists.

Experimental design is also viewed by many as an integral part of chemometrics. However, the concept of using formalised designs, especially in industrial chemistry, has been recognised since the 1940s, a product of the war effort. Davies published a book [16] based on experiences of a team of engineers, chemists and statisticians in ICI. Prior to this era, the use of experimental design was particularly important in the agricultural sector and viewed as a cornerstone of biometrics. There is a series of books and papers over many years on experimental design and optimisation as oriented towards chemists. Deming was actively advocating simplex methods in the early 1970s [17].

Into this mix emerged chemometrics, kicking and screaming, named and discovered by Wold, but involving a very diverse cross section of investigators. The date of birth cannot be exactly ascertained but was probably somewhere around 1965, with the first recognisable publications in the late 1960s.

A meeting held in Cosenza in 1983 [18] was probably the first major attempt to get together a diverse international range of scientists working in this discipline, although there had been several other initiatives such as symposia and reviews running up to this event. If we date the anonymous birth of the subject to 1965, chemometrics reached 18 years of age in 1983 (commonly considered the age of adulthood in many countries) and by the end of the decade reached 25 years of age; the years between 18 and 25 are often spent studying. So much happened fast in the 1980s, when chemometrics developed via adolescence to true adulthood, from the first dedicated journals (Chemometrics and Intelligent Laboratory Systems and Journal of Chemometrics) to the first book with chemometrics in the title [19], several ACS symposia, the first book series (Research Studies Press), the first dedicated software (ARTHUR, SIMCA and UNSCRAMBLER) and the first workshops. This author contributed the first formal paper to be published in the chemometrics literature [20]. Since then, of course, there have been numerous books with chemometrics in the title, and a very large number of meetings.

By the end of the 1980s, the future looked rosy. From a disparate band of enthusiasts in the 1960s and 1970s, often dab hands at programming and interested in computers, the discipline’s unruly early years had been tamed by the first attempts to formalise the subject.

However, over the years, other applied informatics and computationally based disciplines have overtaken chemometrics. Bioinformatics, still in its infancy before the human genome project; quantitative structure–activity relationships, not yet joined at the hip to quantum chemistry; and chemoinformatics have all overtaken chemometrics as accepted and formal core scientific disciplines. Chemometrics has certainly survived, and there is definitely a very organised corpus of knowledge with established journals and books, but it has not led to many academic departments or institutes and primarily consists of a few large research groups led by one or two investigators, a significant number of lone individuals or small ‘cells’ primarily in industry, and a lot of software that is widely available. It has, though, survived, unlike computer-aided synthesis or computer-aided structure elucidation. The reasons for this development will be examined in the following.

2. WHERE IS CHEMOMETRICS TODAY?

In a good detective story, often, after setting the scene, the crime such as a murder happens, and then the novelist or scriptwriter flashes back to the events that caused this crime, as the story unravels. So rather than a descriptive linear historic text, we will wind forward to the present and in Section 3 look back at events in the intervening years to find out why chemometrics has reached its present state.

It is important to understand this author’s perspective. Although much involved in development of the subject since the late 1980s, as one of the first editors of Chemometrics and Intelligent Laboratory Systems [21], publishing and editing some early books [22,23], international workshops and national organisation, this author has primarily been an outsider looking in. The central cabal of chemometrics is a small group, which can hardly expand; like in a fish tank, there is a limit to the number of fish that a tank can accommodate easily, although as old ones die, new ones can be introduced. So the author’s observations are primarily those of an outsider looking in. This author’s PhD and postdoctoral work in Cambridge in the late 1970s and early 1980s took him from physical organic chemistry to mathematical processing of analytical data, with good collaboration with mathematicians in the maximum entropy approach, which at the time was very new, for denoising nuclear magnetic resonance (NMR) spectra [24], and so he did not develop from the small inner bubble of groups that were emerging in the late 1970s.


If we accept the birth of the subject to be the mid-1960s, then chemometrics is approaching 50. Like a cat adopted as a stray, we can only be approximate about its age, but it has moved from early adulthood to middle age, a time when a person’s life has taken up its characteristics, when early promise has been achieved or otherwise, when wealth, health, career, possessions and family are following a lifetime’s route. So this is a good time to look back and see what went on in the life of chemometrics and whether it is too late to change.

2.1. How has activity changed as evidenced from core journals?

A remarkable observation when looking forward from the 1980s to 2013 is that core chemometrics activity has hardly increased and indeed has reduced in the regions where the subject was born. In order to understand this, it is useful to compare the papers authored in the two core chemometrics journals, Journal of Chemometrics and Chemometrics and Intelligent Laboratory Systems, between 1992 and 2012 and their geographical spread. The former year is taken as a year when Journal of Chemometrics had a critical level (27) of publications to make an analysis meaningful. We use Web of Knowledge for the results. Figure 1 is a breakdown of authorship of papers in these two journals by country (if there is co-authorship from more than one country, each is counted). A country where fewer than two papers have been published is excluded from the analysis.

There are some interesting conclusions. The first is that although Journal of Chemometrics had a lower volume of publications, the geographical spread of authorship in the two journals was not much different. The six countries represented are the USA, the UK, Italy, Norway, Sweden and Spain. These correspond to the same six most represented countries in Chemometrics and Intelligent Laboratory Systems for the same year, apart from the Netherlands replacing Spain. If we consider these as the seven most prolific countries by publications, we see a remarkable change in 2012. The USA is no longer so dominant. In Chemometrics and Intelligent Laboratory Systems, papers from the USA declined from 40.2% to 8.9%, and in Journal of Chemometrics, from 29.6% to 16.4%. Sweden, the Netherlands and the UK suffer strong declines, each appearing in only one of the journal lists; Norwegian authors publish just one more paper than those from Serbia and Tunisia in Journal of Chemometrics and suffer a small decline in Chemometrics and Intelligent Laboratory Systems. Italy tends to retain a small role in both years, of between 6% and 11% of publications, and only Spain improves its profile. France improves its profile in Chemometrics and Intelligent Laboratory Systems. In contrast, there is a huge increase in papers from China, and both Iran and India are starting to increase their output. These three countries together account for 50 papers in Chemometrics and Intelligent Laboratory Systems in 2012 compared to two in 1992. For Journal of Chemometrics, this trend appears slightly less striking, increasing from 0 in 1992 to 14 in 2012. There are small increases for Hungary and Denmark primarily attributed to lead people, and Belgium is fairly static in numbers of papers but reduced in proportion. For 2013, these trends are even more remarkable: at the time of writing, the full records are not available, but so far, for Journal of Chemometrics, 15 or 32.6% of papers are from Iran, more than double those from the USA, and 43 papers or 33.59% for Chemometrics and Intelligent Laboratory Systems are from China, followed by 11 papers (8.59%) from Iran in second place.

So what has happened is that the chemometrics journals are publishing far fewer papers from the classical regions of Scandinavia, the USA, Benelux and the UK, where some may regard the discipline as having started, and very many more from emerging countries such as China, Iran and India. This is an amazing demographical change. The only region that has remained static over the past two decades is Southern Europe, especially with Italy amongst the top six nations in both journals in both 1992 and 2012 and a good representation from Spain, which, however, is

Figure 1. Publication statistics from the two core chemometrics journals in 1992 and 2012: countries with fewer than two papers are recorded under ‘others’.


uneven between the two journals. Hence, we may conclude that the trends in Southern Europe are intermediate between those of Northern Europe/the USA and emerging nations.

However, this, as we will discuss later, does not necessarily mean a reduction in chemometrics activities in the more classical Northern European countries, but fewer people identifying themselves as chemometrics experts.

In addition, there has been very little increase in the number of papers published in these journals. In 1992, Chemometrics and Intelligent Laboratory Systems published 139 papers compared with 158 in 2012. For Journal of Chemometrics, the increase appears more dramatic, from 27 in 1992 to 67 in 2012, but this is from a low base, and the journal gradually developed momentum in the late 1990s. None of this, however, suggests a large increase in core activity and, if the contributions of Asian countries were removed, would be evidence of a decline.

In order to be sure that these trends are not due to global trends in productivity, whereby Chinese, Indian and Iranian workers are starting to dominate international journals, we look at the comparable statistics for the journal Bioinformatics as illustrated in Figure 2. The same lead countries are represented in both years, with the USA, the UK, Germany and France all in the top for productivity, but with China included amongst this group in 2012. Iran and India are not represented in either year. Hence, the main countries remain somewhat stable over the years. This contrasts strongly with the dramatic changes that have occurred in chemometrics.

The big change in Bioinformatics over this period is the volume of core literature, with 745 papers in 2012 compared with 103 in 1992. It is interesting to note that fewer papers were published in this journal in 1992 than in Chemometrics and Intelligent Laboratory Systems. Twenty-five years ago, many may have anticipated that chemometrics would have a brighter future than bioinformatics, but this has not been realised. Core research activities in Western countries have on the whole stayed static or declined, being replaced by Eastern nations, but at a fairly comparable level. Bioinformatics shows mainly a steady expansion with a similar geographical spread.

2.2. How many people do chemometrics?

The preceding discussion may appear unduly pessimistic: is the profession of a chemometrician dying out? Has chemometrics been kicked into the long grass? The following is an attempt to obtain some numerical data about who is using chemometrics. Most science involves educated guesses that then lead to estimates and hypotheses, so the numerical information here represents an approximation.

A possible indicator is given by the total volume of publications including some chemometrics. This author has been asked by over 50 journals to review papers including some level of chemometrics. A guess is that there are around 100 journals that publish at least 10% of papers including methods that could be regarded as involving chemometrics, which may only involve a principal component (PC) scores plot or partial least squares (PLS) calibration: assume that the average proportion of papers including such material is 15% (most will publish just a few such papers containing chemometrics). If the average annual number of papers published in such journals is estimated at 300 (there are a few such as Analytical Chemistry and Journal of Chromatography A that may publish over 1000 in a given year, but most are of low volume), then of approximately 30 000 papers published in these journals, we can assume 4500 contain some recognisable chemometrics content. Add to this papers scattered in journals that only occasionally publish chemometrics, and we arrive at a guesstimate of around 6000 papers per year that contain some level of chemometrics. This may range from a very simple graph to highly sophisticated theory.

To test that this estimate is approximately correct, we look at citations to the well-known paper by Geladi and Kowalski on PLS [25], currently cited around 2600 times in total and so a widely recognised article. If we assume there are 6000 papers involving chemometrics, then we estimate that around 1000 use PLS. The topic ‘partial least squares’ in the Web of Knowledge yields 1988 papers for 2012, but many are categorised as outside what would be recognised as chemometrics; for example, only 319 are categorised as being in the area of analytical chemistry. PLS, although widely accepted in analytical chemistry, is not the most widespread method reported in papers, which is PC analysis (PCA). We then assume that between one fifth and one tenth cite Geladi’s paper (many users of PLS will be unaware of the fundamental literature and, if they are forced by referees or supervisors to cite a PLS paper, will often pluck one from thin air or perhaps cite some other book or paper that is also fundamental), and we arrive at an estimate of between 100 and 200 citations for 2012. In fact, 166 citations are recorded for this paper. This all suggests we are likely to be in the right region.

We can also see how citations to this paper change over the years (Figure 3) and see that, in contrast to the fairly static number of papers published in the core literature, the number of citations per year is increasing, suggesting more people are using these methods. Incidentally, the number of

[Figure 2: two bar charts of the number of papers by country in Bioinformatics, one panel for 1992 and one for 2012.]

Figure 2. Distributions of papers in Bioinformatics. For 2012, only countries with 10 or more papers are represented. In addition, the journal changed its name in 1998 (but is the same underlying journal).
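
The back-of-envelope estimate in Section 2.2 can be retraced in a few lines. The sketch below only reproduces the arithmetic quoted in the text; the variable names are illustrative and not the author's, and every input is one of the author's stated guesses rather than measured data.

```python
# Retrace the Section 2.2 guesstimate using the figures quoted in the text.
# All variable names are hypothetical; the inputs are the author's assumptions.
journals = 100            # journals regularly publishing some chemometrics
papers_per_journal = 300  # assumed average annual papers per journal
share = 0.15              # assumed fraction of papers with chemometrics content

all_papers = journals * papers_per_journal     # about 30 000 papers
core_estimate = round(all_papers * share)      # about 4500 with chemometrics

pls_papers = 1000         # author's estimate of PLS papers among ~6000 total
low = pls_papers // 10    # if one tenth cite Geladi and Kowalski
high = pls_papers // 5    # if one fifth cite Geladi and Kowalski
observed = 166            # citations recorded for 2012, as stated in the text

print(all_papers, core_estimate)
print(low, high, low <= observed <= high)
```

The observed citation count falls inside the predicted range, which is the consistency check the text relies on; it does not validate the individual assumptions, only that they are mutually plausible.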


papers published in Bioinformatics by year correlates, with a value of R = 0.81, with the number of citations to [25], suggesting a steady and universal growth in uptake of such methods.

[Figure 3: plot titled 'Citations to Geladi and Kowalski'; x-axis: Year (1985 to 2010); y-axis: Citations (0 to 180).]

Figure 3. Citations per year to Geladi and Kowalski’s paper [25].

We now take a jump and suggest that the average number of unique authors who would regard themselves as users of chemometrics methods, who contribute to each paper involving some chemometrics, is two. Often, there are lone chemometrics users who are the laboratory data analysis expert, but in some papers, there are teams. Also, some authors publish more than one paper per year. However, this calculation suggests an estimate of 12 000 unique chemometrics authors of papers currently actively publishing in any one year. Another assumption is that about one tenth of all users of chemometrics publish a relevant paper in any year. Many will be working in industry or government laboratories or as consultants. Some will be students not yet ready to publish their work. Others will publish only occasionally or have their papers rejected or not be in an environment where it is encouraged to publish or may be instrument operators (e.g. near-infrared spectroscopy (NIR)) who almost automatically encounter chemometrics in their work. This results in an estimate that there are, at any time, around 120 000 researchers worldwide using some level of chemometrics and suggests that, over the years, there has been an explosive growth in the number of people encountering chemometrics in their research in some form or another. This number excludes students in courses that are not using these techniques for hands-on research.

Hence, whereas there has been a huge growth in use of chemometrics techniques, there has been limited change in the number of experts or people that would regard themselves first and foremost as chemometricians. Conference attendance is another indication of how many experts there are. We can make estimates of the numbers. The biggest international conference currently is Chemometrics in Analytical Chemistry, whose attendees may typically number 300 active scientists (we exclude managers, salespeople and so on who are not currently involved in research). The Scandinavian Symposium on Chemometrics is traditionally regarded as the other major conference, but with attendance in 2011 at 124 according to the editorial in Journal of Chemometrics [26], the conference is not so large. However, assume there is one major conference a year with attendance averaging 250 (a recent conference, the International Conference on Perspectives in Chemometrics in India, exceeded this by a long way, but not all attendees were active scientists in this area). There are then a number of other international conferences, including the well-established series in Hungary, and dependent on year, conferences in Russia, Southern Italy (with an emphasis on cultural heritage applications), Turkey, North Africa and a few others. These conferences tend to attract anywhere between 25 and 150 delegates dependent on year. Typically, around three such meetings take place a year, resulting in another 350 or so delegates, or a total of 600. Taking into account that a few people attend more than one chemometrics conference per year, we arrive at around 500 delegates on average who are active scientists and attend at least one dedicated conference in chemometrics considered international per year. But there are also a number of national and regional meetings, typically attended by between 20 and 50 delegates, and we will assume that this doubles the total number to about 1000 a year. We will not count sessions in larger conferences or courses but restrict this core to those that have taken the time and trouble and paid the money to attend at least one dedicated day or even week conference on chemometrics.

The number of people attending conferences in this area has remained quite static over the years. Although new series of meetings are emerging, often advocated by one or more lead figures, others are ceasing. Important series of meetings that are no longer organised include COBAC, COMPANA, Czech chemometrics conferences and the Gordon conference. Attendance at the ICRM and Scandinavian Symposium on Chemometrics is if anything declining. However, this is counterbalanced by new conferences starting up.

In most chemometrics conferences, however, very few delegates are experts. Even many presentations are given by non-experts. Most attendees do not have a strong grasp of statistical fundamentals of the discipline and are often happy to admit this. There unfortunately is no corpus of knowledge that is agreed to be essential for people to do chemometrics. However, it is this author’s firm view that someone using a method should understand the basic principles of the relevant method. For example, if using an F-statistic, it is useful to understand the general idea of an F-distribution in qualitative terms. In turn, most confidence limits, for example, in SIMCA use F-tests: hence many in chemometrics are inherently using such distributions. Look at the questions of Table I. What proportion of delegates in a typical chemometrics conference could answer these correctly? This author would suggest it is only necessary to answer questions related to methods that are being used by the relevant researcher; even this author probably could not answer technical questions correctly about areas he has not worked in but would expect any competent student or scientist to be able to give simple answers to questions about methods he or she is using, for example, in papers or reports or presentations. And generally, a professional chemometrics expert should perhaps be able to score 7 out of 10 (if there were an agreed corpus that would be higher). But it is anticipated that only between 10% and 20% of delegates of a typical conference would pass this test, even though many might be claiming to use methods that fundamentally depend on topics covered. Contrast this, for example, to a conference on organic synthesis where every delegate could answer what an ester is or what an SN2 reaction is or identify electrophiles or know the difference between R and S, or a chromatography conference where every delegate would understand what an eluent is or what a mobile phase is or what a k′ value is or the difference between isocratic and gradient elutions. Perhaps, therefore, 20% or fewer delegates in international professional chemometrics conferences would be at an expert level. This is partly because there is no agreed knowledge base for the subject.

If we assume that 10% of all expert chemometricians attend a conference in any one year (many who are in industry or in


countries and groups where travel money is sparse or who do not have the time or are also involved in other areas will not regularly attend a conference every year) and between 10% and 20% of attendees of a conference could be classified as experts, we reach a worldwide estimate of between 1000 and 2000 experts internationally. This should be compared with the estimate of 120 000 researchers worldwide using some sort of chemometrics, that is, somewhere between 0.8% and 1.6%. This rather unusual situation is in contrast, for example, to synthetic chemistry where in practice anyone who goes into a professional research laboratory to make new compounds has a basic and largely agreed knowledge.

Table I. Some questions that can be used to test the general knowledge of a chemometrics expert

1. Explain how the standard normal distribution, chi-squared distribution, F-distribution and t-distribution are related
2. If two variables have a (Pearson) correlation coefficient of 0, does this imply that they are independent and why?
3. A standard deviation is often used to measure the distance from the mean of a set of samples or measurements when there is only one variable. What equivalent measure is used when there is more than one variable?
4. Principal components analysis is performed on a dataset and the PC axes are at right angles to each other; however, the product of the PC scores of different components is not zero, why might this happen?
5. It is common to use 99% confidence limits, for example, to determine if a sample is an outlier or belongs to a predefined group. In order to model these limits confidently for a distribution, approximately how many measurements should you make?
6. PC scores plots of two groups suggest that in practice they are indistinguishable, but a PLS-DA scores plot suggests there is a substantial difference. Why is this and which is the most meaningful way of presenting the data?
7. Why is it generally considered there must be fewer variables than samples when determining the Mahalanobis distance from the centre of a distribution, and how can this limitation be overcome?
8. Why can multivariate statistical process control be considered a classification problem?
9. What is the relationship between fractional factorial and Plackett–Burman designs?
10. For a univariate normal distribution, the proportion of samples found increases the closer they are to the mean. Is this necessarily true for a multivariate normal distribution? Explain simply.

Answers (all qualitative and none requiring equations) are in APPENDIX 1.

2.3. Friday afternoon chemometricians

The gap between users and experts is a particular problem for chemometrics. In bioinformatics, in contrast, there is not likely to be such a gap. Almost every delegate in a bioinformatics conference or every computational author of a bioinformatics paper is likely to possess a basic level of expertise. This gap is a particular feature of Western countries where chemometrics was born. Whereas there is still a small existing corpus of experts, which has hardly changed in size and perhaps contracted, there is in contrast an enormous growth in the number of users. Many laboratory scientists view data analysis as a hobby, the last thing performed when writing up a paper or report on a Friday afternoon. They often spend months or years acquiring data and at the very last minute produce some sort of statistical analysis to include in their paper to please referees and editors.

Unfortunately, when papers are submitted to journals including chemometrics, expert referees are few and far between. Many are overburdened, and many do not have the time, and so only a small portion of papers in the 100 or more journals containing chemometrics are actually refereed by an expert in data analysis. And if a set of authors is rejected by a journal on the basis of the chemometrics content of their article, if the paper is an application-based paper, they will usually submit to another journal. In fact, application-oriented journals often have higher impact factors than more theoretical journals, but in many cases, the standard of statistical analysis required is lower in more applied journals. Hence, a paper that has been rejected by a lower-impact-factor, more theoretical journal may be accepted by a higher-impact-factor, more applied journal. As the journals become more applied, papers tend to be judged less by the quality of data analysis and more by the interest of the applications.

In the biological literature, statistics is routinely used but on the whole misunderstood. Writing in Nature, David Vaux [27] states:

    …it is still common to find papers in most biology journals contain basic statistical errors. In my opinion, the fact that these scientifically sloppy papers continue to be published means that the authors, reviewers and editors cannot comprehend the statistics, that they have not read the paper carefully, or both. Why does this happen? Most cell and molecular biologists are taught some statistics during their high school or undergraduate years, but the principles seem to be forgotten somewhere between graduation and starting in the lab. Often, the type of statistics they learnt is not relevant to the kinds of experiment they are now doing. And, once in the lab, people generally just do what everyone else does, without always understanding why.

Even (or especially) high-impact journals such as Nature have had problems with statistical assessment of data over the years. In an editorial in April 2013, the editors state [28]:

    Over the past year, Nature has published a string of articles that highlight failures in the reliability and reproducibility of published research (collected and freely available at go.nature.com/huhbyr). The problems arise in laboratories, but journals such as this one compound them when they fail to exert sufficient scrutiny over the results that they publish, and when they do not publish enough information for other researchers to assess results properly.

Chemometrics suffers from the same problem.

Why is there this sort of difficulty with chemometrics, whereas it is hardly a problem in, for example, bioinformatics? And why can it not be policed? There are so many difficulties in the current
usage of chemometrics methods. As an example, many people apply statistical tests that are based on the assumption that data are normally distributed, yet much data in practice, especially in growing areas such as metabolomics, fail normality tests abysmally; these tests often are not available in the relevant packages, so the users have no idea that the statistical methods they use are inappropriate. Many people suffer from low sample-to-variable ratios; many classical approaches for discrimination were developed and first applied (safely) to datasets where the number of samples far exceeded the number of variables. The possibilities of overfitting when the number of variables exceeds the number of samples are enormous; classical approaches were not devised for these situations. The mean of a dataset is often taken very seriously, yet means are very prone to distortion if there are outliers, especially for small datasets; in biological datasets, it is easy to obtain outliers. Plots such as PLS discriminant analysis (PLS-DA) scores are often preferred in papers to unsupervised PCA plots if they appear to show discrimination as they better back up presupposed hypotheses: people often prefer to publish ‘nicer looking graphs’ even though false separations can be obtained using purely random data. Methods such as PLS-DA are often treated as integrated methods when in fact data scaling, centring, decision threshold and choice of number of components are integral to performance: if only one component is chosen, under certain circumstances, PLS-DA is identical to Euclidean distance to centroids, and with all non-zero components, to linear discriminant analysis (LDA). How many understand this or even realise what decisions have been made before they have their answer? Yet there are those who will compare the performance of PLS-DA with that of LDA, not even realising they can be one and the same thing and very rarely describing the steps in adequate detail to make their paper at all meaningful or useful. There are many papers that ‘compare methods’ and regard incremental differences in performance as significant: one method may have a 90% success rate and the other 93%, so the latter is regarded as better. Yet the performance of methods critically depends on the data that are used, how the validation is carried out, how an independent test set is formed, how the data are pretreated and so on. Many papers claim that a new method is ‘better than any existing method’, yet there is no single optimal universal method: the success of an approach depends on data structure, how much is known about the data in advance and how the method was optimised and validated, and often different approaches cannot be directly compared. The problems of poor experimental design and sampling strategy are also serious, but there is no room in this paper to list them in full. There are so many common problems that are buried deep in the literature and are unlikely to go away.

An expert, of course, could identify these pitfalls, but it often takes many years of practice analysing data to be in such a position. Would we want a surgeon operating on us if he or she had not had extensive training? Would you want the computer systems of banks to be programmed by people who had little experience? Would you want an aircraft piloted by someone without adequate training? Yet data analysis is often an essential part of complex scientific projects, sometimes with medical or environmental or industrial significance, and is frequently handled by people with limited experience.

The problem is that many people think they can master chemometrics in a very short period. There are many packages available, and many labs think that by acquiring a package, they know everything. There are also a lot of companies that offer one-day or two-day courses in chemometrics and suggest that if one pays a fee, one automatically will acquire adequate knowledge. Their boss may be swimming with data, has no idea what to do, points to a member of his or her group and decides to pay for them to go on a course. This person, usually someone good at spreadsheets or who enjoys sitting in front of computers, automatically becomes the lab ‘guru’ and then considers himself or herself a chemometrics expert and a font of wisdom. These people may be offered co-authorship of papers that contain a little data analysis and then are not only considered experts by others in the group but start to believe it, sometimes putting chemometrics as a skill on their curriculum vitae or applying for jobs involving chemometrics. Would a medical doctor be allowed to apply for a job after just a few lectures and maybe one or two laboratory classes? He or she would not only have to pass a very rigorous course but also spend several years in a hospital gaining experience under the supervision of fully expert professionals.

In contrast, in synthetic organic chemistry, this could never happen. No one in a professional synthesis laboratory would be allowed near a fume cupboard unless they had demonstrated a significant corpus of knowledge. But software companies and trainers often find that the easiest way to earn money is to try to fast track laboratory staff to pick up chemometrics in a few hours.

The big growth of chemometrics in Western countries over the past two decades has been the ‘Friday afternoon’ chemometricians who are part of a group that is primarily experimental. The Friday afternoon chemometricians do precisely this; at the end of the week or the project, they spend a couple of hours generating a graph that they give their manager, who then puts it in a project report or paper. Three decades ago, the majority of chemometrics users were experts. In some ways, chemometrics is a success in that the number of people having access to the main method has grown enormously, maybe a hundredfold, in developed economies. Metabolomics has been an especial flashpoint. Whereas there are definitely some well-established experts, the majority are not. This gap is a serious one, and in the next section, we will look at what happened in the middle years to create this problem.

3. CHEMOMETRICS IN THE MIDDLE YEARS: MONEY TALKS

3.1. Academic finance

Chemometrics emerged in the 1960s and 1970s when academia was organised quite differently to now. This author will illustrate this by developments in the UK, which parallel changes in the financial organisation of most Western nations’ universities.

In the UK university sector, there were only 3273 higher degrees (mainly PhDs) awarded in 1960, compared with 194 270 in 2012 (inclusive of masters), a 60-fold increase [29]. The number of institutes calling themselves universities also tripled over this period. In the 1960s and 1970s, it was a rare distinction to be working and studying at a university. Most universities in the UK, although constituted as private institutes, in practice received most of their funding from the state. With a much smaller university sector, state funding could keep up with modest student numbers.

In addition, people’s aspirations were limited. For many years after the Second World War, a priority was a roof over one’s head, enough food and maybe a little left over for cheap beer in the college bar or port at formal meals. Colleges had a concept
of fellows for life, who had a room in college and regular meals, in return for a lifelong commitment to the college, based on teaching and some other duties such as running the boat club. University academic pay only became truly professionalised from the late 1940s in the UK. In 1948, it was stated:

    The University (of Cambridge) had already accepted in March 1948 the proposal of the General Board that a university lecturer should receive a ‘prime stipend’, which should be adequate payment for teaching work and sufficient to ensure to him the time for study and research. [30]

Whereas some scholars depended on their stipends for survival, others had independent, often family, means and could view university activities more as a hobby, relying on their relations or inheritance or shares to fund what sometimes was quite an affluent lifestyle. In that era, housing costs were modest relative to wages, so better-off families could purchase good housing, but they would not rely on their university pay for this purpose. University lecturers were just grateful for their position and, if they did not have the money, would have anticipated the university to look after them. Postgraduate students similarly only expected a small stipend and might typically supplement their income by college teaching or outside work. Grants to study for PhDs were extremely limited and mainly in the form of very competitive scholarships. Research students expected no more than bread on the table, a roof over their head and a cheap college bar to while the evening away. Although this era was starting to feel distant when this author started postgraduate work in the late 1970s, still most of the more senior staff had been brought up in this environment and basically ran academia. In 1964, when the new University of Cambridge Chemistry Laboratory was built, there was a mysterious railing in the car park: this was because a prominent member of the staff (who did some pioneering work in organic structure elucidation) came to work riding a horse and needed somewhere to tie his horse up during the day. This author attended his lectures, which were very entertaining, during the 1970s.

All these factors contributed to a very different financial feel during the 1960s and 1970s in UK universities. Academic and postgraduate pay, and financial expectations, were low. There were fewer universities and students to support than now. PhD students did not expect to buy cars or go on foreign holidays or live in private houses. The pool of researchers needing funding was quite limited. Most grant committee members knew each other and knew who was a good researcher. Universities in the UK, as in many countries outside the USA, relied primarily on government funding.

Obtaining finance for research was not considered the top priority of an academic in the 1960s. If an academic did not need funds and could develop his or her own ideas with minimal money, this was considered enough: to write good papers and graduate good students and give good lectures was considered sufficient. A few senior figures did fight for grants, for example, to acquire equipment such as NMR spectrometers, but even then, it was not such a struggle. This author remembers attending a lecture in the 1990s by the late Nobel Prize winner Professor Sir Geoffrey Wilkinson, who recalled that in the 1960s, he wanted an NMR for his department (Imperial College, London): he was told to go back and write a one-page piece of A4 as to why it would be useful. Well, it was useful and catalysed a whole fundamental area of organometallic chemistry research. And even later, when there was more grant-based funding, accountability was not so important: this author remembers starting research in Cambridge in the late 1970s and a young member of staff, who rose to significant prominence in UK chemistry and wrote many highly regarded papers, decided after the first few months of a postdoctoral grant to change the subject area for their postdoc to work in; this led to a major line of research and significant innovation and had no negative influence on the career of this academic at all. When starting in Bristol as a staff member in the mid-1980s, this author came across an academic who regularly received grants and decided that another project was much more interesting than the one on the original proposal and talked to his postdoc who moved into this new field, and the staff member did the original project (perhaps in a less comprehensive way) himself, leaving the postdoc to move in another more interesting direction, unrelated to the original proposal.

Because of the expansion of grant-based funding and increased cost of research, in part because scientists and researchers expected a more affluent lifestyle than previously, in Western countries, the competition for funding became fierce. All universities in the UK are now expected to do research, yet core government funding is limited, so they all struggle to obtain funds. An academic is now judged not just by excellence in teaching or publication or graduating students, but by how much money they can generate. Research areas that do not generate funding are often neglected. Young academics seeking a career path look for areas that are cash rich. This trend is important for almost all scientific research in countries in Northern Europe and North America. A postdoc, for example, may not be essential to carry forward a project, but he or she not only generates a grant that looks good on a university’s balance sheet, but often additional indirect money that can be spent on running courses or constructing new buildings or employing administrators. Hence, money talks even if it is not needed.

This enormous financial culture change in Western countries spans the era when chemometrics was born and the present day and can be used to understand the development of the subject. Other established areas, such as organic synthesis or quantum chemistry, were born in an earlier age and were already established academic disciplines by the 1960s so did not have to develop under such a changing environment: departments and committees and studentships were already established in these areas by the 1970s. Academics tend to think in boxes; they would need to set up a new box if chemometrics was to gain general recognition as a discipline in the future.

3.2. How chemometrics was financed

A small number of pioneers of chemometrics in the 1970s and 1980s were good at obtaining funding. During this era, centres, software companies and a few large groups, funded primarily by industry, were set up. Applications such as multivariate statistical process control (MSPC) and NIR, especially in the petrochemical and food industries, represented a real, near-market need. Hence, pioneers followed the money, and a significant portion of papers over that era in what we would now recognise as chemometrics were published to tackle these economic problems. Some of the earlier applications such as NMR spectroscopy and chemical pattern recognition, whilst still important during that era, received much less attention in chemometrics conferences and journals. Still, there are some who consider chemometrics
and NIR as almost interchangeable: cheap online industrially useful NIR probes were also being developed at the time, so this was a natural symbiosis.

The problem with this is that industrial funding tends to be very fickle. Industry rarely funds institutes. It usually funds small short-term projects. Industries will buy into a centre if they are convinced that the centre will help solve their problems, but the centre is still dependent on keeping money coming in. Most of the industry-funded initiatives were dependent on one or two front-line figures and rarely involved more than a couple of established staff in any one organisation. Most centres lived and died with their founders and so had a limited lifetime. Industry often supports individual research leaders but rarely departments. For a university to have the confidence to set up a department, with a permanent staff of maybe 5–10 (such as organic chemistry or analytical chemistry), it needs to be confident in its long-term prospects, especially financially. Many decades ago, departments could be established based on pure academic need, but this is now rare. A university is unlikely to establish a department based on a single front-line figure, unless he or she has very senior status, such as a Nobel Prize winner. So although there were several significant research groups, consisting of between one and three academic staff and between 10 and 20 research students, postdocs and visiting fellows in the 1970s and 1980s, these have not metamorphosed into established departments. The number of front-line figures in chemometrics who are also able to raise very significant funds has stayed approximately constant over the past few decades at around six or seven worldwide at any time; when old ones retire or die or move into other pursuits, new ones replace them.

A subject such as bioinformatics has not suffered such a fate. In the 1980s, bioinformatics arguably was smaller in size compared with chemometrics, but much has changed with the human genome project. This put genetic profiling firmly into the mainstream, with the associated need for data mining and pattern recognition. The subject rapidly gained respectability. Importantly, the subject could attract significant non-industrial, especially government, funding. This is partly because it tapped into a huge biomedical research base. Enormous sums of money were invested in genetic profiling in the search for diagnostics, therapeutics and control of diseases such as cancer, and vast departments and institutes were rapidly established. Universities and research institutes see that this area has a solid long-term potential, and also students and early-stage researchers see that money and so careers and jobs are likely to remain available for the long term. Therefore, departments, funding committees and courses in bioinformatics have a long-term basis. Industry also funds bioinformatics research, but if industry is to be persuaded to part with significant funds to support academics, it usually likes to co-fund initiatives: the academic base will be there before and after the, often, short-term industrial funding, so they can dip in and out of established departments and groups. They would rarely want to completely fund a significant external initiative over a long period, and if the need arises, they would usually prefer to employ staff within their own laboratories. Hence, chemometrics has largely failed to develop established broadly based academic departments, in contrast to analytical chemistry or statistics or bioinformatics or organic chemistry, and is centred around a small number of lead figures who are good at obtaining money, forming research groups that live and die with these lead figures.

3.3. Why is there limited mainstream finance?

One early and promising lead towards academic respectability for chemometrics was via statistics. In the first few years, there were several academic and industrial statisticians who published in the chemometrics literature and contributed to conferences. Jerome Friedman is an example of a leading established multivariate statistician from the USA. In the UK, there was a long-established tradition of agricultural statistics, and in Scandinavia, the statistician Herman Wold influenced his son. Many early papers by chemometrics experts were published in the journal Technometrics. However, over the last few years, there have been few statisticians active in chemometrics: go to any modern chemometrics conference, and it is rare that there are any presenters who view themselves primarily as statisticians. This contrasts, for example, with biometrics where almost all active researchers regard themselves as trained statisticians.

This has had significant knock-on consequences for funding. Chemometrics is regarded by many academic statisticians as ‘second-class statistics’ and so not sufficiently novel to attract such funding. In fact, chemometricians should regard their subject as ‘first-class chemometrics’ and very innovative and challenging but complementary to and separate from mathematical statistics. A statistician has a knowledge base that he or she would regard as essential for safe professional practice, rather like an organic chemist having a required knowledge base before he or she can enter a professional laboratory and a medical doctor having mastered a commonly accepted range of techniques. The chemometrician does not need a full range of statistical techniques and as such may be regarded as lacking by the statistician. This under no circumstances means that a chemometrician can ignore sound statistical principles when necessary. It is essential to validate models. Good statistical experimental design is often necessary. Understanding distributions is necessary when performing statistical tests. Chemometricians need to understand that there is no such thing as a single universal optimal method. But to safely practise chemometrics, only part of traditional statistics is needed. Chemometricians though also need other skills. They need to understand where the limitations are in the analytical process. They often need to be able to improve the quality of analytical signals. They need to compromise between sample sizes, analytical reproducibility, well-designed experiments and optimising methods used for multivariate analysis and put each step in perspective. Hence, there is such a thing as ‘first-class chemometrics’, and this is not second-class statistics. But chemometrics does not normally sit easily within academic statistics although there is a relationship, and so statistics funding bodies often do not view chemometrics as sufficiently fundamental, and as such, there are very few well-financed pure chemometrics projects in academia.

Within applied science, such as chemistry and biology, the use of chemometrics methods such as PCA and PLS is increasing rapidly. But the problem here is that it is hard to justify funding pure chemometrics projects. Most application scientists view data analysis as the last thing that is carried out at the end of a project and cannot visualise this as an end in itself. A proposal in this area may compete with a synthetic chemistry or mass spectrometry or inorganic catalysis project, and with intense competition, the chemometrics proposal is likely to fail. Chemists in particular cannot understand why they should fund someone to do data analysis. Instead, what has happened is that applied scientists try to do chemometrics on the cheap, by picking a student or
laboratory member who then ‘learns a package’ and becomes the data analysis guru. It is not really possible to properly pick up chemometrics in a few afternoons; proper data interpretation is highly skilled and takes years of experience and practice. But if there are few funds available for someone to take 3 or 4 years mastering this, the best that can be done is to spend a few afternoons.

Commercial software companies advertise their all-purpose packages; just pay for a licence, go on a course of a few afternoons and all will be solved. Hence, a supervisor or manager looking for some chemometrics, rather than pay for 3 or 4 years’ training such as via a PhD, is likely to take the cheap option. Because it does not take long to learn to use a package, put some data in and produce a PC plot, which, for example, may be used in a paper, the supervisor feels his group or section has obtained adequate chemometrics knowledge by this route for their purpose. The laboratory data analysis guru often does not have the time for in-depth study; often, this is a small part of a PhD or a project, and they have to move on to produce papers, apply for the next grant, write a thesis, submit a CV for the next job and so on, hence the growth of the Friday afternoon chemometrician. Most funding bodies do not see chemometrics as a subject in its own right, just as part of a larger project. This has resulted in the slow decline of dedicated chemometrics groups, and the huge growth of the part-time chemometrician. Because most academics are judged by the funding they bring in (in the Western world), they are under pressure to reinvent themselves in a highly funded application-oriented area, where they can obtain significant funds.

There is little that can be done. There are no strong internationally accepted charter marks for chemometrics experts, unlike, for example, statisticians or chemists, and it is financially cheaper to hire a Friday afternoon chemometrician than pay the salary of an expert. Obviously, there continue to be a few excellent dedicated and well-funded chemometrics centres and many outstanding prominent individuals, but as discussed earlier, these have not significantly increased in number over the past couple of decades. Yet there are a very large number of people who claim to be chemometricians whose main experience might involve learning to use a commercial package to obtain a PC scores plot or perform a PLS calibration.

3.4. Where is the funding now?

Most scientists in academia now have to follow the money. The tendency for most large grants is to involve collaborative work, so most are attracted into collaboration. A good chemometrician should be a good collaborator of course, but many large collaborative grants are often unsuccessful in achieving their aims, but

    and keep churning out papers. He said: ‘It’s difficult to imagine how I would ever have enough peace and quiet in the present sort of climate to do what I did in 1964.’

Most Nobel Prizes are awarded for work performed many decades previously, and as such, it will be interesting to see whether the modern tendency towards large collaborative projects will result in so many demonstrably original discoveries and ideas.

Whereas there is not much funding for chemometrics as a pursuit in its own right, there are some areas, notably metabolomics, where funding opportunities are significant. Whereas in the 1980s, many chemometrics experts followed the money into NIR and MSPC, now there is a mass exodus into metabolomics. Following previous waves of interest in genomics and proteomics, metabolomics has become a trend of the day. Hence, many with either in-depth or Friday afternoon knowledge of chemometrics are seeking money from metabolomics, which is heavily funded by medical research councils, charities and governments, especially in the chase for biomarkers to be used for diagnostics of disease. Despite this huge funding, in real-life clinical practice, there has been little or no success in changing modern-day medical practice.

There is and always will be significant public interest in medical advances; medieval and classical medicine was based on what we regard as misplaced theories of humors, and we would not now view these often rather brutal approaches as suitable in the modern day, yet people believed in them and researched them and wrote about them and paid money to medical practitioners. Who knows whether the current work in metabolomics will lead to real medical advances, or whether it will be left behind in centuries to come, a rather unusual historical curiosity? There are a few chemometrics experts working in this area, who are doing demonstrably excellent work and have moved there as it attracts funding (the work on plants has a better chance of realistic success as cloning and environmental conditions are much easier to control, but its extension into medicine is far more debatable). But the majority of studies are dangerous, carried out by Friday afternoon chemometricians, often using poor sample sizes, not understanding problems of validation and so on. It is very easy, for example, to obtain a separation between two groups using PLS-DA and random data [32]; with a small sample-to-variable ratio using a training set, false separations can be very easily generated. A project team keen to publish results to demonstrate that their work was useful for their sponsors, which has access to any number of commercially available packages, will choose the graph that appears to show their hypothesis in its best light. In many cases, referees (who might be medically or biologically trained although they perhaps took
successful in bringing lots of groups together and pleasing a module on statistics) will not understand much about valida-
politicians and funding bodies. tion, especially in application-oriented journals, and the authors
Modern grant-funded research creates academic pressures will use this as justification for publication, which is then treated
that militate against long-term fundamental research, as this as a fact. A careful analysis of the data that conversely suggested
report in the UK the Guardian comments when interviewing that there are no significant trends in the data, perhaps a much
the recent Nobel Prize winner Peter Higgs [31] suggests. more significant and authoritative paper, is unlikely to be pub-
lished: few journals like to publish negative results even if these
Peter Higgs, the British physicist who gave his name to the could be very significant; a negative result is seen as a failure
Higgs boson, believes no university would employ him in although it may well have been the outcome of meticulous
today’s academic system because he would not be consid- experimental planning. Indeed, in medicine and biology, it is
ered "productive" enough. … He doubts a similar break- very common for there to be numerous publications backing
through could be achieved in today’s academic culture, up a hypothesis for which there is no real statistically sound ev-
758

because of the expectations on academics to collaborate idence, often because this is a convenient way to obtain funds,

wileyonlinelibrary.com/journal/cem Copyright © 2014 John Wiley & Sons, Ltd. J. Chemometrics 2014, 28: 749–760

and chemometrics if mishandled can appear to support almost any hypothesis so can be widely misused without anyone noticing. The medical experts on panels will often continue to fund these lines as the papers published appear promising (even though the data analysis may be misleading). If an investigator reported what they viewed as a negative finding, whilst perhaps being very meticulous, they may never receive further money in this area, and very possibly their career and group may be finished.

Hence, chemometrics, for financial reasons, is in danger of being buried within application science, especially metabolomics. There are and always will be a very small number of experts who will have abandoned mainstream chemometrics to follow the financial opportunities, but the vast majority of papers are in practice published by those with limited expertise. Hence, this trend accounts for the relative reduction in core chemometrics capability in Western countries whilst resulting in a huge expansion in the number of Friday afternoon chemometricians.

3.5. Eastern nations

As has been discussed earlier, whereas core chemometrics capabilities have declined slowly in Western nations, such as those of Northern Europe and North America, to be replaced by a significant growth of part-time chemometricians, this trend is not so evident in Eastern countries.

A good example is the enormous growth in chemometrics capabilities in Iran [33] over the past two decades. University funding there is still quite similar to the UK mechanism of the 1970s. Academics are not primarily evaluated by their income-generating prowess, but by the quantity and quality of papers and by how many students they graduate. Students often study for the sake of knowledge. Hence, a new area such as chemometrics can develop its own disciplinary structure without worrying too much about funding, and there are many strong centres and established academics working in the area. So long as the funding mechanism for universities remains as it is currently, we may see activities in core chemometrics growing in Iran and other nations, especially in Asia, whilst declining in Western nations.

Interestingly, trends in Southern Europe are intermediate between those of Western and Eastern nations, probably because these countries still have a relatively poor national funding base and a comparatively low industrial research base but at the same time are involved particularly in European projects that have the possibility of lifting their university system towards the North European model.

4. WHAT OF THE FUTURE?

A historical perspective can look primarily at the past, although it is possible to glance into a crystal ball. This article has primarily focussed on how chemometrics has changed in the last few decades and offers a financial interpretation of its development. Chemometrics has not moved in the way that may originally have been envisaged. This was probably due to the way the subject was funded in the 1970s–1990s in Western nations, with a focus on short-term industrial grants rather than establishing a solid academic base. There has however been a very rapid growth in users, packages and companies. Perhaps the bubble will burst. If some enormously well-funded biological or medical project develops a test or diagnosis or product based on faulty data analysis or experimental design, with serious health or environmental consequences, this may result in a retreat to the basics. Core chemometrics, despite its failure to capture large funds and to develop a large and permanent academic base in Western countries, is nevertheless very well organised, with properly acknowledged textbooks, conferences and journals. It does not need to be rebuilt from the beginning; all the foundations are there, and the practice has been developed into a fine art. Hence, we cannot tell whether there will be a reaction against the Friday afternoon chemometrician, and whether expertise will again be universally respected. Data mining is an immensely skilled task requiring years of experience and training, and there needs to be financial and academic acknowledgement that this sort of expertise is a requirement for the interpretation and analysis of large multivariate datasets obtained from laboratory instrumentation. We would not want a surgeon to perform an operation without adequate experience and training, so why should we lower the expectations in chemometrics?

The future will be interesting and unpredictable. As usual, human, financial and political events may determine the survival and development of the subject.

REFERENCES

1. Lederberg J, Sutherland GL, Buchanan BG, Feigenbaum EA, Robertson AV, Duffield AM, Djerassi C. Applications of artificial intelligence for chemical inference. I. The number of possible organic compounds: acyclic structures containing C, H, O, and N. J. Am. Chem. Soc. 1969; 91: 2973–2976.
2. Gray NAB. Dendral and meta-dendral—the myth and the reality. Chemom. Intell. Lab. Syst. 1988; 5: 11–32.
3. Corey EJ, Wipke WT. Computer-assisted design of complex organic syntheses. Science 1969; 166: 178–192.
4. Lee TV. Expert systems in synthesis planning: a user’s view of the LHASA program. Chemom. Intell. Lab. Syst. 1987; 2: 259–272.
5. Gavroglu K, Simões A. Neither Physics nor Chemistry: A History of Quantum Chemistry, MIT Press: Cambridge, MA, 2012.
6. Hehre WJ, Lathan WA, Ditchfield R, Newton MD, Pople JA. Gaussian 70 (Quantum Chemistry Program Exchange, Program No. 237, 1970).
7. Hansch C, Leo A. Substituent Constants for Correlation Analysis in Chemistry and Biology, John Wiley & Sons: New York, 1979.
8. Selassie CD. History of quantitative structure–activity relationships. In Burger’s Medicinal Chemistry and Drug Discovery (6th edn). Volume 1: Drug Discovery, Abraham DJ (ed.). Wiley: New York, 2003.
9. Malinowski ER, Weiner PH, Levinstone AR. Factor analysis of solvent shifts in proton magnetic resonance. J. Phys. Chem. 1970; 74: 4537–4542.
10. Jurs PC, Kowalski BR, Isenhour TL, Reilley CN. Investigation of combined patterns from diverse analytical data using computerized learning machines. Anal. Chem. 1969; 41: 1949–1953.
11. Jurs PC, Isenhour TL. Chemical Applications of Pattern Recognition, Wiley: New York, 1975.
12. Massart DL, Janssens C, Kaufman L, Smits R. Application of the theory of graphs to the optimization of chromatographic separation schemes for multicomponent samples. Anal. Chem. 1972; 44: 2390–2393.
13. Kowalski BR, Brown SD, Vandeginste BGM. Editorial. J. Chemometr. 1987; 1: 1–2.
14. Wold S. Spline functions, a new tool in data-analysis. Kemisk Tidskrift 1972; 3: 34–37.
15. Miller JN, Miller JC. Statistics and Chemometrics for Analytical Chemistry, fifth edition. Pearson: Harlow, 2005.
16. Davies OL. Statistical Methods in Research and Production, Oliver and Boyd: London, 1947.
17. Deming SN, Morgan SL. Simplex optimization of variables in analytical chemistry. Anal. Chem. 1973; 45: A278–A279.
18. Kowalski BR. Chemometrics, Mathematics, and Statistics in Chemistry, NATO ASI Series C, Mathematical and Physical Sciences, Vol. 138, D. Reidel Publishing Company: Dordrecht, 1984.
19. Sharaf MA, Illman DL, Kowalski BR. Chemometrics, Wiley: New York, 1986.
20. Brereton RG. Fourier transforms: use, theory and applications to spectroscopic and related data. Chemom. Intell. Lab. Syst. 1986; 1: 17–31.


21. Brereton RG. Monitor and tutorial section: editorial. Chemom. Intell. Lab. Syst. 1986; 1: 5.
22. Brereton RG. Chemometrics: Applications of Mathematics and Statistics to Laboratory Systems, Ellis Horwood: Chichester, 1990.
23. Massart DL, Brereton RG, Dessy RE, Hopke PK, Spiegelman CH, Wegscheider W. Chemometrics Tutorials, Elsevier: Amsterdam, 1990.
24. Sibisi S, Skilling J, Brereton RG, Laue ED, Staunton J. Maximum entropy signal processing in practical NMR spectroscopy. Nature 1984; 309: 801–802.
25. Geladi P, Kowalski BR. Partial least squares regression—a tutorial. Anal. Chim. Acta 1986; 188: 19–32.
26. Rinnan A. Scandinavian Symposium of Chemometrics 12 in Hotel Legoland, Denmark. J. Chemometr. 2012; 26: 423–424.
27. Vaux DL. Research methods: know when your numbers are significant. Nature 2012; 492: 180–181.
28. Editorial. Reducing our irreproducibility. Nature 2013; 496: 398.
29. Bolton B. Education: historical statistics, House of Commons Library, Standard Note: SN/SG/4252 (2012).
30. Roach JPC, Ed. The University of Cambridge: Epilogue (1939–56), A History of the County of Cambridge and the Isle of Ely: Volume 3: The City and University of Cambridge (1959), pp. 307–312, Victoria County History, Boydell and Brewer.
31. Aitkenhead D. The Guardian, Friday 6 December 2013, url http://www.theguardian.com/science/2013/dec/06/peter-higgs-boson-academic-system
32. Brereton RG. Consequences of sample sizes, variable selection, model validation and optimisation for predicting classification ability from analytical data. Trends Anal. Chem. 2006; 25: 1103–1111.
33. Naseri A, Bahram M. A perspective on the growth of chemometrics in Iran: a glance into activities between 2005 and 2012. J. Chemometr. 2013; 7: 263–277.

APPENDIX

Answers to questions of Table I

(1) When there is one variable, the abscissa (sometimes called the x-axis) of the chi-squared distribution represents the squared distance from the mean in units of standard deviations, for an underlying normal distribution. The t-distribution is the equivalent of the normal distribution when the number of samples is low. The F-distribution relates to the t-distribution in the same way the chi-squared does to the normal distribution. When sample sizes exceed about 20, the t-distribution and normal distribution, and also the F-distribution and chi-squared distribution, are very similar. When there is more than one variable, squared distances from the mean are used as there is no positive or negative direction. All four distributions assume that the underlying population is drawn from a normal distribution.
(2) There is no reason why. If two variables are independent, the correlation coefficient is 0, but the converse is not necessarily true. As an example, take a variable x that is symmetric around 0. Take another variable y = x². The variable y is dependent on x, but the two variables have a correlation of 0. Independence and correlation are commonly confused in chemometrics.
(3) The Mahalanobis distance. When there is only one variable, the Mahalanobis distance is the same as the distance from the mean in units of standard deviations.
(4) This happens if the data are not mean centred. In chemometrics, it is quite common to use non-mean-centred data, for example, in spectroscopy or chromatography, where there is often interest in variation above the baseline.
(5) Between 1000 and 10 000. For most distributions, the centre points are often easy to model and, in many cases, resemble a normal distribution. It is often easy to identify samples that are very many Mahalanobis distance units from the centre. But most common symmetrical distributions differ at the wings, which define for example the 99% limit. In order to obtain a good idea of the shape in this area of a distribution, we need to observe sufficient samples to model the shape adequately. Between 10 and 100 samples beyond the limit are needed to obtain a good feel for the shape; because only 1 sample in 100 on average falls beyond a 99% limit, this requires between 1000 and 10 000 samples in total. It is very common to use 99% as an important decision threshold, yet much data in chemometrics, for example for environmental and metabolomic samples, often fail (multi)normal distribution tests miserably. Higher decision thresholds such as for example 99.99% (or p = 0.0001) are much safer as they are so far from the mean that almost any distribution would determine that the test sample is not part of the reference population.
(6) The most meaningful representation (for a training set) is the PC scores plot. If there is a large variable-to-sample ratio, which is typical of modern instrumental data, it can be shown that PLS-DA scores plots show apparent discrimination even using random data. By analogy, toss an unbiased coin 10 times (equivalent to 10 samples) and repeat this 1000 times over. There will be many situations in which the coin turns up H 8 times or more. By analogy to PLS-DA, the experiments whereby the coin turns up H 8 times or more could be selected, and the other experiments ignored or deemed of little significance, and it will appear the coin is biased. It is very common, especially in biology, to prefer PLS-DA scores plots to PCA plots as on training sets, they often appear to show the proposed separation between groups more clearly, but this can create a false impression.
(7) Because if there are more variables than samples, the variance–covariance matrix does not have an inverse. This can be overcome by performing PCA and taking some or all non-zero PCs, or alternatively by reducing the number of raw variables.
(8) MSPC can be considered a one-class classification problem. The in-control samples form a single group, and methods for one-class classification such as SIMCA or support vector domain description are used to see whether an unknown sample fits into the in-control group, often using some sort of statistical confidence test.
(9) Fractional factorial designs can be used when there are 2^N experiments, allowing a maximum of 2^N − 1 variables to be studied, that is, 3, 7, 15, 31 and so on. Plackett–Burman designs allow up to 4N − 1 variables to be studied using 4N experiments, that is, 3, 7, 11, 15, 19 and so on. When the number of experiments is the same for the two designs, they are equivalent as can be shown by permuting the rows and columns.
(10) No. As the number of variables increases, although the density of samples is greatest at the centre, the surface of an imaginary hypersphere around the mean increases, and so the probability of finding a sample is not necessarily the maximum in the centre. This can be verified by the chi-squared distribution. With 1 degree of freedom, the maximum probability of finding a sample is at the centre. But by the time there are three variables, the maximum is no longer in the centre. Similar comments apply to the F-distribution.
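
The point made in answer (2) can be checked numerically. The following is a minimal Python sketch, not part of the original article: the sample values and the helper name `pearson` are assumptions chosen for illustration. It takes a variable x that is symmetric around 0, sets y equal to the square of x, and computes the Pearson correlation coefficient, which comes out as 0 even though y is completely determined by x.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values, symmetric around 0 (an assumption for this sketch).
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v ** 2 for v in x]  # y depends entirely on x

r = pearson(x, y)
print(f"correlation between x and its square: {r:.3f}")
```

A correlation of 0 here says only that there is no linear relationship; it says nothing about independence.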

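
The coin-tossing analogy in answer (6) can be made concrete. The sketch below is illustrative only: the function name `prob_at_least`, the seed and the variable names are assumptions, not taken from the article. It computes the exact binomial probability of 8 or more heads in 10 tosses of an unbiased coin, which is 56/1024 or about 0.055, so roughly 55 such runs are expected in 1000 repeats. It then simulates the experiment and shows how selecting only those runs makes the coin look biased.

```python
import random
from math import comb


def prob_at_least(k, n, p=0.5):
    """Exact binomial probability of at least k heads in n tosses."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_tosses, n_repeats, threshold = 10, 1000, 8

p_extreme = prob_at_least(threshold, n_tosses)
print(f"P(at least {threshold} heads in {n_tosses} tosses) = {p_extreme:.4f}")
print(f"expected such runs in {n_repeats} repeats = {n_repeats * p_extreme:.1f}")

# Simulation with a fixed, arbitrary seed so that the run is repeatable.
rng = random.Random(0)
runs = [sum(rng.random() < 0.5 for _ in range(n_tosses)) for _ in range(n_repeats)]

# "Cherry-pick" the runs that appear to support the hypothesis of a biased coin.
selected = [heads for heads in runs if heads >= threshold]

print(f"simulated runs with at least {threshold} heads: {len(selected)}")
print(f"mean heads over all runs: {sum(runs) / n_repeats:.2f}")
if selected:
    print(f"mean heads over selected runs only: {sum(selected) / len(selected):.2f}")
```

The mean over all runs stays close to 5 heads, whereas the mean over the selected runs is at least 8 by construction, which is the false impression the answer warns about.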

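
Answers (6) and (7) both follow from having more variables than samples, and this can be sketched with pure noise. The example below is an illustration under stated assumptions, not the author's method: it uses NumPy, and it uses ordinary least squares as a simple stand-in for PLS-DA, which is a deliberate substitution. The sizes (10 samples, 50 variables), the seed and the labels are arbitrary. It shows that the variance–covariance matrix is rank deficient, so it has no inverse, and that random data can be made to separate two arbitrary groups perfectly on the training set.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

n_samples, n_variables = 10, 50                     # more variables than samples
X = rng.standard_normal((n_samples, n_variables))   # pure noise, no real structure
y = np.array([1.0] * 5 + [-1.0] * 5)                # arbitrary two-group labels

# Answer (7): the variance-covariance matrix is rank deficient, hence singular.
cov = np.cov(X, rowvar=False)                       # shape (50, 50)
rank = np.linalg.matrix_rank(cov)
print(f"covariance matrix shape {cov.shape}, rank {rank}")

# Answer (6), by analogy: a linear model fits the training labels exactly.
# Least squares is used here as a stand-in; this is NOT PLS-DA itself.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
train_acc = np.mean(np.sign(X @ coef) == y)

# New noise with random labels: the apparent separation should not carry over.
X_new = rng.standard_normal((1000, n_variables))
y_new = np.where(rng.random(1000) < 0.5, 1.0, -1.0)
test_acc = np.mean(np.sign(X_new @ coef) == y_new)

print(f"training accuracy: {train_acc:.2f}")
print(f"accuracy on new random data: {test_acc:.2f}")
```

The perfect training accuracy is an artefact of the sample-to-variable ratio; only validation on data not used to build the model reveals that there is nothing to find.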