An Overview of Multivariate Data Analysis
1. INTRODUCTION
Over the past quarter century, the technology of computation has experienced a revolutionary
development which continues unabated, so that sophisticated analyses of large and complex data sets
are becoming rapidly more feasible, while repeated analyses of small-to-moderate sized data sets can
be virtually instantaneous from table top terminals in the statistician’s office. The hope has often been
expressed that the technological revolution in computation would bring in its wake a comparable
revolution in data analysis techniques, so that the science of data analysis would jump to a discernibly
higher level of organization, power, and beauty. Mathematical statistics could then have a regeneration
of its own as it faced the mathematical problems growing out of the new formulations of data analysis.
Progress towards the projected new era has so far been limited, but many areas are visibly taking shape,
and the next quarter century is full of promise as a period in which notable developments will appear.
This paper concerns directions of conceptual development which, in the view of the author, will
generate important components of the desired forward push.
Statisticians will increasingly require concepts adequate to frame analyses of complex highly
multivariate data sets. But many academic statisticians have tended to define multivariate
analysis narrowly, excluding even such obviously multivariate data types as factorial experiments,
contingency tables, and time series. A preferable viewpoint would be to start with ordinary “univariate”
data as the simplest case of multivariate data, relating one substantive variable (like weight) to an
indexing variable (labelling the animals weighed), and to place no rigid limits on the varieties of data
types to be called multivariate. The standard textbooks of multivariate analysis (for example, [2,
8, 33]) present basic and elegant techniques built around multiple linear regression, correlations
including canonical correlations, multiple linear discriminant analysis, and principal component analysis.
The narrow range here reflects categorizations more natural for mathematical statistics than for applied
statistics. It is encouraging that the direct predecessors [27, 28] of the new Journal of Multivariate
Analysis embrace a broad outlook.
Theorists of multivariate analysis clearly need to venture away from multivariate normal
models. One might also hope for less emphasis on technical problems within specific theories of
inference, whether frequentist or Bayesian, and whether decision-oriented or conclusion-oriented.
Instead, attention should be directed towards questions posed by data, for example: Is it plausible to
ignore certain possible relations? How should one sift through large arrays of interaction-type
parameters to find those which are likely to be reproducible? What simple systems of functional forms
of distributions are likely to be adequate to fit the data without being overly rich? What are the criteria
which measure fit? Can simplifying structure be acceptably introduced into complex models, as by
specifying a prior distribution of parameters which itself has estimable parameters? What quantities of
importance does a certain model predict? What quantities can be robustly estimated, and what others,
such as properties of tails, are sensitive to distribution assumptions? How are wild values to be detected
and handled? Eventually, a more utilitarian and catholic viewpoint may lead to new categorizations of
the subject of statistical analysis, and thence to new emphases in theoretical statistics.
The data analyst envisaged here is a professional who sits in a central position among
investigators with data, theoretical statisticians and computer specialists. Among these traditional
scientific types, he should serve the objectives of communication and integration. His direct
contributions are to the development of new techniques and to the accumulated experience in the use
of many techniques. For this, he needs the wisdom to evaluate proposed techniques along dimensions
of efficiency and resistance to error, both statistical and computational, and along the dimension of
relevance to the substantive scientific enterprise involved.
2. ASPECTS OF DATA STRUCTURE
A primary need is for a framework which permits adequate classification and description of a
data set for data analytic purposes. The framework sketched below goes beyond the logical or network
aspects of data structure (see Section 2.1), to include basic ways of thinking about data which help bring
out meaning (Section 2.2), and to include formal mathematically expressed hypotheses (Section 2.3).
The three levels of data structure are ordered in the sense that the second draws on the first while the
third draws on the first two. Together they are conceived as providing the basic ingredients for the
analyst’s model of his data.
2.1. Logical structure.
Any data set can be represented as a list of values of variables where each value must be tagged
by identifiers for the variable, for the type of unit, and for the specific unit described by the value. For
example, the value 121 may refer to the result of an I.Q. test devised for 8-year old children generally
and applied to John Doe in particular. Usually, however, a data set can be represented much more
compactly than in the list form with every value tagged by three pieces of information. Compactness is
possible because the data have a logical structure defined by interrelationships among the variables and
units involved.
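As a concrete illustration, the contrast between the fully tagged list form and a compact representation might be rendered in a modern scripting language as follows; the variables, units, and values are hypothetical and serve only to make the two layouts explicit.

```python
# Fully tagged list form (hypothetical example): every value carries
# identifiers for the variable, the type of unit, and the specific unit.
tagged_list = [
    {"variable": "IQ",     "unit_type": "child", "unit": "John Doe", "value": 121},
    {"variable": "IQ",     "unit_type": "child", "unit": "Jane Roe", "value": 104},
    {"variable": "weight", "unit_type": "child", "unit": "John Doe", "value": 31.5},
    {"variable": "weight", "unit_type": "child", "unit": "Jane Roe", "value": 29.0},
]

# The same information in compact form: the logical structure (one type of
# unit carrying two variables) lets the three tags be factored out of the values.
compact = {
    "unit_type": "child",
    "units":     ["John Doe", "Jane Roe"],
    "variables": ["IQ", "weight"],
    "values":    [[121, 31.5],     # rows follow "units"
                  [104, 29.0]],    # columns follow "variables"
}
```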
Understanding the logical structure of a data set provides more than a guide to efficient physical
representation of the data. Such understanding is also a fundamental prerequisite for understanding at
the deeper level of substantive scientific information conveyed by data.
The practice in statistics and in most fields of application has been to treat each data set on an
ad hoc basis, i.e., to obtain a grasp of the logical structure and to use this knowledge in data
representation and in motivating analyses, but still to leave the understanding of structure implicit in the
sense of not relating the particular structure to a highly developed, inclusive, multifaceted and formal
typology of data structures. Indeed, in the present state of the art, only a rudimentary description of the
required typology is available. Improved descriptions may yield dividends in uncovering important
directions of progress for multivariate data analysis.
The basic concept is that of a variable. Variables can be conceived as substantive variables or indexing
variables (cf. Section 2.2). Usually a given variable is viewed as predominantly one or the other, but a
dual viewpoint is always possible. Thus, I.Q. is usually regarded as a substantive variable related to
educational psychology, while the names of 8-year old children are mainly values of an indexing
variable. But the latter could convey information about national origins, and thus become substantive,
while I.Q. could be regarded formally as a device for stratifying individuals, for example, to construct a
frequency distribution, which is more akin to indexing individuals than to evaluating them in some
meaningful way. Since considerations of logical structure relate primarily to the indexing or stratifying
aspect of variable conception, a variable should be viewed in Section 2.1 as a logical device which groups
units into categories, viz., categories defined by a common value of the variable.
It is clear that an important piece of information about each individual variable is the
mathematical space in which the variable takes values. The chief instances are (a) dichotomy, such as
YES or NO, (b) nonordered polytomy, such as 4 different chemical drug treatments, (c) ordered
polytomy, such as GOOD or FAIR or POOR, (d) integer response (usually nonnegative, often counts), and
(e) continuous response, at least with enough digits recorded to make the assumption of continuity an
adequate approximation. Mixtures of these types also appear. The description of the logical structure of
a given set of data begins naturally with a list of all the variables involved, together with a description of
the space of possible values for each variable.
A suggested second element in the description is a tree structure where each node corresponds
to one of a complete set of k variables (including indexing variables) and where the k nodes are
interconnected by a set of k - 1 directed lines. It should be emphasized that each node corresponds to a
variable as a concept, not to a specific value or data point. Figures 1a, b, c, d illustrate typical variables. A
variable with no branch directed into its node may be called a root variable. Most often, a single data set
has only one root variable which is the indexing variable of the individuals or units or atoms described
by the remaining variables. The nodes corresponding to the remaining nonroot variables each have an
entering branch coming from the node corresponding to the indexing variable of the set of units to
which the variable applies. The standard example of multivariate analysis, as illustrated in Fig. 1a, has a
single type of unit measured on a set of substantive variables. Figure 1b illustrates how there can be
more than one root variable and, correspondingly, more than one branch entering a nonroot variable. If
a nonroot variable is conceived as an indexing variable whose possible values or levels are susceptible to
further classification, then several branches may emerge from the node of an indexing variable, as
illustrated in Fig. 1c. These hierarchical structures are familiar to practitioners of the analysis of variance,
where the tree structure implies a list of possibly meaningful mean squares which can be computed
from the data.
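One possible machine representation of such a tree of variables is sketched below; the variable names and value spaces are invented, and the encoding (each node recording its value space and the indexing variable from which its single entering branch comes) is only one of many that would serve.

```python
# A sketch of a variable tree in the style of Fig. 1a: one root indexing
# variable and k - 1 branches to the variables defined over its units.
variable_tree = {
    "animal": {"space": "index",              "parent": None},      # root variable
    "sex":    {"space": ("M", "F"),           "parent": "animal"},  # dichotomy
    "drug":   {"space": ("A", "B", "C", "D"), "parent": "animal"},  # nonordered polytomy
    "weight": {"space": "continuous",         "parent": "animal"},  # continuous response
}

def root_variables(tree):
    """Variables with no entering branch, i.e., the root indexing variables."""
    return [name for name, node in tree.items() if node["parent"] is None]

print(root_variables(variable_tree))   # ['animal']
```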
Figure 1d illustrates two kinds of issue which may complicate the use of a tree structure of
variables. First, the choice of what to identify as units and as variables is not necessarily unambiguous. A
common example involves time as an index variable. If an animal is weighed at successive points of time,
the structure may be conceived as a string of different weight variables defined over the indices of
individual animals. An alternative which may often be more streamlined is to create a new indexing
variable of (animal, time) pairs, so that animal index, time index, and weight become three variables
defined over the new indexing variable. The second issue concerns what may be called a conditioned or
filtered variable, viz., a variable defined only over a subset of the values of an indexing variable, where the
subset is determined by the values of one or more variables along directed branches growing out of the
same indexing variable. The examples shown in Fig. 1d relate to certain variables being unavailable after
death and certain variables being available only on one sex. The device suggested for denoting a
conditioned variable is a second type of directed branch proceeding from the conditioning variable to
the conditioned variable, represented as a dotted line in Fig. 1d.
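The re-indexing discussed above can be made concrete in a short sketch; the animals, times, and weights are invented, and the point is only the change of indexing variable.

```python
# A string of weight variables defined over the animal index ...
wide = {
    "animal_1": {"weight_t1": 30.1, "weight_t2": 32.4, "weight_t3": 35.0},
    "animal_2": {"weight_t1": 28.7, "weight_t2": 30.2, "weight_t3": 31.9},
}

# ... recast so that (animal, time) pairs form a new indexing variable,
# with animal index, time index, and weight as three variables over it.
long_form = []
for animal, record in wide.items():
    for t in (1, 2, 3):
        long_form.append({"animal": animal, "time": t,
                          "weight": record[f"weight_t{t}"]})

print(long_form[0])   # {'animal': 'animal_1', 'time': 1, 'weight': 30.1}
```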
It is interesting that the conditioned variable concept introduces a type of nesting in which
variables are nested within categories of other variables, while the hierarchical tree structure illustrated
in Figs. 1c and 1d introduces nesting in which units are nested within entities of a different type. These
two types of nesting are logically distinct and should not be confused with each other.
The two aspects of variable structure described above, viz., the aspect of space of values and the
aspect of tree structure, are nonspecific in the sense of applying to many possible data sets involving the
same variables rather than applying specifically to a given data set whose structure is desired as a
prelude to analysis. Some tightening of the structure may be possible if it is meant to apply only to a
specific data set. Thus, some or even most possible values of a certain variable may not appear in the
data set. Also, there may be de facto variable conditioning, as when certain information was not
recorded or was lost on certain blocks of data.
A third aspect of logical data structure is balance. Balance refers quite directly to a specific data
set. The terms balanced or partially balanced are often used in connection with designs involving
blocking and factorial structures, but the definitions given are usually highly specific to narrow situations
or they are somewhat loose. A path to a definition which is both precise and general is to associate
balance with the recognition of groups of symmetries under the permutation of levels of certain
variables. For example, an array of R rows and C columns with n observations in each of the RC cells has
symmetry under the n! permutations of index levels within each cell and also under the R!C!
permutations of cells. If the number of observations n_ij in row i and column j varies, then there may be
balance only within cells, unless n_ij depends only on i or on j, in which cases different groups of
symmetries come into play. Another example of balance, although not often referred to as such, is the
familiar multivariate data matrix of n rows and p columns giving the values of p variables on each of n
individuals. The logical structure of such a data matrix is clearly invariant under all n! permutations of
the individuals. It is also invariant under certain permutations of variables, for example all p!
permutations of variables of the same type such as continuous measurements.
Balance makes possible the efficient storage of data as multiway arrays where the labeling of
individual values of variables can be represented very compactly. Balance also introduces simple
structure into mathematical models and related theory, such as when balance introduces simplifying
orthogonalities into the models for analysis of variance. The benefits include simpler theory of
inference, simpler computational algorithms for data analysis, and simpler interpretation of the results
of analysis. The price is mathematical analysis of many special cases, for there are many kinds and
degrees of symmetry possible in a reasonably complex data structure, and detailed development of
techniques to take advantage of the symmetries is correspondingly varied. Finally, it may be noted that
the concept of missing value is closely related to that of balance. From the viewpoint of logical data
structure alone there is no reason to tag any particular absent observation as missing unless it destroys
symmetry. An important area of concern about missing values in practical data analysis centers around
repairing the damage from destroyed symmetry (cf., Section 7).
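The way balance permits compact storage can be illustrated with a small sketch; the records are invented, and the balance check used (every cell holding the same number of observations) is only the simplest case of the symmetries discussed above.

```python
import numpy as np
from collections import defaultdict

# Observations tagged by (row, column, value); if every cell holds the same
# number n of observations, the data pack into a dense R x C x n array and
# the individual unit labels need not be stored at all.
records = [
    (0, 0, 5.1), (0, 0, 4.9), (0, 1, 6.0), (0, 1, 6.2),
    (1, 0, 5.5), (1, 0, 5.3), (1, 1, 6.4), (1, 1, 6.1),
]

cells = defaultdict(list)
for i, j, value in records:
    cells[(i, j)].append(value)

cell_sizes = {len(v) for v in cells.values()}
if len(cell_sizes) == 1:                       # balanced: same n in every cell
    n = cell_sizes.pop()
    R = 1 + max(i for i, _ in cells)
    C = 1 + max(j for _, j in cells)
    data = np.empty((R, C, n))
    for (i, j), values in cells.items():
        data[i, j, :] = values
    print("balanced; packed into an array of shape", data.shape)   # (2, 2, 2)
else:
    print("unbalanced; the tagged list form must be retained")
```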
There is at present no polished typology of logical data structures for multivariate data analysis.
The preceding discussion may raise more questions than it answers, but it will have served its purpose if
it draws attention to the need for a more systematic and extensive foundation at this fundamental level.
Increased cooperation among data analysts and computer scientists in this area would clearly be
beneficial.
2.2. Epistemic structure.
The rational process by which any data set contributes to a particular field of knowledge
depends strongly on preconceptions and understanding associated with that field. Some of this previous
knowledge is quite specific to the field, such as knowledge of how and why an I.Q. test was devised.
Other aspects are general across many fields and so become parts of a general scientific outlook. These
latter aspects are central to data analysis and statistics, especially when they facilitate inference beyond
the immediate facts of a particular data set to more general circumstances.
Three varieties of a priori knowledge will be surveyed briefly. The first of these was introduced
above, namely the knowledge that the value or level of a certain variable contains substantive
information within some recognized field of inquiry, as opposed to being simply an index variable. The
second variety concerns a distinction between free variation and fixed variation. The third refers to
symmetry conditions on a priori knowledge, and leads directly into probabilistic concepts. Each of these
varieties represents a type of knowledge which comes with a data set, from its scientific context, and
which is not empirical in the sense that the information in the data itself does not reinforce or refute the
preconceptions.
As argued in Section 2.1, any variable is capable of substantive interpretation, although certain
variables, such as names, are usually accepted mainly as indexing variables. The possible dual
interpretation corresponds to the mathematical duality between a function as a meaningful entity in
itself and the set of values of a function which formally resemble index values. The role of duality in the
mathematics of multivariate analysis is stressed in.
Some variables are regarded as free in the sense of reflecting natural variation or experimental
variation in response to controlled conditions. Other variables are regarded as fixed, sometimes because
their values have been deliberately set, as in experimentation, but often because the context of the data
set suggests that the variable should be seen as having a determining or causal role in the phenomenon
under study rather than being a direct measure on the phenomenon. The most common statistical
example appears in multiple regression analysis where there is a free or dependent variable to be
related to a set of fixed or independent variables. A similar distinction appears throughout multivariate
analysis, in analysis of variance, and in contingency table analysis, and the distinction is incorporated in
turn into formal hypotheses or models. It is clear that substantive variables can be free or fixed. Index
variables are most often taken to be fixed, since names are usually assigned, but sometimes even pure
index variables can be response variables, as when a name conveys the winner of a race. Whether or not
a variable is free or fixed cannot be deduced from the logical structure of the data, nor from the
quantitative or qualitative information in the data. The distinction between free and fixed need not be
entirely firm in a given situation, but it is conceptually important, and must be made on genuine a priori
grounds.
Most standard data reduction techniques are anchored in a judgment that certain units or levels
of an indexing variable are to be treated symmetrically. The computation of sample moments and the
display of a sample frequency distribution are prime examples where an implicit a priori judgment has
been made to treat the sample individuals evenly. When several indexing variables appear in a
hierarchical structure, separate symmetry conditions may be applied at several levels of the hierarchy.
The term exchangeable is used in the theory of personal probability for models which treat
symmetrically all subsets of any given size from a set of units, and the same term can be used
consistently in a broader data analytic context to describe the a priori symmetry judgments discussed
above. An analogous symmetry judgment of stationarity is often imposed on time series data. Again, the
term is usually applied to specific probability models for time series data, but is appropriate at a looser
level of assumption to mean that any time stretch of a given length would be regarded and treated a
priori like any other time stretch of the same length. Similar isotropicity assumptions can be made for
data indexed by position in physical space.
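The operational content of an exchangeability judgment can be seen in a small numerical check: summaries that treat the units symmetrically are unchanged by any relabelling of the units. The data below are simulated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 units measured on 3 variables

perm = rng.permutation(100)          # an arbitrary relabelling of the units
X_perm = X[perm]

# Sample moments computed under an even treatment of the units are invariant.
print(np.allclose(X.mean(axis=0), X_perm.mean(axis=0)))                    # True
print(np.allclose(np.cov(X, rowvar=False), np.cov(X_perm, rowvar=False)))  # True
```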
The symmetries which define exchangeability, stationarity, and isotropicity should be viewed as
idealized properties of prior knowledge associated with the idealized logical structure of a given data
set. A very basic type of knowledge is the uncertain knowledge of answers to factual questions, this
being the type of knowledge which the theory of probability aspires to deal with. Accordingly it is to be
expected that the probability models which accompany the analysis of any data set will share the
symmetry judgments deemed appropriate to the data set. The restrictions thus imposed on probability
models do not entirely determine the specific mathematical forms of the models, but they powerfully
restrict these forms, often to mixtures of random samples from hypothetical populations. The specific
forms are hypotheses (cf., Section 2.3) which, unlike the symmetry judgments themselves, can and
should be tested out on the data. For example, if multiple measurements are made on a sample of 100
human subjects, exchangeability together with the theory of probability says that the 100 subjects can
be viewed as a random sample from some population, whence hypotheses about the form of that
population become conceptually meaningful as questions about a hypothetical probability mechanism
which could have generated the data. The original symmetry assessment is completely a priori,
however, since any multivariate sample distribution is a conceivable random sample from some
distribution. It may be possible to refute the hypothesis that sample individuals were presented in a
random order, but not the judgment that, as an unordered set, any subset of n individuals was a priori
equivalent to any other subset of n individuals.
2.3. Hypothesis structure.
After logical and epistemic structures are in place, specific mathematical hypotheses are often
introduced to motivate and guide analyses.
Many mathematical models describe approximate deterministic relations. Examples include the
linear models which often accompany discussions of the analysis of variance, and the quadratic relations
appearing in models for factor analysis where both factor scores and factor loadings must be estimated.
Heuristic procedures for fitting such structural models find effective use in exploratory data analysis. The
traditional models of mathematical statistics are probability models. Typically, repeated observations on
a vector of variables are regarded as drawn from a multivariate distribution. The distributions generally
have smoothness properties, and may depend smoothly on hypothesized parameters and on the values
of observables as well. Probability models can give greater precision and descriptiveness to structural
models, for example, by fitting an error distribution which provides a formal tool for quantitatively
assessing deviations from a simple deterministic model. Together with probability models comes the
availability of formal tools of statistical inference for testing fit, estimating parameters, and making
uncertain predictions. The tension between data analysis without and with probability is explored in
Section 3. In Section 4, the discussion turns to an increasingly used general class of models based on
exponential families of distributions. A broad review of mathematical models which have found
substantial use in multivariate data analysis would be a worthwhile but very lengthy task, and is beyond
the scope of this overview.
3. HOW EXPLORATORY?
Data analysis proceeds in an exploratory mode or a supportive mode. In the former, the data
analyst attempts to pry into the essence of a data set by examination from many angles, using graphs,
charts, tables, scatterplots, etc. His tools may be categories of more or less well-tried and well-
researched summary statistics, such as moments, quantiles, correlations, etc. Or he may use heuristic
algorithms to fit rough models, as in clustering and scaling techniques. In the supportive mode, formal
tools of inference are used to assess the plausibility and precision of formal hypotheses.
Mathematical statisticians have long concentrated their efforts mainly on the supportive side. In
reaction to this one-sidedness, and also to the controversial character of the concepts of inference,
some data analysts have claimed that a careful exploratory analysis can turn up everything of interest in
a data set, rendering supportive techniques unnecessary. In general, however, there are benefits to be
drawn from regarding the two modes as complementary and mutually reinforcing.
On the one hand, a too quick adoption of the supportive mode may lock the data analyst into a
spectrum of hypotheses which would be instantly ruled out by casual inspection of a few plots of
marginal distributions or plots of empirical relationships based on marginal summary statistics. On the
other hand, it is very easy to discern apparently interesting empirical relationships, especially when
many variables are sifted in search of relationships. Supportive techniques which can discount at least
some of the biases introduced by data snooping would appear to be minimal supportive adjuncts to
most exploratory techniques.
A more ambitious defender of supportive techniques could argue that the fitting of precise
mathematical models to data, with its clear logical separation of sample and population concepts, is the
key feature raising statistics to the level of a science, somewhat analogous to physics, in which exact and
sophisticated mathematics is a central part. This is not to deprecate the more intuitive and empirical
exploratory side, but rather to suggest that the two modes operating in resonance, as do experimental
and theoretical physics, create a living science of considerable range and power. Data exploration
suggests hypotheses. Formalized hypotheses in turn suggest under mathematical analysis new
quantities to compute which may be illuminating in the same way that more naive data exploration can
be illuminating. For example, the recent developments in likelihood methods for fitting log linear models
to contingency table data provide natural ways to associate specific interaction terms with individual
combinations of factors (cf., [18] and references cited therein). Having estimates of such interaction
terms, it becomes natural to plot these estimates against the analogs of main effect estimates, with the
aim of discerning special structure or relationships which might be unsuspected a priori and which
would not be visible in a large multiway array of raw data. The mutual relations among probabilistic
fitting procedures and data exploration procedures define a very large and promising field for research
in multivariate data analysis.
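A minimal sketch of the kind of display suggested above is given below for a saturated log linear model fitted to a single two-way table, where the fitted counts equal the observed counts; the table is invented, and the decomposition shown is not offered as a substitute for the likelihood procedures of [18] for more elaborate models.

```python
import numpy as np

# Invented two-way contingency table of counts.
counts = np.array([[25.0, 10.0,  5.0],
                   [12.0, 30.0,  8.0],
                   [ 6.0,  9.0, 40.0]])

log_m = np.log(counts)                              # saturated model: fitted = observed
u   = log_m.mean()                                  # overall term
u1  = log_m.mean(axis=1) - u                        # row main effects
u2  = log_m.mean(axis=0) - u                        # column main effects
u12 = log_m - u - u1[:, None] - u2[None, :]         # interaction-type terms

# Plotting each u12[i, j] against, say, u1[i] + u2[j] is one way to look for
# structure among interactions that a raw multiway array would not reveal.
print(np.round(u12, 2))
```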
To a non-statistician it may appear that the distinctions drawn here between exploratory and
supportive methods are exaggerated. In terms of the three aspects of data structure sketched in Sections
2.1, 2.2, and 2.3, both modes assume a common understanding of logical and epistemic structures, and
they differ mainly in their approach to hypothesis structure. But even the simplest and most descriptive
exploratory analysis is guided by hunches which are naturally conceived as imprecisely formulated
hypotheses. The difference lies mainly in the degree of mathematical precision associated with
hypotheses. Nevertheless, the difference of degree eventually becomes a difference of kind, as access to
powerful mathematical and inferential tools, if used with intelligence and sensitivity, becomes a highly
significant factor in statistical methodology.
The explorative side of multivariate data analysis has given rise to classes of techniques which
delve into data structure at the basic level of attempting to discern logical structure not directly
apparent in the data. The oldest such class is factor analysis, directed at finding important variables
which are linear combinations of observed variables, where the criterion of choice generally depends on
a principal component analysis; for example, if a group of variables is highly intercorrelated, then a
linear combination can be found which explains much of the common variation underlying the
correlation. Such combinations are candidates for factors. Forty years of cumulative development have
made factor analysis a vast topic to review. A start may be made from [6,21].
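The principal component step mentioned above can be sketched in a few lines on simulated data; the construction of the data (four variables sharing one common source of variation) is of course an assumption made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
common = rng.normal(size=(200, 1))
X = common + 0.5 * rng.normal(size=(200, 4))   # 4 intercorrelated variables

R = np.corrcoef(X, rowvar=False)               # sample correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The leading component explains much of the common variation and is
# therefore a candidate factor in the sense described above.
print("share of variance in first component:", round(eigvals[0] / eigvals.sum(), 2))
print("loadings of the candidate factor:", np.round(eigvecs[:, 0], 2))
```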
Computer technology has been an obvious catalyst in the development of various close relatives
of factor analysis. In the latter, one draws on correlation coefficients which are inverse measures of
distance between pairs of variables. In cluster analysis [19, 20, 22, 25, 37, 43], distances between pairs
of individuals are used to group individuals into like categories, or more generally to produce
hierarchical tree structures, with individuals at the tips of the smallest branches and with close
individuals growing out of relatively far out common branches. Multidimensional scaling [5, 29-31, 35,
36] also relies on distances, often directly elicited by asking subjects to assess relative distances among
pairs of entities. Again the objective is to place these entities, which may be attributes (e.g., colors) or
individuals (e.g., nations), in spaces of a reasonably small number of dimensions.
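One simple member of the scaling family, classical metric scaling from a matrix of pairwise distances, is sketched below; the distances are invented, and the method shown is only one of the many procedures covered by the references above.

```python
import numpy as np

# Invented pairwise distances among four entities.
D = np.array([[0.0, 2.0, 5.0, 6.0],
              [2.0, 0.0, 4.5, 5.5],
              [5.0, 4.5, 0.0, 1.5],
              [6.0, 5.5, 1.5, 0.0]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
B = -0.5 * J @ (D ** 2) @ J                    # doubly centered squared distances

eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
coords = eigvecs[:, order[:2]] * np.sqrt(np.maximum(eigvals[order[:2]], 0.0))

print(np.round(coords, 2))                     # configuration in 2 dimensions
```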
In the present state of the art, it is difficult to evaluate procedures of clustering, scaling and
factoring, except by reference to specific applications where the results can be judged meaningful
relative to outside criteria, such as recognizing that a clustering technique groups obviously similar
individuals. Supportive techniques are badly needed, and these in turn require formulation of
acceptable probabilistic models and reliable inference procedures for these models. As argued in
Section 5.1, it is a difficult task to match highly multivariate data sets to models with the right degree of
complexity so that both adequate fit and informative inferences can be secured.
5. PROBLEMS OF INTERPRETATION
5.1. Many parameters.
Many tried and tested techniques of multivariate data analysis were invented at a time when 10
was a typical number of variables in an ambitious data collection program. Now, with automation and
expanded support for scientific investigations, data sets having 100 or even 1000 variables are not
uncommon. Multivariate data analysts need therefore to cultivate increasingly the habit of asking
whether their data will bear the weight of their methods. The question reduces to asking whether fitted
parameters are meaningful or, conversely, whether the numerical processes which produce them are
not empty exercises. Sometimes evidence can be adduced after the fact by recognizing, for example,
substantive meaning in clusters, factors or scales, or by successfully using a fitted model for prediction.
Still, a crucial problem for theoretical statistics is to assess the evidence internally during the course of
data analysis, and to alter that course where necessary so that the outputs of analysis have high signal-
to-noise ratio.
Certain monotonicity relations are nearly self-evident. The potential yield of reproducible model
structure from a given data set depends directly on the sharpness of differences and strength of
relations in the phenomenon underlying the data. Also, the more variables are acquired, the greater the
danger that interesting differences will be undetectable. And the larger the samples of units
observed, the better one is able to detect structure. For example, it is easier to detect one salient effect
if 10 standard solutions are assayed once than if 100 standard solutions are assayed once; but in either
case the task becomes easier if the assays are repeated.
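The assay example lends itself to a small simulation, sketched below with invented effect sizes: one solution out of k carries a true effect, each solution is assayed r times, and detection is taken to mean that the affected solution shows the largest average assay value.

```python
import numpy as np

rng = np.random.default_rng(2)

def detection_rate(k, r, delta=2.0, trials=2000):
    hits = 0
    for _ in range(trials):
        means = np.zeros(k)
        means[0] = delta                                       # the one salient effect
        averages = rng.normal(loc=means, size=(r, k)).mean(axis=0)
        hits += averages.argmax() == 0
    return hits / trials

for k in (10, 100):            # number of standard solutions
    for r in (1, 3):           # number of assays of each
        print(f"k={k:3d}, r={r}: detection rate {detection_rate(k, r):.2f}")
```

The rates fall as k grows and rise as r grows, in line with the monotonicity relations stated above.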
In terms of statistical inference, the problem with many parameters is that only a relatively small
subset of the parameters, or a small set of functions of the parameters, carry important messages of
substantive interest, while the remaining parameters become nuisance parameters which obscure the
desired information through the unknownness of their values. The problem is especially severe when
the parameters of interest are not known in advance but must be identified by scanning the data.
The conceptual tools for approaching many parameters are not yet well formed, but two basic
themes can be identified. The first of these is the need for measures of the confusion affecting questions
of interest which is due to the uncertainty about many parameters. The decision-theoretic approach to
inference evaluates procedures in terms of operating characteristics, and there is confusion about these
operating characteristics deriving from their dependence on unknown parameters. Admissibility theory
generally restricts the class of acceptable procedures to Bayes rules. Openly Bayesian approaches can go further by assessing
posterior probabilities for statements about parameters, but still there is confusion due to the
vagueness of initial probability assessments. Approaches to inference which provide upper and lower
bounds on posterior probabilities have built-in measures of confusion in the differences between the
bounds. Further and more direct attempts should be made to define indices of confusion for data drawn
from standard models, even if the indices are rather crude and tenuous in their relation to ideal theories
of statistical inference.
The second theme is the need for guidelines in the use of confusion indices when they are
available. If confusion is large, then one should consider simplifying a model, but this can be done only
at the cost of making the model less realistic. At present, the tradeoff between realism and clarity of
message can only be made intuitively. The author has discussed elsewhere [9] the possibility of making
the tradeoff more formal.
In the case of multivariate techniques, the decision to fit large numbers of parameters is often
made casually, not because adequate fit requires them, but because they arise naturally in the
mathematical formulation of the model and can be fitted with relative ease. Parameter reduction
techniques such as those sketched in Section 4 would then appear to offer clear prospects for more
illuminating data analyses, especially when a decrease in confusion can be bought at little apparent cost
in realism.
5.2. Causality.
A major objective behind much multivariate data collection is to provide support for hypotheses
of cause and to measure the strength of causal relations. In consequence, the persistent question
cannot be avoided: Can statistical data ever be used to demonstrate cause? The answer would appear
to be yes, but with careful qualifications.
One objection to statistical data can be quickly dismissed. This is the argument which rules out
evidence based on statistical data because numbers cannot specify any mechanism to explain the cause
and effect relation and, in particular, can specify no deterministic mechanism. A carefully controlled and
randomized experiment in which, say, 20 out of 20 drug treated animals survived, while only 2 out of 20
placebo treated animals did so, would generally carry with it a strong implication of causal connection
between treatment and survival, even assuming that the biochemical action is a mystery. Further
detailed knowledge of the mechanism provides a greater degree of understanding about the
circumstances under which causation is operating, but there is no discrete jump to a different and more
satisfying concept of cause. Statistical data and statistical evidence do not differ in kind from empirical
data and evidence generally. As argued below, any empirical data can have only an indirect bearing on
the establishment of a hypothesized causal mechanism. Nor should one require causal explanation to be
completely deterministic. Whether or not all scientific explanation can ultimately be reduced to
deterministic models may be an open question for philosophical disputation. But in the real world,
residual uncertainty can never be eliminated entirely and often must be faced probabilistically,
especially with biological and social phenomena.
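The hypothetical 20-out-of-20 versus 2-out-of-20 experiment described above can be quantified with a standard exact test, as sketched below; this is offered only as one familiar way of expressing how implausible such a split would be in the absence of any treatment effect.

```python
from scipy.stats import fisher_exact

table = [[20,  0],    # drug:    survived, died
         [ 2, 18]]    # placebo: survived, died

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"two-sided p-value: {p_value:.2e}")   # exceedingly small
```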
Necessary to any assignment of cause is an a priori judgment that an explanatory (perhaps
probabilistic) mechanism could plausibly exist. Otherwise, evidence of association in statistical data has
no direct bearing on the presence or absence of causal relations. The train of logic is that a causal
hypothesis is recognized to imply certain hypotheses of association or increased variability, or other
observable manifestation. Then the latter type of hypotheses can be tested on data, and the failure of
such tests to reject the data patterns implicit in the causal hypothesis provides negative support for the
hypothesis. Negative support is unfortunately the best available, but accumulation of such support from
many data sets eventually creates a sense of confidence akin to positive support. This is surely as close
to proof as is logically possible with empirical phenomena.
So much for philosophy. The practical difficulties in establishing cause are even thornier than the
philosophical difficulties. One must sort out competing causal hypotheses and find paths through
complex patterns of multiple causes, and through hierarchical systems under which, for example, factor
1 may influence factor 2 and both may subsequently influence factor 3. The principles of experimental
design can of course help with the problem of multiple causes, making some implausible by
randomization and disentangling others by producing orthogonal design vectors. See [7] for some
difficult aspects of design in relation to causal analysis.
The method of path analysis, due originally to Sewall Wright, has been much used in recent
years by social scientists to help sort out complex patterns. The underlying idea is to create restricted
linear models in which terms are inserted only where plausible causal hypotheses exist. The pattern of
observed correlations is then tested, mainly by eye, for conformity with the pattern implied by the
restricted linear model. Some basic references are [11, 38, 47, 48, 49], some recent references are [3,
10, 26, 40], and a bibliography [23] is available. The simultaneous linear equation methods of
economics [16] are close relatives of path analysis. With a few exceptions [38, 44-46], statisticians have
not entered the thickets of causal linear models. The need for supportive techniques is clear, however,
and opportunities to sharpen a promising tool should not be neglected.
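The flavor of path analysis can be conveyed by a toy sketch on simulated data, not drawn from the references above: for the chain in which factor 1 influences factor 2 and factor 2 in turn influences factor 3, with no direct path from 1 to 3, the restricted linear model implies r13 = r12 r23, and the observed correlations are compared with that implication.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=500)   # 1 -> 2
x3 = 0.6 * x2 + rng.normal(scale=0.8, size=500)   # 2 -> 3, no direct 1 -> 3 path

R = np.corrcoef([x1, x2, x3])
r12, r23, r13 = R[0, 1], R[1, 2], R[0, 2]

print(f"observed r13           = {r13:.3f}")
print(f"implied  r13 = r12*r23 = {r12 * r23:.3f}")   # close, as the restricted model asserts
```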