State-of-the-Art Review
A User’s Guide to the Brave New World of
Designing Simulation Experiments
Jack P. C. Kleijnen
Department of Information Systems and Management/Center for Economic Research (CentER),
Tilburg University (UvT), Postbox 90153, 5000 LE Tilburg, The Netherlands, kleijnen@[Link]
Susan M. Sanchez
Operations Research Department and the Graduate School of Business and Public Policy,
Naval Postgraduate School, Monterey, California 93943-5219, USA, ssanchez@[Link]
Thomas W. Lucas
Operations Research Department, Naval Postgraduate School, Monterey, California 93943-5219, USA,
twlucas@[Link]
Thomas M. Cioppa
U.S. Army Training and Doctrine Command Analysis Center, Naval Postgraduate School,
PO Box 8692, Monterey, California 93943-0692, USA, [Link]@[Link]
Many simulation practitioners can get more from their analyses by using the statistical theory on design
of experiments (DOE) developed specifically for exploring computer models. We discuss a toolkit of
designs for simulators with limited DOE expertise who want to select a design and an appropriate analysis
for their experiments. Furthermore, we provide a research agenda listing problems in the design of simulation
experiments—as opposed to real-world experiments—that require more investigation. We consider three types of
practical problems: (1) developing a basic understanding of a particular simulation model or system, (2) finding
robust decisions or policies as opposed to so-called optimal solutions, and (3) comparing the merits of various
decisions or policies. Our discussion emphasizes aspects that are typical for simulation, such as having many
more factors than in real-world experiments, and the sequential nature of the data collection. Because the same
problem type may be addressed through different design types, we discuss quality attributes of designs, such as
the ease of design construction, the flexibility for analysis, and efficiency considerations. Moreover, the selection
of the design type depends on the metamodel (response surface) that the analysts tentatively assume; for
example, complicated metamodels require more simulation runs. We present several procedures to validate the
metamodel estimated from a specific design, and we summarize a case study illustrating several of our major
themes. We conclude with a discussion of areas that merit more work to achieve the potential benefits—either
via new research or incorporation into standard simulation or statistical packages.
Key words: simulation; design of experiments; metamodels; Latin hypercube; sequential bifurcation; robust
design
History: Accepted by W. David Kelton, Editor-in-Chief, acting as Area Editor; received January 2003; revised
March 2004, August 2004, January 2005; accepted January 2005.
setting may lead to disastrous results when implemented). Unfortunately, few simulation practitioners seem to be aware of the additional insights that can be gleaned by effective use of designs.

A second reason may be that DOE research is often found in specialty journals seldom read by simulation analysts. Many results improve efficiency or guard against bias, whereas the bigger picture—namely, the setting for which this class of designs is most appropriate—may not be clear to an audience more familiar with simulation modeling.

The primary reason, in our opinion, is that most designs were originally developed for real-world experimentation and have been subsequently adapted for use in simulation studies, rather than developed specifically for simulation settings. Classic DOE textbooks (e.g., Box et al. 1978, Box and Draper 1987, Montgomery 2000, or Myers and Montgomery 2002) focus not on the needs of simulation analysts, but on the practical constraints and implementation issues when conducting real-world experiments. Comprehensive simulation textbooks (Law and Kelton 2000, Banks et al. 2005) cover a broad range of topics and provide detailed lists of references, but they demonstrate DOE by using it on a few simple test problems that do not stretch the reader's mental framework as to the depth and breadth of insights possible. Since DOE and general simulation books familiarize analysts with only a small subset of potential designs and applications, analysts are likely to force their problems to fit a particular design instead of identifying the design that best meets their needs.

Our goal is to bring together (i) a discussion of issues that analysts should be aware of as they prepare to code, collect, and analyze output from a simulation model, and (ii) a guide for selecting appropriate designs. In particular, we contend that analysts must consider:
• the types of questions;
• characteristics of their simulation setting;
• characteristics of, and constraints imposed on, the simulation data collection and analysis; and
• the need to convey the results effectively.

These issues seem straightforward, but fundamental problems related to designing simulation experiments are all-too-often overlooked. We discuss these more fully later, focusing on the practical benefits of DOE. A design suited to a particular application is much better than trial and error or a simple, small design. Consequently, practitioners should be open to the notion that DOE is a useful and necessary part of the analysis of complex simulations.

This is not a tutorial on the details for implementing specific designs, nor do we present a historical development of DOE and simulation. Instead, we provide an overview of the wide variety of situations that simulation analysts might face, the benefits and drawbacks of various designs in these contexts, and references. We want to change the mindset of simulation analysts and researchers to consider DOE as an integral part of any simulation project.

This overview is based on our experience through contacts with many simulation users and researchers over the last few decades. Where we disagree with current practice or theory, we present both sides. Despite the wide variety of designs that are available in the literature and, in some cases, statistical or simulation packages, we identify some situations where needs are still unmet. Our goal is to motivate more research to address these deficiencies.

In this paper we use concepts and terminology from simulation and statistics. Many readers may be familiar with simulation—and its DOE aspects—at the level of a textbook such as Law and Kelton (2000, p. xix), who state that a "second course in simulation for graduate students" should cover their Chapter 12 on "experimental design, sensitivity analysis, and optimization." For the reader familiar with only some of these ideas, we introduce brief definitions and explanations. For a refresher or an overview of simulation experiments, see Nakayama (2003) or Kelton and Barton (2003).

In §2 we describe how designing simulation experiments differs from designing experiments on real-world systems. Specifically, we address questions that simulation analysts or clients should ask. We also describe a number of other characteristics of simulation settings that cannot easily be handled through the more traditional methods, and we provide examples to motivate the need for designs that cover a wide range of simulation settings. In §3 we discuss some characteristics of designs, including effectiveness. In §4 we describe several classes of designs, and assess their strengths and weaknesses for various simulation settings and their design characteristics. In §5 we describe ways to check the design's assumptions. In §6 we present a small case study, and in §7 we conclude with areas that merit additional work to achieve the potential benefits—either via new research or via incorporation into simulation or statistical software packages. An Online Supplement to this paper on the journal's website provides more details.

Kleijnen et al.: A User's Guide to the Brave New World of Designing Simulation Experiments
INFORMS Journal on Computing 17(3), pp. 263–289, © 2005 INFORMS 265

2. Why Is DOE for Simulation so Different?
We begin with some terminology. An input or a parameter in simulation is referred to as a factor in DOE. A factor can be either qualitative or quantitative. For example, in a queueing simulation, queue discipline can be either LIFO (last in, first out) or FIFO (first in, first out), so this is a qualitative factor. The number of servers is a discrete quantitative factor, while
the rate for an exponential distribution used to model customer inter-arrival times is a continuous quantitative factor. Each factor can be set to two or more values, called factor levels, typically coded numerically for analysis purposes. A scenario or design point is a combination of levels for all factors. We consider stochastic simulations, so replicates mean that different pseudo-random numbers (PRNs) are used to simulate the same scenario. Unless otherwise specified, we assume that replicates use nonoverlapping PRN streams, so outputs across replicates are independently identically distributed (IID)—as most statistical methods assume. The output stream from a single replicate is a time series, which generally has auto-correlated observations; for example, the basic single-server model gives auto-correlated individual waiting times (if one customer must wait a long time, then the next customer is also apt to wait a long time).

Of course, the simulation is itself a model of some real-world (or prospective) system, process, or entity. We can view the simulation code as a black box that implicitly transforms inputs (such as factor-level settings and PRNs) into outputs. A metamodel (or response surface, auxiliary model, emulator, etc.) is a model or approximation of this implicit Input/Output (I/O) function that characterizes the relationship between inputs and outputs in much simpler terms than the full simulation. When a simulation experiment is conducted and most or all of the factors are quantitative, a common metamodeling technique is that of polynomial regression. We assume that the I/O relationship is composed of a deterministic (predictable) component, which is a polynomial function of the input factors, and a stochastic component that captures the (typically additive) error or randomness in the response. In a first-order or main-effects model, the deterministic component takes the form $f(X_1, X_2, \ldots, X_k) = \sum_{i=1}^{k} \beta_i X_i$, where the $X_i$ are the factor-level settings and the $\beta_i$ are estimated from the data. A second-order model could also include the quadratic effects $\sum_{i=1}^{k} \beta_{i,i} X_i^2$ and the two-way interactions $\sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \beta_{i,j} X_i X_j$. Higher-order models might also be defined, but are harder to interpret.

We emphasize the following chicken-and-egg problem. Once a design is specified and simulated, metamodel parameters can be estimated. On the other hand, the types of metamodels that the analyst desires to investigate should guide the selection of an appropriate design.

DOE was developed to generate and analyze data efficiently from real-world experimentation. In simulation, with its advances in computing power, we are not bound by some of the constraints that characterize real-world experiments. This is both an opportunity and a challenge. It is an opportunity to gain much more insight into how systems behave, and so provide assistance and information to decision makers that might differ dramatically (in terms of its quantity and nature) from that obtainable using more traditional methods. It is a challenge because it may require a new mindset. Indeed, we argue that the way simulation experiments should be approached is now fundamentally different from the way that real-world experiments involving, say, human subjects, should be approached.

To illustrate the difference between classic DOE and simulation DOE, consider the classic bias-minimizing designs. For example, Donohue et al. (1993), assuming a first-order metamodel but allowing for possible bias caused by second-order effects, derive designs that minimize that bias. We argue that such designs are relevant in real-world experiments but not in simulation. In the former cases, analysts must often select a design that is executed in "one shot" (say, one growing season in agriculture). In contrast, the data are collected sequentially in most simulation experiments, so analysts may start with a design for a first-order metamodel, then test (validate) the adequacy of that model, then augment their design to one that allows the estimation of second-order effects only if necessary (see, e.g., Sanchez et al. 1998, Kleijnen and Sargent 2000).

2.1. Asking Appropriate Questions
The importance of identifying the "right" problem before constructing a simulation and conducting the analysis is well known. For example, Law and Kelton (2000) state that the first step in a simulation study is to formulate the problem and plan the study. This step includes the project manager stating the problem of interest, and the analysts specifying the overall study objectives, specific questions to be answered, performance measures that will be used to evaluate the efficacy of different system configurations, system configurations to be modeled, and the time and resources required for the study. They go on to say that experimental design, sensitivity analysis, and optimization deal with situations in which there is "less structure in the goal of the simulation study: we may want to find out which of possibly many parameters and structural assumptions have the greatest effect on a performance measure, or which set of model specifications appear to lead to optimal performance." (Law and Kelton 2000, p. 622). We recommend an even broader view since we find that the most common type of question concerns an a priori single specific performance measure (typically a mean) that analysts then try to estimate or optimize. Instead, our starting point is a set of three basic goals that simulation analysts and their clients may have:
• developing a basic understanding of a particular simulation model or system,
• finding robust decisions or policies, and
• comparing the merits of various decisions or policies.

2.1.1. Developing a Basic Understanding. The first goal covers a wide range of questions. We use this phrase rather than "testing hypotheses about factor effects" for the following reason. At one extreme, we may develop a simulation to gain insight into situations where the underlying mechanisms are not well understood, and where real-world data are limited or even nonexistent. At the other extreme, we may perform a detailed analysis of a verified and validated simulation model.

As an example of the first situation, Dr. Alfred Brandstein posed the question "When and how should command and control be centralized or decentralized?" when he was Chief Scientist of the Marine Corps (Brandstein 1999). We do not know enough about the human mind to program a model for how decisions are really made by an individual—let alone a group of people! Yet, ignoring these types of questions because they are "too hard" or "inappropriate" for operations research is unacceptable. Our profession's roots are in finding ways to address difficult, interdisciplinary problems.

Addressing new problems often requires new simulation models. We find that making DOE an integral part of the model development process is useful in several ways. DOE can uncover detailed insight into the model's behavior, cause the modeling team to discuss in detail the implications of various model assumptions, help frame questions when the analysts may not know ahead of time what questions should be asked, challenge or confirm expectations about the direction and relative importance of factor effects, and even uncover problems in the program logic. These situations are seldom described in the literature, particularly as they relate to problems in programming logic or modeling assumptions. To illustrate these benefits, we provide some anecdotal evidence from recent workshops and related research on the use of agent-based models for military decision making (also see Horne and Leonardi 2001; Sanchez and Lucas 2002; Horne and Johnson 2002, 2003; Cioppa et al. 2004). Clearly, the benefits of incorporating DOE into the model-development process also apply to other types of simulation models.

Wan (2002) uncovers details about how a modeling platform behaves when simple terrain features are added. While familiarizing himself with the modeling platform, he sets up a skirmish within a corridor and uses simple experimental designs to generate data. Wan initially expects that "barriers" prohibit movement and provide protection from fire, as in earlier agent-based combat simulations. Instead, he uncovers instances where an enemy agent circles around the corridor and then exchanges fire with agents behind the front lines. Discussion with the software developer confirms that these barriers prohibit movement but not fire, behaving as ditches rather than walls. This is a low-probability event in Wan's eventual scenario and illustrates how DOE can uncover details about model behavior that might not be revealed without a broad-based investigation of factor effects.

Gill and Grieger (2003) run several experiments to examine movement rules in time-step, agent-based modeling platforms. Their results led to a discussion of the implications of how various platforms implement an agent's movement when, e.g., the agent's propensity for moving toward a goal is "twice as high" as its propensity for avoiding enemy contact. This can affect the development of new scenarios by directing modeling efforts toward choosing appropriate weights for these propensities in different contexts. Without this investigation, scenario developers and analysts might all use the same phrase to describe movement behavior, but have different internal views of its meaning.

When analysts may not know ahead of time what questions to ask, DOE can help. For example, analysis of a model of a skirmish involving guerrilla forces attacking a conventional force reveals that the most important determinants of losses on both sides are factors associated with the guerrillas' stealth and mobility (Lucas et al. 2003). The scenario was initially set up to explore the merits of conventional force tactics, movement, and squad strength, but the findings suggest that the defenders may gain more in terms of survivability and lethality by improving their ability to detect terrorists than by increasing their firepower.

Confirming prior expectations can be an important step in establishing face validity for simulation models, but it is also informative when the simulation provides insights that do not match expectations. In early investigations of a model of maneuver through an urban environment where experimentation was limited to investigations of five factors at a time (Lucas et al. 2002), the factors believed by a small group of subject-matter experts to be the most important turn out to have no statistically significant impact. Subsequent experiments provide new insights, indicating that the commanders' propensities to maneuver toward friendly agents and away from enemy agents are critical and that losses are reduced when the commander balances his drive toward the goal with the avoidance of enemy contact. An interaction term reveals that a strong bond between the commander and subordinates can mitigate the negative impacts of so-called friction on the battlefield: even if the subordinate agents cannot hear, comprehend, or otherwise act on their commander's orders, their losses are reduced if they stay with him.
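Main effects and interaction terms like those described above are typically estimated by regressing the simulated responses on coded factor levels and their cross-products. The following sketch is purely illustrative and not from any of the studies cited: the function `run_simulation` and its coefficients are invented stand-ins for a real stochastic simulation. A $2^3$ full factorial design with replicated runs (each replicate drawing fresh PRNs) supports least-squares estimates of all main effects and two-way interactions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

def run_simulation(x1, x2, x3):
    # Hypothetical stand-in for one replicate of a stochastic simulation:
    # a known response surface plus additive noise from a fresh PRN draw.
    return 10.0 + 2.0 * x1 - 3.0 * x2 + 1.5 * x1 * x2 + rng.normal(scale=0.5)

# 2^3 full factorial design in coded units (-1 = low level, +1 = high level).
design = list(product((-1, 1), repeat=3))

rows, y = [], []
for x1, x2, x3 in design:
    for _ in range(5):  # five replicates per design point
        rows.append([1.0, x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])
        y.append(run_simulation(x1, x2, x3))

# Least-squares estimates of the intercept, main effects, and interactions;
# they recover the true coefficients to within sampling error.
X = np.array(rows)
beta, *_ = np.linalg.lstsq(X, np.array(y), rcond=None)
for name, b in zip(("const", "x1", "x2", "x3", "x1:x2", "x1:x3", "x2:x3"), beta):
    print(f"{name:6s} {b:+.2f}")
```

Because the coded ±1 columns of the design matrix are orthogonal, each effect estimate is uncorrelated with the others, which is one reason factorial designs are preferred over varying one factor at a time.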
Another major benefit of integrating DOE into the model-development process is the ability to uncover problems in the program logic. For example, Wolf (2002, p. 37) mentions anomalous results that, after investigation, are attributed to a modeling artifact. A simple experiment reveals that movement speeds do not behave as the analysts expect since increasing the factor setting from 0 to 1000 (corresponding to speeds of 0 to 10 squares per time step) does not monotonically increase speeds. For example, if the movement setting is 110, then the agent moves two steps 10% of the time but remains in place 90% of the time, for an average speed of only 0.1 squares per time step. Identification of this problem led to its modification in subsequent versions of the software, yet (Wolf 2002) it makes one wonder how often similar problems go undetected in models of complex scenarios.

The benefits of using experimental design during model development are likely to become even more substantial. If the ultimate decision maker needs rapid turnaround on the model development and analysis, this mandates using a modeling platform or reusing previously developed code to put together a model quickly. When code or modeling platforms are used or combined in new ways, some details of the modeling logic may be either poorly documented, or poorly understood by the user.

In exploratory environments like those above, it does not make sense to use the models to estimate factor effects numerically—we seek tendencies rather than values. At the other extreme, suppose we have a model that we are comfortable using for prediction. Then "understanding the system" may result from performing a detailed sensitivity analysis of a particular system configuration (i.e., examining the impact of small departures from this configuration). How should we proceed? Searching for effects by varying factors one at a time is an ineffective means of gaining understanding for all but the simplest systems. First, when using this approach it is impossible to identify any interaction effects between two or more factors, where positive interactions imply that factors complement each other, and negative interactions imply that factors are partial substitutes for each other. Second, even when interaction effects are negligible so one-factor-at-a-time sampling provides valid insights into I/O relationships, this can be proven to be an inefficient way to estimate the factor effects. From the outset, the analysts must explore factor effects concurrently to understand how their simulation model behaves when its factors are changed.

Between these two extremes of exploratory investigations and prediction generation are situations where we wish to identify a short list of important factors from the long initial list of potential factors. Depending on context, this might lead to a more thorough investigation of this short list via subsequent simulation experiments, a decision to forego adding enhancements or greater detail to aspects of the model that are found to be unimportant (at least over predetermined factor-level ranges), or collection of (additional) real-world data to home in on appropriate values of (say) influential input distributions. Alternatively, simply identifying the most influential factors (and their directional effects on performance) may suffice. Of course, factor importance depends on the experimental domain (or experimental frame, as Zeigler et al. 2000 call it). For example, oxygen supply is important for missions high in the sky and deep under water, but not on land at sea level. So the clients must supply information on the intended use of the simulation, including realistic ranges of the individual factors and limits on the admissible scenarios. This includes realistic combinations of factor values; for example, some factor values must sum to 100%.

2.1.2. Finding Robust Decisions or Policies. We discuss robust policies, rather than optimal policies, for a reason. It is certainly true that finding the optimal policy for a simulated system is a hot topic, and many methods have been proposed. These methods include heuristic search techniques that treat the simulation model as a black box—such as genetic algorithms, response surface methodology (RSM), simulated annealing, and tabu search—and methods that analyze the simulation model to estimate gradients—such as perturbation analysis and score functions. The latter two techniques can be used in an optimization algorithm such as stochastic approximation. Fu (2002) and Spall (2003) discuss the current research and practice of optimization for simulation. Unfortunately, all these methods implicitly condition on a large number of events or environmental factors. In practice, the future environment is uncertain, so this so-called optimal policy cannot be achieved and may break down completely. Therefore, we wish to find a robust policy, that is, one that works well across a broad range of scenarios. Such policies have also been called "satisficing" (Simon 1981).

To illustrate this problem with classic optimization, consider using simulation to explore different layouts for a small factory. The project manager's decision factors relate to the type, number, position, and buffers associated with machines on the factory floor, as well as schemes for prioritizing or expediting orders. This is a prototypical problem often analyzed using a simulation optimization method, but the result of the "optimization" is conditioned on assumptions of specific (typically assumed independent) distributions for order arrivals, order sizes, machine uptimes, downtimes, and service times, and many more input variables. We argue that using the
term “optimum” is problematic when the probabil- lead to big changes in the output). If sensitivity anal-
ity of all these assumptions holding in practice—even ysis around the so-called optimal solution indicates
for a limited time—is effectively zero. Suggesting dif- that it still performs well (in an absolute sense) when
ferent possible “optimal” layouts for several potential realistic departures from these assumptions occur,
customer order patterns may be singularly unhelpful then the optimization algorithm has identified a solu-
since the decision maker cannot control future orders tion that is likely to perform well in practice. If
and may not even have good forecasts. changes in the environment (e.g., new patterns of cus-
In contrast, a robust design approach treats all these tomer orders) affect all potential solutions similarly,
assumptions as additional factors when running the then the relative merit of particular policies does not
experiment. These are considered noise factors (rather change. If factor settings associated with good mean
than decision factors) because they are unknown or responses are also associated with low response vari-
uncontrollable in the real-world environment. A ro- ances, then the optimal solution in terms of mean per-
bust system or policy works well across a range of formance will also be robust. In a recent case study,
noise conditions that might be experienced, so imple- Kleijnen et al. (2003) derive a solution that minimizes
menting a robust solution is much less likely to result both the expected value and the variance of the out-
in unanticipated results. For example, Sanchez et al. put of a supply chain simulation. That solution is con-
(1996) use a multistage sequential approach to eval- trolled by the decision factors; the mean and variance
are computed across several environmental scenarios.
uate factory layouts and order-dispatching rules in
Nonetheless, there are situations where optimizing
a job shop when the mix of demand for different
and then performing sensitivity analysis can lead to
products is unknown. They find that the “best” fac-
fundamentally different answers. For example, a mil-
tory setup uses neither the commonly used economic
itary problem of current interest is finding a good
order quantity nor the just-in-time dispatching rule
strategy for defending a high-value target (court-
for batching orders. Their robust solution yields a 34%
house, church, or monument) against a single terror-
smaller loss (in expected squared deviation from the ist. If the analysts condition on the route the terrorist
target total time in system) than a solution that opti- will take approaching the building, then forces will be
mizes the mean time in the system, and a 10%–22% concentrated along this path. Conversely, if the direc-
improvement over solutions that minimize the vari- tion of approach is unknown, then an entirely differ-
ance of the time in the system; the robust solution ent strategy (dispersing the protective forces) is much
uses no more (and usually fewer) total machines, more effective.
indicating potential savings in capital expenditures.
Another example is Kleijnen and Gaury (2003) who 2.1.3. Comparing Decisions or Policies. We
assume a base production planning scenario and com- avoid the phrase “making predictions about the
performance of various decisions or policies.” Com-
pare several solutions to identify which solution is
parisons may need to be made across a number of
least sensitive to changes in the environment.
dimensions. Rather than formal statistical methods
This robust design philosophy is inspired by Taguchi
for testing particular factor effects or estimating a
(1987), who uses simple designs to identify robust
specific performance measure, our goal might be to
product configurations for Toyota. The results im-
provide the decision maker with detailed descrip-
prove quality while lowering the cost of automobiles
tive information. For example, we could present the
and component systems because the chosen product means, variances, percentiles, and any unusual obser-
designs perform well—despite variations in incom- vations (see the box plots in Law and Kelton 2000)
ing raw material properties, the manufacturing pro- for the distribution functions of the estimators of sev-
cess, and customers’ environments. Sanchez (2000) eral performance measures, for each of the systems of
discusses robust design for simulation experiments. interest. These measures can then be reported, along
Metamodels can suggest scenarios (i.e., new combina- with implementation costs and other considerations
tions of factor levels) not yet investigated, although not included in the model.
the analyst should make confirmatory runs before If at least some of the factors are quantitative, and
applying the results. if a performance measure can be clearly stated, then
We do not mean to imply that an optimiza- metamodels can describe the I/O relationships via
tion approach will necessarily yield a bad answer. functions of various factor levels. Here, rather than
An analyst can perform sensitivity analysis on any running an experiment to gain insight into how the
particular solution, either formally (e.g., applying performance is affected by all the factors, we may
DOE techniques or invoking mathematical arguments focus on a few of immediate interest to the decision
on the nature of the response surface) or informally maker.
(e.g., performing some trial-and-error investigations If the analysts wish to compare a fixed small num-
to determine whether small changes in the scenario ber of statistical populations (representing policies or
Kleijnen et al.: A User’s Guide to the Brave New World of Designing Simulation Experiments
INFORMS Journal on Computing 17(3), pp. 263–289, © 2005 INFORMS 269
scenarios), ranking and selection procedures (R&S), multiple comparison procedures (MCPs), and multiple ranking procedures (MRP) can be used. There are two basic approaches: (i) how to select, with high probability, the system, decision, or policy that is—for practical purposes—the best of the potential choices; and (ii) how to screen the potential systems, decisions, or policies to obtain a (random-size) subset of "good" ones. Many procedures have been developed specifically to address some of the characteristics of simulation experiments we discuss in §3 (Chick and Inoue 2001, Hsu 1996, Goldsman et al. 2002, Nelson and Goldsman 2001). Some assume that all populations are compared with each other, whereas others assume comparisons with a standard.

2.1.4. Summary. The three types of questions we pose differ from those typically suggested in the literature. Sacks et al. (1989) classify problems for simulation analysts as prediction, calibration, and optimization. Kleijnen (1998) distinguishes among global (not local) sensitivity analysis, optimization, and validation of simulation models. (In global sensitivity analysis the simulation inputs vary over the whole experimental area, rather than infinitesimally.) These two classifications are related to the ones we use—for example, global sensitivity analysis can be used as a way of gaining understanding about a problem—but there is not a one-to-one mapping. For certain classes of simulations, such as military operations or hazardous waste disposal, data are extremely limited or nonexistent, so calibrating, optimizing, predicting, and validating may be meaningless goals.

In the best tradition of scientific discovery, simulation experiments can, nonetheless, have a role in supporting the development of insights (or theories) in these situations. For example, Helton and Marietta (2000) discuss how to assess the performance of a nuclear waste plant in New Mexico in the next 10,000 years. Obviously, such a model is hard to validate due to a dearth of data and changing conditions. Nevertheless, extensive sensitivity analyses convinced the Environmental Protection Agency of the validity of this model, so it granted a permit to build and exploit this plant. Dewar et al. (1996) also discuss how one can credibly use models that cannot be validated in the simulation of future military conflicts—an area of almost unimaginable complexity. Despite the tremendous amount of uncertainty about potential future conflicts, decisions must be made (such as what equipment to purchase, how to organize units, and how to use future forces) that will affect large sums of money and affect many lives. Since simulations of potential future force-on-force cannot be validated (Hodges 1991), these simulations are used to assist decision makers in gaining insights into extremely complex systems and processes. For example, if the veracity of a given simulation model cannot be ascertained, but the simulation is known to favor one system over another, this knowledge can sometimes be used to make a strong decision. Suppose a warfare simulation is known to be biased in favor of the red side over the blue side (say, the simulation assumes the red force members are all 10 feet tall), yet the blue side always wins in the simulation. Then we can be confident that the blue side will win such a conflict in the real world. This type of reasoning is called an a fortiori argument (Hodges and Dewar 1992). Note that if the red side wins in the simulation, we do not know whether this result occurs because the red side is indeed better or because the simulation is biased. An unvalidated simulation model can also be used to generate plausible outcomes if they are consistent with all available information deemed salient. One can easily construct a case in which an action would be avoided if a simulation suggests that a potentially catastrophic outcome is plausible. More typically, high-dimensional explorations of unvalidated models are used to help devise new ideas (i.e., tools for brainstorming) or to trace the consequences of assumptions over a variety of conditions.

The situations above contrast sharply with many simulation experiments in the literature that often assume a thoroughly validated and verified simulation model exists, and that the decision makers have very specific questions about, e.g., the impact on a particular performance measure resulting from changing a small number of factors to specified (new) values. The users might hypothesize the nature and strength of a particular factor effect, and the analysts' charge is to run the simulation model and collect I/O data to test this hypothesis.

2.2. The Simulation Setting
We now describe characteristics of simulation settings that call for nontraditional designs, drawing on recent practical examples from industrial and military applications for motivation.

2.2.1. Number of Potential Factors. In real-world experiments, only a small number of factors are typically varied. Indeed, it is impractical or impossible to attempt to control more than, say, 10 factors; many published experiments deal with fewer than 5. Academic simulations, such as single-server queueing models, are also severely limited in terms of the number of potential input factors. In contrast, a multitude of potential factors exists for simulation models used in practice. For example, the Map Aware Non-uniform Automata (MANA) software platform was developed to facilitate construction of simple agent-based models (Lauren and Stephen 2002). The agents' rules for movement are a function of a "personality" or propensities to move based on 10 competing goals. In all, over 20 factors can be modified for each agent for each of 49 personality states, so we are dealing with thousands of factors. Yet this is considered a "simple" modeling platform! Other examples abound. Bettonvil and Kleijnen (1997) describe an ecological case study involving 281 factors. Cioppa (2002) examines 22 factors in an investigation of peace-enforcement operations. Even simple queueing systems can be viewed as having a few dozen factors if the analysts consider arrival rates and distributions that change over time, service distributions, and correlations arising when service times decrease or servers are added as long lines of customers build up.

Good programming avoids fixing the factors at specific numerical values within the code; instead, the code reads factor values so the program can be run for many combinations of values. Of course, the code should check whether these values are admissible; that is, do these combinations fall within the experimental domain? Such a practice can automatically provide a list of potential factors. Next, users should confirm whether they indeed wish to experiment with all these factors or whether they wish to fix some factors at nominal (or base) levels a priori. This type of coding helps unfreeze the mindset of users who would otherwise be inclined to focus on only a few factors.

2.2.2. Choice of Performance Measures. Consider both the type and the number of performance measures. Some problems require only relative answers, i.e., whether one policy is better than another. For example, in a study on the search for sea mines, users wanted to know which tilt angle of the sonar is better; see Kleijnen (1995). Conversely, some problems require absolute answers. For example, in the same case study, users wanted to know whether the probability of mine detection exceeds a certain threshold before deciding whether to do a mine sweep at all.

Most procedures (e.g., R&S, MCPs, MRP, and RSM) involve a single quantitative performance measure; the goal is typically to maximize or minimize the expected value of a single simulation output. However, in many simulation applications, it is unrealistic to assume a single measure characterizes the system performance. For example, textbook examples of simple queueing systems often discuss minimizing the average waiting time. In practice, other choices include minimizing the proportion of customers who wait more than a specified time, maximizing the number served within a particular time, improving customer satisfaction by providing information about projected wait time and allowing customers to reschedule, minimizing the errors in processing customer transactions, and balancing workloads across servers. Other examples are the various performance measures in supply-chain management; see Kleijnen and Smits (2003). Consequently, it is restrictive to use a DOE framework suggesting that the appropriate goal of the study should be examining the expected value of a single performance measure.

Taguchi's (1987) robust design approach offers another way to proceed in the case of multiple performance measures. If responses are converted to losses and appropriately scaled, then analysts can construct models of overall expected loss. We prefer to construct separate metamodels for each performance characteristic because it makes it easier to identify why certain scenarios exhibit more or less desirable performance than others.

A few researchers use a mathematical-programming framework to analyze multiple simulation outputs, i.e., one output is minimized whereas the remaining outputs should satisfy prefixed constraints (Angün et al. 2002). For example, inventory may be minimized while the service percentage meets a prespecified level.

2.2.3. Response-Surface Complexity. Assumptions about the metamodel's complexity are generally broken down into assumptions regarding its deterministic and its stochastic components and often drive the analysis. The standard assumptions in DOE are that the deterministic component can be fit by a polynomial model of the factor levels (perhaps after suitable transformations of the factors or responses) and that the stochastic component can be characterized as additive white noise. The latter assumption means that the residuals of the metamodel are normally distributed and IID. In practice, normality may be explained by the central limit theorem, but the IID assumption is violated when the noise has larger variances in subspaces of the experimental area. This is known as variance heterogeneity or heteroscedasticity and is pervasive in simulation. For example, in queueing problems the intrinsic noise increases dramatically as the traffic load approaches 100% (Cheng and Kleijnen 1999, Kleijnen et al. 2000). Moreover, common random numbers (CRNs) are often used for generating output from several simulation scenarios since they can sharpen the comparison among systems. Unfortunately, CRNs violate the independence assumption.

Good modeling practice means that the analyst should strive to find the simplest metamodel that captures the essential characteristics of the system (Occam's razor). Therefore, we need a suite of design tools, some appropriate for simple response surfaces and others for more complex systems. Simpler metamodels are often easier to justify when only a small number of factors and performance measures are
examined, yet interpreting the results may be problematic because the analyst may easily miss important system characteristics. In §4, we describe how some designs allow assessment of the suitability of the estimated metamodel. In principle, we prefer to classify factors into four categories: (i) factors thought to be very important, (ii) factors that might be important, (iii) factors that are thought to be unimportant but are sampled anyway, and (iv) factors that we are quite comfortable in ignoring. Designs that sample differently across these classifications make intuitive sense.

It is increasingly apparent that some systems exhibit highly nonlinear behavior. A system's response surface may be characterized by localized regions where the response differs sharply from the surrounding area (spikes). It may contain thresholds (large, smooth contours in the factor space) where the response is discontinuous, so if the threshold is crossed the response steps up or down. It may contain regions over which the response is chaotic, i.e., extremely sensitive to tiny changes in the input-factor settings so that the output appears impossible to predict. For example, Vinyard and Lucas (2002) make billions of runs and find that chaotic behavior is rampant across many performance measures in a simple deterministic model of combat. Designs that examine only a small number of scenarios are unable to reveal such behavior; instead, the analysts may believe they are facing a simulation model with a large stochastic component.

2.2.4. Steady-State vs. Terminating Simulations. Terminating simulations run until a specific event has occurred; for example, we might simulate a single day's operation of a retail establishment. Steady-state simulations have no natural termination point, so they can keep generating data for their analysis. The simulation type has implications for the design and analysis. For terminating simulations, it may be necessary to censor results if we are simulating rare events; see Kleijnen et al. (2001). For steady-state simulations, the initial conditions are often chosen for convenience rather than relevance, e.g., a simulation of a computer network may start with all servers and relay nodes operational and no demands on the system. Here, the simulation output of the warm-up period biases the estimated response. The length of the warm-up period affects the total time required for experimentation.

2.2.5. Inclusion of Simulation-Specific Factors. Analysts have control over many things during the course of a simulation study (in addition to the factor levels they manipulate and the performance measures they collect). This control includes the maximum run time for terminating simulations. For steady-state simulations this control includes specifying the warm-up period and run length(s), as well as how (if at all) the time-series output is averaged or aggregated into batches. The choice of the number of batches and batch sizes is an important topic of research in itself (e.g., Schmeiser 1982, Steiger et al. 2005, Alexopoulos and Goldsman 2004), and an implicit assumption in many simulation-analysis techniques is that appropriate batch sizes and warm-up periods are used. Other simulation-specific factors that can be controlled include the use of CRNs to facilitate comparisons across alternatives. For example, all potential factory layouts can be subjected to the same pattern of customer orders. Other variance-reduction techniques (VRTs), such as control variates and importance sampling, have been developed for simulation output (see Law and Kelton 2000). Unfortunately, not all designs can easily accommodate these VRTs.

2.3. External Concerns and Constraints
We now discuss issues that often play a major role in the implementation of simulation experiments, although they are generally not discussed in the literature.

2.3.1. Sequential vs. One-shot Data Collection. In real-world experiments, the basic mindset is often that data should be taken simultaneously unless the design is specifically identified as a sequential design. When samples must be taken sequentially, the experiment is viewed as prone to validity problems. Analysts must therefore randomize the order of sampling to guard against time-related changes in the experimental environment (such as temperature, humidity, consumer confidence, and learning effects) and perform appropriate statistical tests to determine whether the results have been contaminated.

Most simulation experiments are implemented sequentially even if they are not formally analyzed that way. If a small number of design points are explored, this implementation may involve the analysts manually changing factor levels. An approach less prone to data-entry errors involves automatically generating an input file, or series of input files, once a particular design has been chosen. These files may be executed sequentially (and efficiently) in batch mode. Modifying simulations to run in parallel over different computers is possible but not typical. For example, parallelization is being used effectively at the supercomputing clusters of the Maui High Performance Computing Center ([Link] and the Mitre Corporation in Woodbridge, Virginia ([Link] In many cases, parallelization results from manually starting different runs (or sets of runs) on a few computers to cut down on the overall time to complete the data collection. For example, Vonk Noordegraaf et al. (2003) use five PCs to finish their 64 scenarios, each scenario
replicated twice, in two weeks. Freely available software, such as that used on literally thousands of PCs as part of the search for extraterrestrial intelligence ([Link] could be used to facilitate parallel data collection for simulation experiments, but this is not yet readily available to simulation analysts in the industrial or academic settings with which we are familiar.

2.3.2. Premature Termination of the Experiment. When a simulation takes a nontrivial amount of time to run, analysts may have to terminate the experiment prematurely because the computer breaks down, the client gets impatient, preliminary results are needed, etc. This premature termination occurs in many defense simulation projects. It is then better that the analyst organize the list of scenarios so that the output can provide useful information, even if curtailed. For example, consider a simulation where a single input factor (taking the value 1 or 2) defines two systems the decision maker wishes to compare, each run takes one day of CPU time, the design specifies 30 replications of each system, and a single computer is available (so it will take two months to complete the experiment). If all 30 runs for system 1 are conducted before beginning the runs for system 2, it will be impossible to say anything about the relative merits of the systems until the end of day 31. In contrast, an alternating sequence of runs allows preliminary comparisons as early as the end of day 2, and half the data on each system is available by the end of day 30. According to DOE theory the scenarios could be run in any order, but the latter approach is clearly preferable if preliminary results might be requested or the experiment might be terminated early. This idea also applies when runs are conducted on multiple machines or there are multiple input factors, each with multiple levels, provided a long time is needed to complete the experiment.

With this view, even nonsequential designs can be implemented sequentially in ways that are robust to early termination. Some single-stage designs can be viewed as augmentations of simpler designs, so there is a natural way to separate the designs into two or more parts. Clearly, sequential or partially sequential designs have this characteristic: after one stage of sampling the analysts indicate which configuration(s) should be examined next.

2.3.3. Data Collection Effort. The increase in computer speeds has caused some analysts to add more details to their simulation models. We believe it should spur us to ask more from our simulation models.

The traditional concept of a fixed sampling budget (in terms of the number of runs) is unnecessarily restrictive. Even if a single computer is used, the time per run is seldom fixed. Different analysts might use different run lengths and batch sizes. Run times might vary across scenarios because some tend to yield fewer events or have different warm-up periods in steady-state simulations, or lead to early termination for non-steady-state simulations. Implementing a design may be very easy if software is available to generate coded factor levels, next convert them to original factor levels, and then generate input files so the simulation can be run in batch mode. Conversely, if the analysts must edit and recompile the code for each scenario, or make all changes manually through a graphical user interface, implementation time can surpass the run time.

Another way of describing this data-collection effort is by the time required to estimate the metamodel parameters to a certain level of precision. Unfortunately, it is difficult to make generic recommendations using this approach since the time depends on the underlying (heterogeneous) variability. We have recently seen run time vary from less than a second to over 100 hours per scenario on a single processor.

A related issue is the trade-off between the number of design points and the number of replicates per design point. Suppose the total computer time is the same for the two options: one with many replicates per design point, and another with more design points and fewer replicates. The first option enables explicit estimation of response variances that can vary across scenarios. If the primary goal of the study is finding robust systems or policies, then some replication at every design point is essential. If the goal is understanding the system, this may also include understanding the variance, again mandating replication. However, if the goal is that of understanding or comparing systems and a constant variance can be assumed, then this constant can be estimated using classic ordinary least squares regression, provided no CRNs are used and the metamodel is correctly specified. Replication is then of less concern and the second option (exploring more scenarios) can be a better way to spend scarce computer time. Note that a single replicate yields an unbiased estimator of the response of a specific scenario. For example, consider a terminating simulation of a bank that closes at 5:00 p.m. The observed maximum queue length during a single day is an unbiased estimator of the expected daily maximum. Of course, simulating more days provides a more precise estimate based on the observed maximum averaged over all simulated days, though it does not change the fact that different scenarios may result in substantially different variances in the daily maximum queue length.
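The bank example lends itself to a small illustration. The sketch below is ours, not part of any study cited here: it simulates one day of a hypothetical single-server bank with Markovian arrivals and services, takes the daily maximum number of customers present as the response, and replicates two arrival-rate scenarios. The rates, the 480-minute day, the replication count, and all function names are assumptions made for the example.

```python
import random
from statistics import mean, pvariance

def max_present_one_day(lam, mu, close_time=480.0, rng=None):
    """One replicate: event-driven single-server FIFO queue for one day.

    Customers arriving after close_time are turned away; those already
    inside are served. Returns the maximum number present at once.
    """
    rng = rng or random.Random()
    t, n, max_n = 0.0, 0, 0
    next_arrival = rng.expovariate(lam)
    next_departure = float("inf")
    while next_arrival < close_time or n > 0:
        if next_arrival < close_time and next_arrival <= next_departure:
            t = next_arrival                 # arrival event
            n += 1
            max_n = max(max_n, n)
            if n == 1:                       # server was idle: start service
                next_departure = t + rng.expovariate(mu)
            next_arrival = t + rng.expovariate(lam)
        else:
            t = next_departure               # departure event
            n -= 1
            next_departure = (t + rng.expovariate(mu)) if n > 0 else float("inf")
    return max_n

def replicate(lam, mu, reps=30, seed=0):
    """Independent replicates (simulated days) of one scenario."""
    rng = random.Random(seed)                # fixed seed for reproducibility
    return [max_present_one_day(lam, mu, rng=rng) for _ in range(reps)]

if __name__ == "__main__":
    for lam in (0.5, 0.9):                   # light vs. heavy traffic, mu = 1
        days = replicate(lam, mu=1.0)
        print(f"arrival rate {lam}: mean daily max = {mean(days):.1f}, "
              f"variance = {pvariance(days):.1f}")
```

A single simulated day already gives an unbiased estimate for each scenario, but printing the across-day variances makes the heteroscedasticity discussed in §2.2.3 visible: the heavy-traffic scenario typically shows a much larger variance, so it would merit more replicates.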
2.4. Conveying Results Effectively
The best experiment will come to naught if the results are not communicated properly to the decision maker. We refer back to the three primary goals (developing a basic understanding, identifying robust solutions, and comparing systems). For the first goal, a good analogy is exploratory data analysis. Graphical tools that allow multidimensional visualization of the results may be much more helpful than equations or tables. Useful tools include three-dimensional rotatable plots, contour plots, and trellis plots (Sanchez and Lucas 2002). Regression trees and Bayesian networks have also been effective ways of communicating which factors are most influential on the performance measures (Gentle 2002, Martinez and Martinez 2002). Yet, visualizing simulation results remains a challenge at this stage of simulation experimentation. Tufte (1990) is the seminal reference for excellence in graphical presentation; Meyer and Johnson (2001) describe tools developed specifically for visually exploring large amounts of data from simulation experiments with multiple performance measures.

3. Criteria for Evaluating Designs
Once analysts know their situation, the question is: now what? Above we stated that there is no single prototypical situation (in terms of the type of question to be asked, or simulation characteristics) that analysts might face. In this light, it is not surprising that we cannot recommend a specific design. How, then, should analysts choose an appropriate design? While we do not have all the answers, we do attempt to provide some guidance.

In what follows, we use the term design to denote a matrix where the columns correspond to the input factors, the entries correspond to (possibly coded) levels for these factors, and each row represents a particular combination of factor levels, also called a design point. More detail on construction and use of these designs is in this paper's Online Supplement.

Others have listed desirable attributes for designs for experiments with real systems (Box and Draper 1987, Myers and Montgomery 2002). We describe criteria to evaluate designs in simulation settings, and we discuss how they may (or may not) apply directly to the issues described earlier. We will use these criteria in deciding which designs to recommend in §4.

3.1. Number of Scenarios
In the literature, a major design attribute is the number of scenarios required to enable estimation of metamodel parameters. A design is called saturated if its number of factor combinations (say) n equals the number of metamodel parameters, q. For example, if the metamodel is a first-order polynomial in k factors, then q = k + 1 (where 1 refers to the grand or overall mean, often denoted by β0); a saturated design means n = k + 1. Actually, there are several saturated designs for a given metamodel type. For the first-order polynomial in k factors, one saturated design changes one factor at a time, whereas another design is a fractional factorial (see §4 or Box et al. 1978). To choose among different designs, we also consider the following quality attributes.

3.2. Orthogonality
A design is said to be orthogonal if the columns of the design matrix are orthogonal (i.e., the inner product of any two columns is zero). Orthogonality has long been a desirable criterion for evaluating designs. It simplifies computations. Since the input factors are uncorrelated, it is easier to determine whether to include them in a metamodel (e.g., using regression) and to separate their contributions to the overall metamodel fit. This in turn simplifies interpretation of the results. (Lack of orthogonality, also called multicollinearity, implies that the effect estimates have very large standard errors or cannot be computed at all.) Unfortunately, requiring orthogonality also has limitations. In reality, some factor-level combinations may not be permissible. For example, in an M/M/1 queue the expected steady-state waiting time is infinite if the arrival rate exceeds the service rate. A complicated application (simulating part of the Rotterdam harbor) with exploding waiting times for the original orthogonal design appears in Kleijnen et al. (1979). In general, forcing orthogonal designs may mean limiting many factors to narrower ranges, or figuring out a way to deal with unstable results for certain scenarios. Unfortunately, in complex models it may not be possible to know a priori which factor-level combinations are problematic.

A design may be orthogonal in the coded factor values (such as −1 and +1) but not in the original factor values. Simulation analysts should be aware of possible scaling effects. Coding all the factor levels can facilitate identification of the most important factors (Box et al. 1978, Bettonvil and Kleijnen 1990).

3.3. Efficiency
The design determines the standard errors for the estimated metamodel parameters. The DOE literature uses several criteria (see Kleijnen 1987, p. 335). For example, A-optimality means that the sum of the variances of the estimated parameters is minimal. D-optimality considers the whole covariance matrix of the estimated parameters (not only the main diagonal) and means that the determinant of this matrix is minimal. G-optimality considers the mean squared error of the output predicted through the metamodel (Koehler and Owen 1996).

The criteria above certainly can be (and have been) used to evaluate designs proposed for analyzing
simulation experiments. Unfortunately, these criteria or spheres. In simulation experiments, restricting fac-
require strong a priori assumptions on the metamod- tor values to realistic combinations may complicate
els to be fit to the data and the nature of the response the design process dramatically. This is an area seri-
(e.g., variance homogeneity). These assumptions are ously in need of more research. Sanchez et al. (2001)
usually violated in simulation. Consequently, these propose elliptical designs, motivated by observational
criteria are of little value when there is substantial economic data. In many queueing situations, certain
uncertainty a priori on the nature of the simulation’s combinations of factor settings give unstable out-
output. Moreover, focusing on minimizing the num- puts (Kleijnen et al. 1979, Sanchez et al. 2005). Until
ber of design points (or maximizing the efficiency for designs that can handle such situations are available,
a fixed number of design points) may not be enough visual presentation of the results—and exploratory
to insure “efficient” data collection, at least for steady- data analysis—may be the most appropriate ways of
state simulations where it does not make much sense determining whether these situations exist.
to worry about using the most efficient design if one
does not also worry about using the smallest run 3.6. Ease of Design Construction and Analysis
length to achieve the desired goal. In short, efficiency Designs should be easy to construct if they are to be
is most critical when the runs are time consuming. Other criteria become more relevant when we are able to gather plenty of data quickly.

3.4. Space Filling and Bias Protection
Conceptually, space-filling designs sample not only at the edges of the hypercube that defines the experimental area, but also in the interior. A design with good space-filling properties means that analysts need not make many assumptions about the nature of the response surface. Space-filling designs currently provide the best way of exploring surfaces where we do not expect to have smooth metamodels. They are particularly useful for fitting nonparametric models, such as locally weighted regressions. These designs, especially Latin hypercube sampling (LHS), have been applied when fitting Kriging models (see §4) and neural networks (Alam et al. 2004). Detection of thresholds is discussed by Watson and Barnes (1995), who propose a sequential design procedure.

Space-filling designs also provide flexibility when estimating a large number of linear and nonlinear effects, as well as interactions, and so provide general bias protection when fitting metamodels of specific forms. Other designs do not have good space-filling properties but still protect against specific violations of model-complexity assumptions. These include the designs of resolution 3, 4, and 5 below. We also refer to Sasena et al. (2002) and Kleijnen and van Beers (2004), who develop customized (but not space-filling) designs where sequentially selected scenarios are driven by the specific simulation application at hand.

3.5. Ability to Handle Constraints on Factor-Level Combinations
In some situations (for example, chemical experiments) factor values must add up to 100%. The classic DOE literature presents mixture designs for these situations (Montgomery 2000). Many designs exist for exploring experimental regions (i.e., permissible combinations of design points) that are either hypercubes […]

[…] used in practice. Nonetheless, some designs are useful even if difficult to generate, so we do not rule out the use of tabulated designs, particularly if they are incorporated into software packages. The major statistical software packages include some experimental-design generation methods. Ideally, design software should be readily available for many platforms. One example is WebDOE, which helps users to design their experiments with deterministic simulation models, offering a library of classical designs through an easy-to-use Web interface (Crary Group 2004).

The analysis is also easy if software is available for many platforms. Regression software is abundant, so the most common analysis tool is readily available and need not be discussed further. Newer surface-fitting methods are also available, including Kriging, neural nets, radial basis functions, splines, support-vector regression, and wavelets; see Clarke et al. (2005) and Antoniadis and Pham (1998). These are metamodel-construction methods that can be applied to data collected using a variety of experimental designs and may do a better job fitting certain complex response surfaces. Because we have some experience with Kriging, which has established a track record in deterministic simulation—and this metamodel is not yet much applied in random simulation—we briefly explain the basic approach (see van Beers and Kleijnen 2003, 2004).

Kriging is named after the South African mining engineer D. G. Krige, who developed his technique while searching for gold. It is an interpolation method that predicts unknown values of a random function or random process; see the classic Kriging textbook of Cressie (1993) or the excellent text by Santner et al. (2003). More precisely, a Kriging prediction is a weighted linear combination of all output values already observed. These weights depend on the distances between the input for which the output is to be predicted and the inputs already simulated. Kriging assumes that the closer the input scenarios are, the more positively correlated the outputs are. This is modeled through the correlogram or the related variogram.
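The weighted-prediction idea just described can be made concrete with a small sketch. This code is our illustration, not from the paper: it assumes a one-dimensional input, a Gaussian correlogram exp(−θd²) with an arbitrary θ, and noise-free (deterministic) outputs, so the predictor interpolates the observed values exactly.

```python
import numpy as np

def kriging_predict(x_new, x_obs, y_obs, theta=1.0):
    """Illustrative Kriging-style interpolation sketch.

    The prediction is a weighted linear combination of all observed
    outputs; the weights come from a Gaussian correlogram, so inputs
    closer to x_new receive more weight, as described in the text.
    """
    # Correlations among the observed inputs ...
    d = x_obs[:, None] - x_obs[None, :]
    R = np.exp(-theta * d**2)
    # ... and between x_new and each observed input.
    r = np.exp(-theta * (x_new - x_obs) ** 2)
    # Weights solve R w = r; the prediction is w' y.
    w = np.linalg.solve(R, r)
    return w @ y_obs

x_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.array([0.0, 1.0, 4.0, 9.0])
```

Because the weights are recomputed for every prediction point, nearby observed outputs dominate each prediction, and at an already-simulated input the prediction equals the simulated output exactly.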
Kleijnen et al.: A User’s Guide to the Brave New World of Designing Simulation Experiments
INFORMS Journal on Computing 17(3), pp. 263–289, © 2005 INFORMS 275
The optimal Kriging weights vary with the input value for which output is to be predicted, whereas linear regression uses one estimated metamodel for all input values.

If analysts are interested in the I/O behavior within a local area, then a low-order polynomial may be an adequate metamodel. However, for an experimental area that is global (not local), Kleijnen and van Beers (2005) demonstrate that a low-order polynomial gives poor predictions compared with a Kriging metamodel. Giunta and Watson (1998) also compare Kriging with polynomial metamodels. Jin et al. (2001) compare Kriging with polynomial metamodels, splines, and neural nets. More recently, van Beers and Kleijnen (2003) apply Kriging to stochastic simulation; Jin et al. (2002) discuss the accuracy of Kriging and other metamodels under a sequential sampling approach.

Note that in deterministic simulation, Kriging has an important advantage over linear regression analysis: Kriging gives predicted values at observed input values that are exactly equal to the simulated output values. Deterministic simulations are used for computer-aided engineering in the development of airplanes, automobiles, computer chips, computer monitors, etc.; see the pioneering article of Sacks et al. (1989), and Simpson et al. (2001) for an update. Lophaven et al. (2002) have developed a MATLAB toolbox for Kriging approximations to computer models, but the commercially supported software products currently available (such as the Kriging software in S-Plus) are intended for real-world data, and so limited to three dimensions.

In theory, if a design is used to generate multiple outputs, they will be accounted for in the analysis. For example, multivariate regression analysis may be applied. Each output is usually analyzed individually in practice. For linear regression analysis, Khuri (1996) proves that this suffices if all outputs are generated by the same design. The same design is indeed used when running the simulation and observing multiple outputs.

4. Design Toolkit: What Works and When
Now that we have identified several characteristics of simulation settings and designs, it is time to match them together. Consider Figure 1, in which we chart some designs according to two dimensions that together describe the simulation setting. The horizontal axis represents a continuum from simple to complex response surfaces. Since the metamodel complexity depends on both the deterministic and stochastic components, there is not a unique mapping. We list some of the assumptions along the axis to inform the users about the types of metamodels that can be fit. The vertical axis loosely represents the number of factors. So the lower left represents simple response surfaces with only a handful of factors, that is, the traditional DOE setting with Plackett-Burman designs developed in the 1940s, etc. The upper right represents very complex response surfaces with many factors.
[Figure 1: Recommended Designs According to the Number of Factors and System Complexity Assumptions. The vertical axis runs from few to many factors; the horizontal axis runs from simple to complex response surfaces. Designs charted include coarse grids (2^k factorials), central composite designs (CCD), fine grids (m^k factorials), R4 and R5 designs, and combined designs.]
We do not present a comprehensive list of all available designs, but rather describe those that seem most promising and are either readily available or fairly easy to generate.

We hope to change the mindset of those who might otherwise begin experimentation by focusing on a small number of factors, so we advocate using designs near the top of this figure. In this way, analysts can look broadly across the factors in the simulation study. The analyst willing to make simplifying assumptions can start from the left of the figure, which will tend to reduce the initial data-collection effort. (Of course, whenever assumptions are introduced, their validity should be checked later.) Alternatively, the analyst can start from the upper right of the figure for an initial experiment if little is known about the nature of the response. Employing CRNs or other VRTs can make certain procedures more efficient, and perhaps allow the analyst to handle more factors or make fewer assumptions for a given computational effort. In our experience, VRTs other than CRNs seldom give dramatic efficiency gains—except for rare-event simulations—but others report they have found VRTs quite effective in large-scale simulations.

If this initial experiment does not completely address the main goal, then preliminary results can be used to design new experiments (augmenting the current data) to focus on factors or regions that appear most interesting. This corresponds to moving south in Figure 1 to focus on the short list of factors selected after the initial experiment while holding the remaining factors to only a few configurations. If assumptions about the response-surface complexity are made for the initial experiment, then moving south-east is beneficial to check their validity. If few assumptions are made and the initial analysis indicates that the response surface is not very complex, then moving south-west allows the analyst to take advantage of highly efficient designs in subsequent experiments.

We now provide brief descriptions of the designs in Figure 1 and their characteristics, along with references for details. A few sample designs are given in the Online Supplement (also see Kleijnen 2005).

4.1. Gridded or Factorial Designs
Factorial designs are easy to explain to someone unfamiliar with classic DOE. A popular type of factorial is a 2^k design, which examines each of k factors at two levels, and simulates all resulting combinations. Then it is possible to fit a metamodel including all interactions, not only between pairs of factors, but also among triplets, etc.

Considering more complex metamodels (i.e., moving to the right in Figure 1), the analysts may use finer grids. Three levels per factor result in 3^k designs; in general, m levels per factor result in m^k designs. When there are more than a few factors, analysts may use different grids for different groups of factors, employing finer grids for those factors thought to be important to enable them to either view nonlinearities in the response surface or test the linearity assumption. Unfortunately, the number of scenarios n grows exponentially when k increases, so factorial designs are notoriously inefficient when more than a handful of factors are involved. Nevertheless, these designs are an important tool since they are easy to generate, plot, and analyze. Hence, whenever individual run times are minimal, the benefit of detailed information about the nature of the response surface may easily outweigh the additional computation time relative to the more efficient designs we discuss next.

4.2. Resolution 3 (R3) and Resolution 4 (R4) Designs
A design's resolution determines the complexity of metamodels that can be fit, with higher-resolution designs allowing more complex models. Specifically, "a design of resolution R is one in which no p-factor effect is confounded with any other effect containing less than R − p factors" (Box et al. 1978, p. 385). Two effects are confounded when they cannot be separately estimated. For metamodels with main effects only (i.e., first-order metamodels with no interaction terms), it can be proved that the most efficient designs are R3 designs, provided the white-noise assumption holds. If k + 1 is a power of two, R3 designs are fractional factorial designs, denoted as 2^(k−p) designs, where the total number of design points is 2^(k−p). If k + 1 is not a power of two but is a multiple of four, then R3 designs are tabulated as Plackett-Burman designs. See any DOE textbook for details (e.g., Box et al. 1978).

If interactions are assumed to be present but users are mainly interested in estimating first-order effects, then R4 designs are appropriate. These designs give unbiased estimators of main effects even if two-factor interactions are present. They can be easily constructed through the fold-over procedure, i.e., after executing the R3 design, run the mirror design that reverses each high and low value in a specific factor's column. In other words, proceed in two stages by first running an R3 design and then augmenting it to an R4 design. (See also the RSM designs in Donohue et al. 1993.)

Even if the white-noise assumption does not hold, classic designs produce unbiased estimators of the metamodel parameters, although not necessarily with minimum standard errors. If we account for analysts' time and energy, then these designs seem acceptable. Clearly, R3 designs (which use all scenarios to estimate all effects) give smaller standard errors for the estimated first-order effects than the popular practice of changing one factor at a time (which uses only two scenarios per effect).
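The two-stage fold-over construction just described can be sketched in a few lines. The generator assignment below (extra factors set to the interaction columns of a 2^3 base design) is one standard choice, used here only for illustration.

```python
import itertools
import numpy as np

# Base 2^3 full factorial in coded units (-1, +1) for three factors.
base = np.array(list(itertools.product([-1, 1], repeat=3)))
A, B, C = base[:, 0], base[:, 1], base[:, 2]

# Saturated resolution-3 design: 7 factors in 8 runs (a 2^(7-4) design),
# assigning the extra four factors to interaction columns (one standard
# choice of generators).
design_r3 = np.column_stack([A, B, C, A * B, A * C, B * C, A * B * C])

# Fold-over: append the mirror design, which reverses every high and low
# value. The combined 16-run design has resolution 4: main effects are
# no longer confounded with any two-factor interaction.
design_r4 = np.vstack([design_r3, -design_r3])
```

Appending the mirror design cancels the confounding between main effects and two-factor interactions, which is exactly the resolution-4 property.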
4.3. Resolution 5 (R5) Designs
If users are also interested in the individual two-factor interactions, then an R5 design is needed. Few 2^(k−p) R5 designs are saturated. Saturated designs include those of Rechtschaffner (1967), discussed by Kleijnen (1987, pp. 310–311) and applied by Kleijnen and Pala (1999). R5 designs require O(k^2) factor combinations, so they are less attractive if individual runs are time consuming. If an R4 design suggests that certain factors are unimportant, then computing requirements can be reduced by limiting the R5 design to fewer factors. The 2^(k−p) designs can be looked up in tables (Box et al. 1978; Kleijnen 1974–1975, 1987; or Myers and Montgomery 2002), but they are relatively easy to construct and therefore can be automated.

Fractional factorial designs (including R3, R4, and R5 designs) meet classic optimality criteria such as D-optimality for specific metamodels. Other designs that satisfy these criteria are derived in optimal design theory, pioneered by Kiefer (1959) and Fedorov (1972); see also Pukelsheim (1993) or Spall (2003). These so-called optimal designs typically lack the simple geometric patterns of classic designs, and are too complicated for many practitioners.

4.4. Central Composite Designs (CCD)
A second-order metamodel includes purely quadratic effects in addition to main effects and two-factor interactions. This means that the response functions need not be monotonic. The best-known designs for this case are CCD, with five values per factor. These values are coded as 0, ±1, ±c, with c ≠ 0, 1. It is possible to determine an optimal value of c if the white-noise assumption holds. Since this assumption does not hold for most simulation experiments, we do not worry too much about the choice of c, except to suggest that analysts choose an intermediate value for better space filling. Details on CCD can be found in any DOE textbook (Box et al. 1978, Box and Draper 1987, Montgomery 2000, or Myers and Montgomery 2002).

Actually, estimation of quadratic effects requires no more than three factor levels, so to save computer time analysts may again use saturated designs, which implies n = 1 + k + k(k − 1)/2 + k, namely, one overall mean, k main effects, k(k − 1)/2 interactions, and k purely quadratic effects. Kleijnen (1987, pp. 314–316) discusses several saturated design types, including so-called simplex designs and fractional 3^k designs. Kleijnen and Pala (1999) apply simple saturated designs; see also Batmaz and Tunali (2003).

4.5. Sequential Bifurcation (SB)
In practice, there are situations with many factors but few important ones. In such cases, a main-effects model—possibly augmented with two-factor interactions—may suffice. Moreover, users may be able to specify the sign (or direction) of each potential main effect. In these situations, the individual factors can be aggregated into groups such that individual main effects will not cancel out. Group screening can be very effective at identifying important factors. A practical and efficient group-screening procedure is SB. For example, in an ecological case study, 281 factors are screened after only 77 factor combinations are simulated, resulting in only 15 important factors; see Bettonvil and Kleijnen (1997). If interactions might be important, SB still gives unbiased estimators of the main effects, provided the number of combinations is doubled (similar to the fold-over principle for R3 and R4 designs discussed above). If allowed to run to completion, SB will keep subdividing factor groups unless the estimated aggregate effect for a group is either insignificant or negative, or it identifies individually significant factors. However, SB can be stopped at any stage, and it will still provide upper bounds for aggregated effects, as well as estimates of any individual effects already identified. The most important factor is identified first, then the next most important factor, and so on. Consequently, SB is robust to premature termination of the experiment.

Bettonvil and Kleijnen (1997) discuss SB for deterministic simulations. Cheng (1997) extends the method for stochastic simulations. Kleijnen et al. (2005) also discuss SB for random simulations, including a supply-chain case study. Wan et al. (2003) propose a modification, called controlled sequential bifurcation, and provide proof of its performance under heterogeneous-variance assumptions. Other screening techniques with less restrictive metamodels are discussed by Campolongo et al. (2000), Dean and Lewis (2004), Holcomb et al. (2000a, b), Lin (1995), and Trocine and Malone (2001). Their performance relative to SB needs further research.

4.6. Latin Hypercube Sampling (LHS)
For situations involving a relatively large number of factors, McKay et al. (1979) proposed LHS. Let k still define the number of factors, let n denote the number of design points desired (n ≥ k), and define n levels per factor. Each column of the design matrix is a random permutation of the factor levels. LHS is so straightforward that it is incorporated in popular add-on software (such as @Risk) for spreadsheet simulation; see Sugiyama and Chow (1997).

LHS designs have good space-filling properties—particularly if several LHS designs are appended—so they are efficient ways of exploring unknown, but potentially complicated, response surfaces with many quantitative factors. For LHS in Kriging (which assumes smooth metamodels, possibly with many local hilltops), see Koehler and Owen (1996), Morris and Mitchell (1995), Simpson et al. (2001), Pacheco et al. (2003), or Santner et al. (2003).
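A basic LHS matrix as defined in §4.6 (each column a random permutation of the n factor levels) takes only a few lines of code; scaling the levels to stratum midpoints on [0, 1] is an arbitrary choice for this illustration.

```python
import numpy as np

def latin_hypercube(n, k, rng=None):
    """Basic LHS: each of the k columns is an independent random
    permutation of n equally spaced levels (stratum midpoints on
    [0, 1]), so every factor is sampled once in each of its n strata."""
    rng = np.random.default_rng(rng)
    levels = (np.arange(n) + 0.5) / n
    return np.column_stack([rng.permutation(levels) for _ in range(k)])

X = latin_hypercube(n=10, k=4, rng=42)  # 10 design points for 4 factors
```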
There are numerous variants of basic LHS. Assuming a linear metamodel, Ye (1998) developed an algorithm for orthogonal LHS. Cioppa (2002) extended the number of factors that can be examined in orthogonal LHS within a fixed number of runs. Moreover, he found that by giving up a small amount of orthogonality (allowing pairwise correlations between the design columns less than 0.03), the analysts can dramatically increase the space-filling property of these designs. His LHS designs are not easy to generate, but are tabulated (Cioppa and Lucas 2005) and thus useful in situations where the total number of runs is limited (perhaps because individual simulation runs are time consuming).

4.7. Frequency-Based Designs
For quantitative factors, a frequency-based approach makes each factor oscillate sinusoidally between its lowest and highest value at a unique and carefully chosen frequency. If the simulation is coded so that factors can be oscillated during the course of a run (called the signal run), then comparisons can be made to the noise run, where all factors are held at nominal levels. This approach has been advocated as a screening tool for identifying important metamodel terms; see Schruben and Cogliano (1987) and Sanchez and Buss (1987).

More recently, frequency-based designs have been used to set factor levels for scenarios externally. That is, factor levels remain constant during the course of the simulation run, but they change from run to run; see Lucas et al. (2002) or Sanchez and Wu (2003). These designs have reasonably good space-filling properties. Moreover, there is a natural gradation in the granularity of sampling. Factors oscillated at low frequencies are sampled at many levels, whereas factors oscillated at high frequencies are sampled at fewer levels. This property may help analysts design an experiment to be robust to early termination, for example, by choosing higher oscillation frequencies for those factors believed a priori to be most important. By carefully choosing the oscillation frequencies, it is possible to use the results to fit second- and third-order metamodels. The designs are relatively easy to construct and to implement (Jacobson et al. 1991, Morrice and Bardhan 1995, Saltelli et al. 1999, or Sanchez and Wu 2003).

4.8. Crossed and Combined Array Designs
Selecting designs for finding robust solutions falls naturally into the upper middle portion of Figure 1. While there may be many factors, the analysts are interested in a metamodel that captures the impact of the decision factors only. So their metamodel (while it may be complex) does not require estimation of all factor and interaction effects. Actually, the noise factors enter into the metamodel via their impact on the variability of the response for a particular combination of decision-factor levels. This clear division of factors suggests that the analysts sample the two sets differently—for example, by crossing a high-resolution design for the decision factors with a lower-resolution design for the noise factors. Crossing means that each combination of decision-factor values is simulated for each environmental scenario, which is defined by the combination of values of the environmental factors. These environmental scenarios enable estimation of the mean and variance of the simulation response per combination of decision-factor values. Instead of a crossed design, the analyst may use a combined (or combined array) design (Shoemaker et al. 1991, Myers et al. 1992). In a combined design, a single design matrix (such as a factorial) is used with columns divided among parameters and noise factors. As Myers et al. (1992) suggest, this can lead to a great reduction in the data-collection effort, since the only interactions that need to be estimated are those involving two decision factors. Sanchez et al. (1996) apply both crossed and combined designs to explore a job-shop simulation model. In §6 we illustrate a crossed design to identify robust decision-factor settings in a small case study; the design is provided in the Online Supplement.

Many types of designs have been used in this context. Taguchi (1987) proposes a particular class of orthogonal designs, but these are intended for factory experiments and are limited to main-effects models, which we find too restrictive for simulation environments. Ramberg et al. (1991) use a sequential approach, beginning with a 2^(k−p) design augmented with a center point for the decision factors, and a saturated or nearly saturated factorial for the noise factors. Moeeni et al. (1997) use three levels (varied across runs) per decision factor and frequency-based oscillation (varied within a run) for 35 noise factors. Cabrera-Rios et al. (2002) propose three levels per decision factor and two levels per environmental factor. If the number of decision factors is not too large, then the analysts may cross a CCD for the decision factors with LHS for the noise factors; see the case study in Kleijnen et al. (2005). If the number of decision factors is large, then orthogonal or nearly orthogonal LHS may be a good design for the decision factors. In short, crossed designs are easy to generate, and the two subdesigns can be chosen to achieve the characteristics (space-filling, orthogonality, efficiency) that are most pertinent to the problem at hand.

Crossed designs can be exploited in situations other than robustness studies. Lucas et al. (1997) give an example of group screening within a fractional factorial design crossed with LHS.
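The crossing described in §4.8, for example a CCD for the decision factors crossed with LHS for the noise factors, amounts to a cartesian product of the rows of the two subdesigns. The helper functions and the small design sizes below are our illustration, not taken from the paper.

```python
import itertools
import numpy as np

def central_composite(k, c=1.5):
    """CCD in coded units: 2^k cube points, 2k axial points at +/-c,
    and one center point, so each factor takes five levels."""
    cube = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))
    axial = np.zeros((2 * k, k))
    for i in range(k):
        axial[2 * i, i], axial[2 * i + 1, i] = -c, c
    return np.vstack([cube, axial, np.zeros((1, k))])

def latin_hypercube(n, k, rng):
    """Basic LHS: each column is a random permutation of n stratum midpoints."""
    levels = (np.arange(n) + 0.5) / n
    return np.column_stack([rng.permutation(levels) for _ in range(k)])

def cross(decision, noise):
    """Crossed design: every decision-factor combination is simulated
    under every noise (environmental) scenario."""
    d = np.repeat(decision, len(noise), axis=0)
    e = np.tile(noise, (len(decision), 1))
    return np.hstack([d, e])

rng = np.random.default_rng(1)
decision = central_composite(2)        # 2^2 + 2*2 + 1 = 9 decision points
noise = latin_hypercube(5, 3, rng)     # 5 environmental scenarios
design = cross(decision, noise)        # 9 * 5 = 45 runs
```

Grouping the 45 runs by decision point then yields the mean and variance of the response per decision-factor combination, as the text describes.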
Lucas et al. (2002) discuss the benefits of combining multiple designs after classifying factors into several groups based on their anticipated impact. This allows analysts much more flexibility than simply putting each factor into (or leaving it out of) the experiment.

4.9. Summary
We have presented several design options for simulation experiments involving either a few or many factors. If runs are extremely time consuming, then analysts can reduce the computational effort by making assumptions about the nature of the response surface. These assumptions can be checked after the runs are completed, as we describe in §5. We contrast this approach to arbitrarily limiting the number of factors. Indeed, if the analysts change only a few factors while keeping all other factors constant, then the conclusions of the study may be extremely limited.

We have not attempted to list all designs that have been proposed for simulation experiments. For example, we have not placed any simulation-optimization methods in Figure 1, although optimization can be viewed as a means of comparing systems under very specific conditions. Our goal is to suggest some designs that analysts can readily use.

5. Checking the Assumptions
Whichever design is used, sound practice means checking assumptions. With designs from the right of Figure 1, few assumptions are made about the nature of the response surface. In the process of fitting a metamodel, analysts determine what (if any) assumptions are reasonable. If they start in the upper left corner of Figure 1, then the experiment is likely used to screen the factors and identify a short list as the focus of subsequent experimentation. If so, there are likely to be fewer assumptions during the next stages of experimentation. If they start from the lower left (as traditional DOE does), then it may be essential to confirm that the resulting metamodel is sufficient, or to augment it appropriately.

One check has the signs of the estimated effects evaluated by experts on the real system being simulated. For example, does a decreased traffic rate (resulting from adding or training servers) indeed reduce the average waiting time? Another example is the case study by Kleijnen (1995) on a sonar simulation, in which naval experts evaluate the signs of the metamodel effects; because all signs are accepted, the underlying simulation model is considered to be "valid." In general, checking the signs may be particularly applicable when the goal of the simulation study is general understanding rather than prediction, as for the agent-based models discussed earlier. Sometimes intuition is wrong and needs to be challenged. For example, Smith and Sanchez (2003) describe a forecasting project where the model of losses (incurred for certain groups of loans) had the "wrong" signs. Examination of the detailed files confirmed that their patterns differed from the vast majority of loans and revealed why, so that the model ended up providing new—and valid—insights to experts. Another example is the ecological case study that Bettonvil and Kleijnen (1997) use to demonstrate SB: the resulting short list of factors includes some that the ecological experts had not expected to be important.

Another check compares the metamodel predictions to the simulation outputs for one or more new scenarios (which might be selected through a small LHS design). If the results do not differ significantly, the metamodel is considered acceptable (see any textbook on linear models, such as Kutner et al. 2004; also Kleijnen et al. 1998). Kleijnen and Sargent (2000) discuss how to use output from initial simulation experiments to test the metamodel constructed from other scenarios in subsequent experiments. They refer to this as validating metamodels, not to be confused with validating a simulation model.

The assumption of normal IID errors can be examined via residual analysis (if regression is used to fit the metamodels), or by taking additional replications at a few design points. Tunali and Batmaz (2000) investigate procedures for validating this and other assumptions for least-squares metamodel estimation.

Note that higher-order interactions are notoriously difficult to explain to users; nevertheless, traditional DOE routinely estimates and tests these interactions. One solution transforms the original inputs or outputs of the simulation model, to simplify the metamodel. For example, replacing two individual factors by their ratio may help in queueing simulations, where the arrival and the service rates are combined into the traffic rate; in combat models, the relative strength may provide a better metamodel than the individual absolute strengths of the two combatants. Furthermore, logarithmic transformations of inputs and outputs may provide a better-fitting metamodel in queueing problems; see Kleijnen and Van Groenendaal (1992).

Unfortunately, it may be difficult or impossible to transform individual factors to achieve simple metamodels, particularly when multiple performance measures are collected. One option might be to transform certain responses. We have observed instances where a transformation serendipitously yields responses of direct interest to the decision maker (such as the differences in, rather than magnitudes of, sensor ranges), while allowing the analyst to fit simpler models in the transformed spaces.
Transformations of responses are sometimes applied to satisfy the assumptions underlying the statistical analysis, rather than to construct performance measures of interest to the decision maker or to simplify the form of the metamodel. This may require a specialized analysis approach. For example, Irizarry et al. (2003) develop the so-called MLE-Delta method after finding that constructing metamodels and back-transforming the results may yield highly biased point and confidence-interval estimates of the original (untransformed) responses. Often, simpler methods that do not rely on transformation suffice, particularly when we seek insights rather than predictions. For example, classic ordinary least squares (OLS) assumes normally and independently distributed simulation responses with constant variances across different scenarios, but the resulting estimators are unbiased even when the variances are not constant, and the usual test statistic is known to be very insensitive to departures from the normality assumption. This means a readily available tool can be used for analysis. A simple method that accommodates variance heterogeneity and CRNs (but does not rely on transformation) replicates the design (say) m times, estimates the metamodel from each replication, and computes confidence intervals for the metamodel's parameters and predictions from these m replicates using a Student t statistic with m − 1 degrees of freedom (Kleijnen 2005, Appendix A). We reiterate that if substantial variance heterogeneity is present, it should be characterized and conveyed directly to the decision maker.

Most practical simulation models have (say) w multiple outputs, or a multivariate response in statistical terminology. Fortunately, in the case of linear regression modeling, the OLS estimators computed for the w individual responses are identical to the generalized least squares estimators whenever the multivariate response is generated by a single design (Rao 1967).

Even with careful thought and planning, it is rare that the results from a single experiment are so comprehensive that the simulation model and its metamodel(s) need never be revisited. In practice, results from experiments often need to be modified, i.e., expanded or thrown out to obtain more detailed information on the simulation performance for a smaller region of the factor combinations. These modifications are determined in large part by the expertise of the analysts. This illustrates a need for semi-automatic methods for suggesting design refinements, which can be tricky. For example, suppose the analysts have built a response-surface model that accurately characterizes simulation performance over a particular region of the factor space. Over time, factor-level combinations are no longer of primary interest, and therefore additional experiments are conducted. When is it appropriate to use a global metamodel (with data from all experiments) instead of focusing on several local metamodels (over smaller ranges)? This question merits additional research.

6. Case Study: Humanitarian Assistance Operations
Clearly, no single investigation will use all experimental designs described in §4, even though they represent only a subset of possible designs. To illustrate a number of the points made in the paper, we now present a small case summary of an investigation of an agent-based model of humanitarian assistance operations in urban environments. We use a scenario developed by Wolf (2003) but expand the investigation to illustrate the central points of this paper more fully. Additional details are provided in the Online Supplement.

Wolf (2003) examines a humanitarian assistance operation implemented using the MANA software platform (Lauren and Stephen 2002). A convoy with a security escort follows a given route to the most southern of two food-distribution sites in an urban environment. The convoy enters from the northeast corner, traveling west, and then turns south toward its final destination. Initially, the northern civilians make their way to the northern site, while the southern civilians move toward the southern site. As the northern civilians sense the trucks passing by, they speed up and try to follow the trucks. A lone aggressor searches for the convoy, provides harassing fire, and then runs away. The security escort returns fire if it identifies the aggressor, while the convoy responds by speeding up and driving out of the area. Once it reaches the southern site, the convoy begins distributing food. The simulation runs for a fixed time that represents a single day's operation; initial conditions differ across runs due to random initial placement of agents within a defined border. The output measures include the numbers of northern and southern civilians fed, whether or not the aggressor is killed, and whether one of the convoy trucks is destroyed.

Several of the goals for this initial investigation (Wolf 2003, Wolf et al. 2003) are consistent with those in §2.1. One goal is to see whether gaining a better understanding of the model's behavior offers general insights to those interested in using agent-based models for humanitarian assistance operations. A second goal is to determine whether a robust strategy exists for the convoy; that is, are there choices that improve its ability to feed people over a broad range of environments? If no robust strategy emerges, policies can be compared within a few environments to see if appropriate strategies can be identified for more limited
the external environment changes so that the initial ranges of external conditions.
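The replicate-the-design approach described earlier (fit the metamodel once per replication, then form Student t intervals across the m replicates) can be sketched in a few lines. The design, "true" coefficients, and heteroscedastic noise model below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# 2^2 factorial design for a first-order metamodel: intercept + two factors.
X = np.array([[1., -1., -1.], [1., -1., 1.], [1., 1., -1.], [1., 1., 1.]])
true_beta = np.array([10.0, 3.0, -2.0])  # hypothetical "true" coefficients

m = 10                    # replications of the entire design
betas = np.empty((m, 3))
for r in range(m):
    # Heterogeneous noise: the variance depends on the first factor's level.
    noise = rng.normal(0.0, 1.0 + 0.5 * (X[:, 1] + 1.0))
    y = X @ true_beta + noise
    betas[r] = np.linalg.lstsq(X, y, rcond=None)[0]  # one OLS fit per replicate

# Student t intervals across replicates, m - 1 = 9 d.f.; t_{0.975, 9} ~ 2.262.
mean = betas.mean(axis=0)
halfwidth = 2.262 * betas.std(axis=0, ddof=1) / np.sqrt(m)
lower, upper = mean - halfwidth, mean + halfwidth
```

No variance-homogeneity assumption is needed here, because each replication of the whole design yields one independent estimate of every metamodel parameter.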
Kleijnen et al.: A User’s Guide to the Brave New World of Designing Simulation Experiments
INFORMS Journal on Computing 17(3), pp. 263–289, © 2005 INFORMS 281
Forty factors are chosen for exploration. This means from the outset that efficient experimental designs are needed to allow a broad look across the factors. The factors are all quantitative, and involve an agent's propensities for movement under different environmental conditions, and its ability to sense, communicate, and interact with other agents. The 15 convoy and security factors are considered decision factors since they reflect actions or capabilities of the Marines. Attributes and behaviors of the civilians (14 factors) and the aggressor (11 factors) characterize a variety of environments in which the food-distribution operation could take place. Few assumptions are made about the nature of the response surface, so the experimental setting falls in the upper right corner of the design space in Figure 1.

By appending 16 square random Latin hypercubes (each involving 40 design points), Wolf (2003) examines the impact of simultaneously changing the specified values of these 40 factors. The final experiment (with 50 replications at each of 16 × 40 = 640 design points) requires 32,000 simulation runs, but even this relatively large number of runs can be completed in 7.5 hours on a computing cluster.

For exploratory purposes, we begin by averaging the responses at each design point and graphically assessing the results. Rather than summarizing the analysis in Wolf (2003), we present examples of types of graphs that we have found particularly useful for developing an understanding of the system behavior. Unfortunately, static black-and-white pictures cannot fully portray the insights gained from using these multicolor graphs interactively, but we briefly describe the dynamic aspects in the text. Figure 2 is a mean diamonds plot of the average number fed as a function of the communication distance, where this distance indicates how close the civilians must be to the convoy to see or hear it passing by, and how close they must be to other civilians to share this information. This plot was constructed using JMP IN® (SAS Institute 2004), but other statistical packages can produce similar plots. The plotted points are the average number of civilians fed for each of the 640 design points. Each diamond represents a 95% confidence interval for the mean response associated with specific communication distances, i.e., a set of "reasonable" values for the underlying mean response. The diamond's upper and lower points correspond to the upper and lower confidence interval limits, and the center line is the mean. The circles on the right portion of the graph correspond to simultaneous 95% confidence intervals for all pairwise comparisons. These arise from the Tukey-Kramer honestly significant difference (HSD) test, which is an exact test if the sample sizes are the same and a conservative test if they are unequal; see Hayter (1984). Clicking on one of these circles highlights (in boldface) the communication distance in question, and displays the set of all communication distances whose average responses are not significantly different by unitalicizing the labels and displaying both the circles and labels in red. We box these to aid the reader viewing the plot in black and white. This type of graph is useful because it allows the analyst to check the overall direction of main effects, as well as explore whether there are any interesting patterns, clusters, or outliers among the means. We have uncovered software logic problems using similar plots.

Once the data have been screened for extreme outliers, regression models can be used to identify factors and interactions that seem to play important roles in determining the response. Wolf (2003) fits several models, balancing model simplicity against the explanatory power. He examines how the convoy
Figure 2 Mean Diamonds Plot of Average Number Fed vs. Communication Distance
[Figure omitted: y-axis "Mean Fed" (30–70); x-axis communication distances from 5 to 100; "All Pairs Tukey-Kramer 0.05" comparison circles at the right, with one highlighted value.]
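The Tukey-Kramer comparisons behind a plot like Figure 2 can also be computed outside JMP. A minimal numpy sketch on synthetic data follows; the group means, common sample size, and the studentized-range critical value q are illustrative, not taken from the study:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "mean fed" samples at four communication-distance levels, 50 runs each.
n = 50
groups = [rng.normal(mu, 5.0, n) for mu in (40.0, 42.0, 55.0, 57.0)]

means = np.array([g.mean() for g in groups])
mse = np.mean([g.var(ddof=1) for g in groups])  # pooled within-group variance

# Tukey-Kramer HSD: declare means i and j different when
#   |mean_i - mean_j| > q * sqrt((mse / 2) * (1/n_i + 1/n_j)),
# where q is the 0.95 studentized-range quantile (about 3.66 for 4 groups
# and 196 error d.f., from standard tables).
q = 3.66
halfwidth = q * np.sqrt((mse / 2.0) * (2.0 / n))
different = np.abs(means[:, None] - means[None, :]) > halfwidth
```

With equal sample sizes this is the exact HSD test; with unequal sizes the same formula yields the conservative Tukey-Kramer version (Hayter 1984).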
[Figure 3 omitted: matrix of interaction profiles for the communication factors, with max info age at 1 and 500, contact distance at 5 and 100, commun distance at 5 and 100, and contact commun distance at 5 and 100; each panel plots Mean Fed on a 30–70 scale.]
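A profile like those in Figure 3 can be traced directly from a fitted metamodel by sweeping one factor while holding its interaction partner at its extremes. The sketch below uses purely illustrative coefficients (not fitted to the study data):

```python
import numpy as np

# Hypothetical fitted metamodel:
#   fed = b0 + b1*commun + b2*age + b12*commun*age   (coefficients illustrative)
b0, b1, b2, b12 = 45.0, 0.20, -0.005, -0.0003

commun = np.linspace(5.0, 100.0, 20)       # communication-distance sweep
profiles = {age: b0 + b1 * commun + b2 * age + b12 * commun * age
            for age in (1, 500)}           # max information age at its extremes

# Non-parallel profiles reveal the interaction: the payoff from raising the
# communication distance shrinks when the shared information is old.
slopes = {age: b1 + b12 * age for age in (1, 500)}
```

Both slopes remain positive, mirroring the finding that a larger communication distance helps even with old information, but helps less.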
agents can affect the success of the food-distribution operation by specifying appropriate values for factors under their control, but no robust strategy emerges since he finds that the convoy's decision factors have very little impact on performance. The model is improved substantially if communication distance is used as a surrogate for the convoy broadcasting its presence while traveling and reclassified as a decision factor.

Interpreting the impact of model terms is often difficult from the regression output, particularly when interactions and quadratic effects are present or the factors have very different scales. Interaction profiles, such as those in Figure 3, are useful graphical summaries. The tiny subplots depict the four significant interactions in a model involving only communication factors. The factors are the maximum age of information shared among civilians, how close the civilians need to be to the convoy to begin following it (which also determines how close other civilians must be to influence movement), the communication distance (described earlier), and the communication distance once they come into contact with the convoy (which may differ). Curves indicate quadratic effects and dashed lines indicate that interactions are not present. For example, the third box in the top row shows that when information stays current (maximum information age = 1), increasing the communication distance also increases the average number of civilians fed. Increasing communication distance is also valuable—up to a point—even when the shared information may be old (maximum information age = 500), but fewer are fed than when current information is available.

The above graphs and analytic methods all deal with a single performance measure at a time. A graph often used in the social sciences is the parallel coordinates plot (Wegman 1990). This displays several variables (input factors and performance measures) simultaneously and connects the (scaled) variable values for each design point with lines. Figure 4 is a parallel coordinates plot for the humanitarian-assistance study. For simplicity, we plot only 4 of the 40 input factors and four performance measures. Because each input factor takes on 40 values in our structured design, the left of the plot is quite dense. The performance measures do not follow regular patterns.

Once again, plots like Figure 4 are most useful when they can be used interactively, when clicking on a particular line will highlight the values of all variables for the corresponding design point. The three design points corresponding to the highest observed convoy losses are highlighted in Figure 4. These design points are also associated with lower losses for the security escort and moderate probabilities of killing the aggressor, but sizeable numbers of civilians are still fed. The highlighted lines also relate the responses to the input factors. For example, the initial communication distances are all high, while no particular pattern is revealed for the contact distance. Ideally, highlighting design points in the plot also highlights them in the data set so interesting subsets
Figure 4 Parallel Coordinates Plot with Four Factors and Four Performance Measures
[Figure omitted: axes, left to right, are contact commun distance, max info age, contact distance, commun distance, avg convoy killed, prob security killed, prob aggressor killed, and avg fed; three design points are highlighted.]
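The data preparation behind a plot like Figure 4 is simply a per-variable rescaling; a sketch with made-up columns (the variable names follow the figure, the values are synthetic):

```python
import numpy as np

rng = np.random.default_rng(5)
n_points = 10  # toy number of design points

# Columns: two input factors and two performance measures (synthetic values).
data = np.column_stack([
    rng.uniform(5, 100, n_points),   # commun distance
    rng.uniform(1, 500, n_points),   # max info age
    rng.uniform(30, 70, n_points),   # avg fed
    rng.uniform(0, 1, n_points),     # prob aggressor killed
])

# Min-max scale every variable to [0, 1] so all columns share one vertical
# axis; each row of `scaled` is then drawn as one polyline across the axes.
lo, hi = data.min(axis=0), data.max(axis=0)
scaled = (data - lo) / (hi - lo)
```

Plotting `scaled.T` as line series (e.g., with matplotlib) renders the polylines; interactive highlighting amounts to restyling the selected rows.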
of the data can be easily extracted and examined in more detail.

Scatter plots can display how two responses are related to one another, though some care must be taken when the data sets are large. Since multiple points may result in the same pair of response values, a straightforward scatter plot will not necessarily reveal where the majority of the data are clustered. Instead, points can be offset slightly by addition of random noise (sometimes called jitter), the point sizes can vary based on the number of data points represented, or points can be plotted using transparent colors so that intense colors correspond to high data densities. (Similar approaches may be valuable for residual plots following regression analysis on very large data sets.) The data can be displayed as a collection of small plots or graphs, e.g., by splitting the data into subsets according to values of two input factors and constructing a scatter plot of two responses for each subset. These collections are also called small multiples, trellis, or tiled plots.

In all, Wolf (2003) explores several different models based on different sets of potential factors, including models involving the convoy's decision factors, models involving communication factors, and models involving all decision and noise factors. We draw on the results of his exploratory investigation to develop a second experimental design to examine robust strategies more closely. Two of the original 40 factors are dropped (Wolf 2003), and the remaining 38 are divided into three groups: four decision factors (the initial convoy and security movement speeds, along with the northern civilians' communication and contact communication distances), seven environmental factors that show up in at least one of his models, and 27 environmental factors that have little apparent impact on humanitarian assistance operations in the initial experiment. We use a 17-run orthogonal LH design for the four decision factors, and an 8-factor, 33-run nearly-orthogonal LH design for the noise factors. Here we group all 27 not-so-interesting noise factors to form the eighth factor. We then cross the two designs and run 50 independent replications at each of the 561 design points for a total of 28,050 runs (details are provided in the Online Supplement). Our main purpose is to facilitate comparisons in our search for robust solutions by ensuring orthogonality among the decision factors. The near-orthogonality of the noise-factor design also makes it easier to identify strategies that depend on noise factors deemed important in the initial phase of analysis. At the same time, embedding the other noise factors in the nearly-orthogonal design allows us to check the initial conclusion that these factors are less important. We keep the sample size roughly comparable to that of the first experiment so that results can be obtained within one working day.

We use squared-error losses to examine the robustness of the responses. This entails specifying a target that represents the "ideal" response value. In our case, a natural target is the total number of civilians seeking food (τ = 70), although other values are possible. If μ_x and σ_x² denote the response mean and variance at a particular design point x, then the expected loss is proportional to σ_x² + (μ_x − τ)² (see Ramberg et al. 1991, or Sanchez et al. 1996). We compute the average (scaled) losses for each of the 17 design points. The most robust of these design points has an average loss of 375; its settings correspond to high civilian communication distances with the convoy and security escort traveling at moderately low speeds, and results in feeding 57, on average, which is still far from the highest possible value. Average losses for three other design points are less than 424 (13% larger). The remaining design points are far from robust; their average losses range from 540 to 1170 (44% to 212% larger).
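The loss computation above is a one-liner once the replications are in hand. A sketch with synthetic replication data follows; the per-design-point means and spreads are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(11)
tau = 70.0  # target: feed all 70 civilians seeking food

# Synthetic output: rows = 4 decision-factor design points, 50 replications each
# (the (mean, spread) pairs are invented).
settings = [(57.0, 8.0), (50.0, 20.0), (45.0, 10.0), (60.0, 25.0)]
fed = np.vstack([rng.normal(mu, sd, 50) for mu, sd in settings])

mean = fed.mean(axis=1)
var = fed.var(axis=1, ddof=1)
loss = var + (mean - tau) ** 2    # expected squared-error loss, up to a constant
most_robust = int(loss.argmin())  # trades off closeness to target against spread
```

Here the first design point wins even though the last one has the mean closest to the target, because the loss also penalizes the last point's much larger variability.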
While the above analysis allows us to compare robustness of specific design points, finding out how the decision factors affect robustness is also of interest. A nonparametric tool we find particularly useful is a regression tree (see Gentle 2002) that recursively partitions the data to provide the most explanatory power for a performance measure of interest. Figure 5 is a portion of the regression tree for the loss (averaged across replications). The first split of the data (not shown) involves communication distance: when this distance is less than 17, the average loss is very high (1158); otherwise the loss is much lower (576). Figure 5 shows the next three splits. The "leaves" at the bottom of the branches in the regression tree denote subsets of the data that are similar in terms of their performance. Additional information, such as the overall R² value (a measure of the model's explanatory power) and the number of points and response standard deviation associated with each leaf, is available in the computer output. For the tree in Figure 5, R² = 0.29 (i.e., the tree "explains" 29% of the variability in the response). Constructing a regression tree is an interactive process. Leaves are added until the analyst is satisfied that enough explanatory power is obtained, branches can be pruned to simplify the model, and splits can be forced at certain leaves to examine smaller subsets of the data in more detail. We find it useful to tag the leaves as green, yellow, or red (for good, fair, or poor responses, respectively) when presenting the results.

Regression trees are a nonparametric approach to fitting a response to a set of data. Multiple regression can be used to suggest alternatives (i.e., combinations of factor levels that have not yet been examined) that might perform even better. Accordingly, we fit second-order models of the average loss involving only the decision factors. Another possibility would be to construct separate models for the response mean and variability (Sanchez et al. 1996). A simplified model involving four main effects, one interaction term, and one quadratic works essentially as well (R² = 0.30) as a full second-order model (R² = 0.31). (Residual plots reveal some variance heterogeneity, but the OLS estimators are nonetheless unbiased.) The results suggest that the convoy and security escort should travel slowly and broadcast their locations, particularly once they come into contact with civilians. This combination of factor values is not one of the design points for the decision factors, but the most robust of the 17 decision-factor combinations has the highest average communication distance and moderately low speeds. So, the regression results complement those of the regression tree, and can suggest alternatives whose robustness can easily be checked with a set of confirmation runs.

Finally, since we cross an orthogonal design matrix with a nearly-orthogonal one, we can assess the impact of adding noise (environmental) factor terms to our regression model without worrying about multicollinearity. Adding the significant noise factors and decision-by-noise interactions to the model increases R² from 0.30 to 0.62. An examination of the signs associated with the noise factors and interactions indicates that setting all factors to their low levels is a favorable environment for the relief efforts, while setting all to their high levels is unfavorable. This could be a first step in adapting the convoy's tactics to suit the environment.

As always, the results must be conveyed to decision makers effectively. We find that regression trees are often easier to explain than regression equations. Three-dimensional surface plots (landscapes) are useful and easily understood when gridded data are available. So, after a broad exploration is complete, it may be beneficial to run a gridded experiment involving only four or five factors to facilitate displaying the output. Once again, small multiples can be used to compare and contrast the surfaces when holding the undisplayed factors to different levels, or to compare several performance measures. For example,
[Figure 6 omitted: two side-by-side surface plots, each showing Response as a function of Factor 1 and Factor 2.]
each subplot in Figure 6 shows the average response for a performance measure as a function of two input factors, while three other input factors are held at specified levels. It is sometimes informative to plot multiple surfaces on a single graph, such as the lower and upper quartiles for a particular performance measure. For nongridded data, contour plots can provide similar insights. Both surface and contour plots are readily available in spreadsheets and statistical software, though once again there are benefits to interactive graphs. For example, slider bars corresponding to the undisplayed factors' values provide an easy way to see how the landscapes change; some statistical packages have similar profiling tools to see how the fitted surfaces change.

In summary, by using efficient experimental designs coupled with modern graphic and analytic tools, we are able to examine 40 factors at each of 40 levels simultaneously. The setup time is minimal since the factor levels are specified using spreadsheets, and the resulting file is used to update the factor values automatically to run the experiment. The total computational effort for the two experiments (50 replications at 1201 design points) is less than two-thirds the amount required to run a gridded design for any two of the factors at 40 levels (40² = 1600 design points) or any 11 factors at only two levels (2¹¹ = 2048). While the results indicate that second-order models suffice for this particular example, our designs require no such assumptions. The graphic and analytic results also provide insights that cannot easily be obtained using trial-and-error or by restricting the list of potential factors.

7. Conclusions and Future Research

Our primary goal in writing this paper is to help change the mindset of simulation practitioners and researchers. Indeed, we believe that practitioners should view DOE as an integral part of any simulation study, while researchers should move beyond viewing the simulation setting merely as an application area for traditional DOE methods. We advocate thinking first about three potential goals of a simulation experiment, namely, (i) understanding a system, (ii) finding robust solutions, and (iii) comparing two or more systems. We contend that the above goals are often more appropriate than those typically used, namely, testing hypotheses about factor effects, seeking an optimal policy, or making predictions about performance. To illustrate our points, we describe examples from decades of combined experience. We also describe many characteristics of the simulation setting that call for nontraditional designs as part of the simulation analyst's toolkit. In particular, simulation experiments are often characterized by a large number of potential factors, complex response surfaces, time-varying correlated output streams, and multiple performance measures. Analysts also have the opportunity to control simulation-specific factors (such as run lengths, random-number streams, and warm-up periods) that can be exploited for additional design efficiencies. Steady-state simulations offer the possibility for batching output or conducting very long runs that may not have useful analogs in real-world experiments.

Another change in mindset occurs when analysts begin thinking explicitly about sequential experimentation. This has two major implications. First, it means that a sequence of experiments may allow the analyst to gather insights efficiently. Second, even for one-shot experiments, it may be beneficial to sequence the simulation runs appropriately to allow for useful partial information as preliminary results become available or in case the experiment is halted prematurely. We argue that the data-collection effort consists not only of the number and length of the simulation runs, but also the effort required to generate the experimental designs and manage the runs. Emphasizing solely the former may unnecessarily limit the choice of experimental designs. A related idea is the benefit of coding the simulation model in a way that facilitates creating a list of potential factors and subsequently modifying their levels. At the same time, conveying the results effectively remains a challenge for high-dimensional response surfaces.

We discuss several criteria for evaluating designs, and provide guidance on selecting designs suitable
Cioppa, T. M., T. W. Lucas. 2005. Efficient nearly orthogonal and Jin, R., W. Chen, T. Simpson. 2001. Comparative studies of meta-
space-filling Latin hypercubes. Working paper, Department of modeling techniques under multiple modeling criteria. J. Struc-
Operations Research, Naval Postgraduate School, Monterey, tural Optim. 23 1–13.
CA. Jin, R., W. Chen, A. Sudjianto. 2002. On sequential sampling for
Clarke, S. M., J. H. Griebsch, T. W. Simpson. 2005. Analysis of sup- global metamodeling in engineering design. Proc. DETC ’02,
port vector regression for approximation of complex engineer- ASME 2002 Design Engrg. Tech. Conf. Comput. Inform. Engrg.
ing analyses. ASME J. Mech. Design 127. Conf. DETC2002/DAC-34092, Montreal, Canada, 1–10.
Crary Group. 2004. WebDOE. [Link] Kelton, W. D., R. M. Barton. 2003. Experimental design for simula-
Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised ed. Wiley, tion. S. Chick, P. J. Sánchez, D. Ferrin, D. J. Morrice, eds. Proc.
New York. 2003 Winter Simulation Conf. Institute of Electrical and Electron-
Dean, A. M., S. M. Lewis. 2004. Screening. Springer-Verlag, New ics Engineers, Piscataway, NJ, 59–65.
York. Khuri, A. I. 1996. Multiresponse surface methodology. S. Ghosh,
Dewar, J. A., S. C. Bankes, J. S. Hodges, T. W. Lucas, D. K. Saunders- C. R. Rao, eds. Handbook of Statistics, Volume 13. Elsevier,
Newton, P. Vye. 1996. Credible uses of the distributed inter- Amsterdam, The Netherlands.
active simulation (DIS) system. MR-607-A, RAND Corpora- Kiefer, J. 1959. Optimal experimental designs (with comments).
tion, Santa Monica, CA, [Link] J. Roy. Statist. Soc. Ser. B 21 272–319.
MR/[Link]. Kleijnen, J. P. C. 1974–1975. Statistical Techniques in Simulation, Vol-
Donohue, J. M., E. C. Houck, R. H. Myers. 1993. Simulation designs umes I, II. Marcel Dekker, Inc., New York.
and correlation induction for reducing second-order bias in Kleijnen, J. P. C. 1987. Statistical Tools for Simulation Practitioners.
first-order response surfaces. Oper. Res. 41 880–902. Marcel Dekker, Inc., New York.
Fedorov, V. V. 1972. Theory of Optimal Experiments. Academic Press, Kleijnen, J. P. C. 1995. Case study: Statistical validation of simula-
New York. tion models. Eur. J. Oper. Res. 87 21–34.
Fu, M. C. 2002. Optimization for simulation: Theory vs. practice. Kleijnen, J. P. C. 1998. Design for sensitivity analysis, optimization,
INFORMS J. Comput. 14 192–215. and validation of simulation models. J. Banks, ed. Handbook
Gentle, J. E. 2002. Computational Statistics. Springer, New York. of Simulation: Principles, Methodology, Advances, Applications, and
Gill, A., D. Grieger. 2003. Comparison of agent based distillation Practice. Wiley, New York, 173–223.
movement algorithms. Military Oper. Res. 8 5–16. Kleijnen, J. P. C. 2005. An overview of the design and analysis
Giunta, A. A., L. T. Watson. 1998. A comparison of approximating of simulation experiments for sensitivity analysis. Eur. J. Oper.
modeling techniques: Polynomial versus interpolating models. Res. 164 287–300.
7th AIAA/USAF/NASA/ISSMO Sympos. Multidisciplinary Anal. Kleijnen, J. P. C., E. Gaury. 2003. Short-term robustness of produc-
Optim. St. Louis, MO, AIAA 98–4758, 381–391. tion management systems: A case study. Eur. J. Oper. Res. 148
Goldsman, D., S.-H. Kim, W. S. Marshall, B. L. Nelson. 2002. Rank- 452–465.
ing and selection for steady-state simulation: Procedures and Kleijnen, J. P. C., O. Pala. 1999. Maximizing the simulation output:
perspectives. INFORMS J. Comput. 14 2–19. A competition. Simulation 73 168–173.
Hayter, A. J. 1984. A proof of the conjecture that the Tukey-Kramer Kleijnen, J. P. C., R. G. Sargent. 2000. A methodology for the fitting
multiple comparisons procedure is conservative. Ann. Math. and validation of metamodels in simulation. Eur. J. Oper. Res.
Statist. 12 61–75. 120 14–29.
Helton, J. C., M. G. Marietta, eds. 2000. Special issue: 1996 perfor- Kleijnen, J. P. C., M. T. Smits. 2003. Performance metrics in supply
mance assessment for the waste isolation pilot plant. Reliability chain management. J. Oper. Res. Soc. 54 507–514.
Engrg. Systems Safety 69 1–454. Kleijnen, J. P. C., W. van Groenendaal. 1992. Simulation: A Statistical
Hodges, J. S. 1991. Six (or so) things you can do with a bad model. Perspective. Wiley, Chichester, UK.
Oper. Res. 39 355–365. Kleijnen, J. P. C., W. C. M. van Beers. 2004. Application-driven
Hodges, J. S., J. A. Dewar. 1992. Is it you or your model talking? sequential designs for simulation experiments: Kriging meta-
A framework for model validation. Report R-4114-AF/A/OSD, modeling. J. Oper. Res. Soc. 55 876–883.
RAND Corporation, Santa Monica, CA.
Kleijnen, J. P. C., W. C. M. van Beers. 2005. Robustness of Kriging
Holcomb, D., D. C. Montgomery, W. M. Carlyle. 2000a. Analysis of when interpolating in random simulation with heterogeneous
supersaturated designs. J. Quality Tech. 35 13–27. variances: Some experiments. Eur. J. Oper. Res. 165 826–834.
Holcomb, D., D. C. Montgomery, W. M. Carlyle. 2000b. Some Kleijnen, J. P. C., B. Bettonvil, F. Persson. 2003. Robust solutions
combinatorial aspects, construction methods, and evaluation for supply chain management: Simulation and risk analysis
criteria for supersaturated designs. Quality Reliability Engrg. of the Ericsson case study. Working paper, Tilburg University,
Internat. 18 299–304. Tilburg, The Netherlands.
Horne, G., S. Johnson, eds. 2002. Maneuver Warfare Science 2002. Kleijnen, J. P. C., B. Bettonvil, F. Persson. 2005. Finding the impor-
USMC Project Albert, Quantico, VA. tant factors in large discrete-event simulation: Sequential bifur-
Horne, G., S. Johnson, eds. 2003. Maneuver Warfare Science 2003. cation and its applications. A. M. Dean, S. M. Lewis, eds.
USMC Project Albert, Quantico, VA. Screening. Springer-Verlag, New York.
Horne, G., M. Leonardi, eds. 2001. Maneuver Warfare Science 2001. Kleijnen, J. P. C., R. C. H. Cheng, V. B. Melas. 2000. Optimal design
Marine Corps Combat Development Command, Defense Auto- of experiments with simulation models of nearly saturated
mated Printing Service, Quantico, VA. queues. J. Statist. Planning Inference 85 19–26.
Hsu, J. C. 1996. Multiple Comparisons; Theory and Methods. Chapman Kleijnen, J. P. C., A. J. Feelders, R. C. H. Cheng. 1998. Bootstrapping
& Hall, London. and validation of metamodels in simulation. D. J. Medeiros,
Irizarry, M. de los A., M. E. Kuhl, E. K. Lada, S. Subramanian, E. F. Watson, J. S. Carson, M. S. Manivannan, eds. Proc. 1998
J. R. Wilson. 2003. Analyzing transformation-based simulation Winter Simulation Conf. Institute of Electrical and Electronics
metamodels. IIE Trans. 35 271–283. Engineers, Piscataway, NJ, 701–706.
Jacobson, S., A. Buss, L. Schruben. 1991. Driving frequency selec- Kleijnen, J. P. C., A. J. van den Burg, R. T. H. van der Ham. 1979.
tion for frequency domain simulation experiments. Oper. Res. Generalization of simulation results: Practicality of statistical
39 917–924. methods. Eur. J. Oper. Res. 3 50–64.
Kleijnen et al.: A User’s Guide to the Brave New World of Designing Simulation Experiments
288 INFORMS Journal on Computing 17(3), pp. 263–289, © 2005 INFORMS
Kleijnen, J. P. C., A. Vonk Noordegraaf, M. Nielen. 2001. Sensitivity analysis of censored output through polynomial, logistic and tobit models: Theory and case study. B. A. Peters, J. S. Smith, D. J. Medeiros, M. W. Rohrer, eds. Proc. 2001 Winter Simulation Conf. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 486–491.
Koehler, J. R., A. B. Owen. 1996. Computer experiments. S. Ghosh, C. R. Rao, eds. Handbook of Statistics, Volume 13. Elsevier, Amsterdam, The Netherlands, 261–308.
Kutner, M. H., C. J. Nachtsheim, J. Neter, W. Li. 2004. Applied Linear Statistical Models, 5th ed. McGraw-Hill/Irwin, Boston, MA.
Lauren, M. K., R. T. Stephen. 2002. Map-aware non-uniform automata—A New Zealand approach to scenario modelling. J. Battlefield Tech. 5 27–31.
Law, A. M., W. D. Kelton. 2000. Simulation Modeling and Analysis, 3rd ed. McGraw-Hill, New York.
Lin, D. K. J. 1995. Generating systematic supersaturated designs. Technometrics 37 213–225.
Lophaven, S. N., H. B. Nielsen, J. Sondergaard. 2002. DACE: A Matlab Kriging toolbox, version [Link] Technical University of Denmark, Lyngby, Denmark, [Link]∼hbn/dace/.
Lucas, T. W., S. C. Bankes, P. Vye. 1997. Improving the analytic contribution of advanced warfighting experiments (AWEs). Documented Briefing DB-207-A. RAND Corporation, Santa Monica, CA.
Lucas, T. W., S. M. Sanchez, L. Brown, W. Vinyard. 2002. Better designs for high-dimensional explorations of distillations. G. Horne, S. Johnson, eds. Maneuver Warfare Science 2002. USMC Project Albert, Quantico, VA, 17–46.
Lucas, T. W., S. M. Sanchez, T. M. Cioppa, A. I. Ipekci. 2003. Generating hypotheses on fighting the global war on terrorism. G. Horne, S. Johnson, eds. Maneuver Warfare Science 2003. USMC Project Albert, Quantico, VA, 117–137.
Martinez, W. L., A. R. Martinez. 2002. Computational Statistics Handbook with MATLAB. Chapman & Hall/CRC, Boca Raton, FL.
McKay, M. D., R. J. Beckman, W. J. Conover. 1979. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21 239–245.
Meyer, T., S. Johnson. 2001. Visualization for data farming: A survey of methods. G. Horne, M. Leonardi, eds. Maneuver Warfare Science 2001. Marine Corps Combat Development Command, Quantico, VA, 15–30.
Moeeni, F., S. M. Sanchez, A. J. Vakharia. 1997. A robust design methodology for Kanban system design. Internat. J. Production Res. 35 2821–2838.
Montgomery, D. C. 2000. Design and Analysis of Experiments, 5th ed. Wiley, New York.
Morrice, D. J., I. R. Bardhan. 1995. A weighted least squares approach to computer simulation factor screening. Oper. Res. 43 792–806.
Morris, M. D., T. J. Mitchell. 1995. Exploratory designs for computational experiments. J. Statist. Planning Inference 43 381–402.
Myers, R. H., D. C. Montgomery. 2002. Response Surface Methodology: Process and Product Optimization Using Designed Experiments, 2nd ed. Wiley, New York.
Myers, R. H., A. I. Khuri, G. Vining. 1992. Response surface alternatives to the Taguchi robust design parameter approach. Amer. Statist. 46 131–139.
Nakayama, M. 2003. Analysis of simulation output. S. Chick, P. J. Sánchez, D. Ferrin, D. J. Morrice, eds. Proc. 2003 Winter Simulation Conf. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 49–58.
Nelson, B. L., D. Goldsman. 2001. Comparisons with a standard in simulation experiments. Management Sci. 47 449–463.
Pacheco, J., C. Amon, S. Finger. 2003. Incorporating information from replications into Bayesian surrogate models. 2003 ASME Design Engrg. Tech. Conf. DETC2003/DTM-48644, Chicago, IL.
Pukelsheim, F. 1993. Optimal Design of Experiments. Wiley, New York.
Ramberg, J. S., S. M. Sanchez, P. J. Sanchez, L. J. Hollick. 1991. Designing simulation experiments: Taguchi methods and response surface metamodels. B. L. Nelson, W. D. Kelton, G. M. Clark, eds. Proc. 1991 Winter Simulation Conf. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 167–176.
Rao, C. R. 1967. Least squares theory using an estimated dispersion matrix and its application to measurement of signals. Proc. 5th Berkeley Sympos. Math. Statist. Probab. I. University of California Press, Berkeley, CA, 355–372.
Rechtschaffner, R. L. 1967. Saturated fractions of 2^n and 3^n factorial designs. Technometrics 9 569–575.
Sacks, J., W. J. Welch, T. J. Mitchell, H. P. Wynn. 1989. Design and analysis of computer experiments (includes Comments and Rejoinder). Statist. Sci. 4 409–435.
Saltelli, A., S. Tarantola, P. S. Chan. 1999. A quantitative model-independent method for global sensitivity analysis of model output. Technometrics 41 39–56.
Sanchez, P. J., A. H. Buss. 1987. A model for frequency domain experiments. A. Thesen, H. Grant, W. D. Kelton, eds. Proc. 1987 Winter Simulation Conf. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 424–427.
Sanchez, S. M. 2000. Robust design: Seeking the best of all possible worlds. J. A. Joines, R. R. Barton, K. Kang, P. A. Fishwick, eds. Proc. 2000 Winter Simulation Conf. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 69–76.
Sanchez, S. M., T. W. Lucas. 2002. Exploring the world of agent-based simulations: Simple models, complex analyses. E. Yücesan, C.-H. Chen, J. L. Snowdon, J. Charnes, eds. Proc. 2002 Winter Simulation Conf. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 116–126.
Sanchez, S. M., H.-F. Wu. 2003. Frequency-based designs for terminating simulations: A peace-enforcement application. S. Chick, P. J. Sánchez, D. Ferrin, D. J. Morrice, eds. Proc. 2003 Winter Simulation Conf. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 952–959.
Sanchez, S. M., P. J. Sanchez, J. S. Ramberg. 1998. A simulation framework for robust system design. B. Wang, ed. Concurrent Design of Products, Manufacturing Processes and Systems. Gordon and Breach, New York, 279–314.
Sanchez, S. M., L. D. Smith, E. C. Lawrence. 2005. Tolerance design revisited: Assessing the impact of correlated noise factors. Working paper, Operations Research Department, Naval Postgraduate School, Monterey, CA.
Sanchez, S. M., P. J. Sanchez, J. S. Ramberg, F. Moeeni. 1996. Effective engineering design through simulation. Internat. Trans. Oper. Res. 3 169–185.
Santner, T. J., B. J. Williams, W. I. Notz. 2003. The Design and Analysis of Computer Experiments. Springer-Verlag, New York.
SAS Institute, Inc. 2004. JMP IN® Version 5.1 for Windows, Macintosh, and Unix. Duxbury Thompson Learning, Pacific Grove, CA.
Sasena, M. J., P. Y. Papalambros, P. Goovaerts. 2002. Exploration of metamodeling sampling criteria for constrained global optimization. Engrg. Optim. 34 263–278.
Schmeiser, B. W. 1982. Batch size effects in the analysis of simulation output. Oper. Res. 30 556–568.
Schruben, L. W., V. J. Cogliano. 1987. An experimental procedure for simulation response surface model identification. Comm. ACM 30 716–730.
Shoemaker, A. C., K.-L. Tsui, C. F. J. Wu. 1991. Economical experimentation methods for robust design. Technometrics 33 415–427.
Simon, H. A. 1981. The Sciences of the Artificial, 2nd ed. MIT Press, Cambridge, MA.
Simpson, T. W., D. K. J. Lin, W. Chen. 2001. Sampling strategies for computer experiments: Design and analysis. Internat. J. Reliability Appl. 2 209–240.
Smith, L. D., S. M. Sanchez. 2003. Assessment of business potential at retail sites: Empirical findings from a U.S. supermarket chain. Internat. Rev. Retail, Distribution Consumer Res. 13 37–58.
Spall, J. C. 2003. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, New York.
Steiger, N. M., E. K. Lada, J. R. Wilson, J. A. Joines, C. Alexopoulos, D. Goldsman. 2005. ASAP3: A batch means procedure for steady-state simulation output analysis. ACM Trans. Model. Comput. Simulation 15 39–73.
Sugiyama, S. O., J. W. Chow. 1997. @Risk, riskview and bestFit. OR/MS Today 24 64–66.
Taguchi, G. 1987. System of Experimental Designs, Vols. 1, 2. UNIPUB/Krauss International, White Plains, NY.
Trocine, L., L. C. Malone. 2001. An overview of newer, advanced screening methods for the initial phase in an experimental design. B. A. Peters, J. S. Smith, D. J. Medeiros, M. W. Rohrer, eds. Proc. 2001 Winter Simulation Conf. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 169–178.
Tufte, E. R. 1990. Envisioning Information. Graphics Press, Cheshire, CT.
Tunali, S., I. Batmaz. 2000. Dealing with the least squares regression assumptions in simulation metamodeling. Comput. Indust. Engrg. 38 307–320.
van Beers, W. C. M., J. P. C. Kleijnen. 2003. Kriging for interpolation in random simulation. J. Oper. Res. Soc. 54 255–262.
van Beers, W. C. M., J. P. C. Kleijnen. 2004. Kriging interpolation in simulation: A survey. R. G. Ingalls, M. D. Rossetti, J. S. Smith, B. A. Peters, eds. Proc. 2004 Winter Simulation Conf. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 113–121.
Vinyard, W., T. W. Lucas. 2002. Exploring combat models for non-monotonicities and remedies. PHALANX 35 19, 36–38.
Vonk Noordegraaf, A., M. Nielen, J. P. C. Kleijnen. 2003. Sensitivity analysis by experimental design and metamodelling: Case study on simulation in national animal disease control. Eur. J. Oper. Res. 146 433–443.
Wan, H., B. Ankenman, B. L. Nelson. 2003. Controlled sequential bifurcation: A new factor-screening method for discrete-event simulation. S. Chick, P. J. Sánchez, D. Ferrin, D. J. Morrice, eds. Proc. 2003 Winter Simulation Conf. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 565–573.
Wan, S. C. 2002. An exploratory analysis on the effects of human factors on combat outcomes. M.S. thesis, Operations Research Department, Naval Postgraduate School, Monterey, CA, [Link]
Watson, A. G., R. J. Barnes. 1995. Infill sampling criteria to locate extremes. Math. Geology 27 589–608.
Wegman, E. M. 1990. Hyperdimensional data analysis using parallel coordinates. J. Amer. Statist. Association 85 664–674.
Wolf, E. S. 2003. Using agent-based distillations to explore logistics support to urban, humanitarian assistance/disaster relief operations. M.S. thesis, Operations Research Department, Naval Postgraduate School, Monterey, CA.
Wolf, E. S., S. M. Sanchez, N. Goerger, L. Brown. 2003. Using agents to model logistics. Working paper, Operations Research Department, Naval Postgraduate School, Monterey, CA.
Ye, K. Q. 1998. Orthogonal column Latin hypercubes and their application in computer experiments. J. Amer. Statist. Association 93 1430–1439, [Link]
Zeigler, B. P., K. Praehofer, T. G. Kim. 2000. Theory of Modeling and Simulation, 2nd ed. Academic Press, San Diego, CA.