Software Testing and Analysis: Process, Principles, and Techniques
by Mauro Pezzè and Michal Young
John Wiley & Sons 2008 (510 pages)
ISBN:9780471455936
Providing students and professionals with strategies for reliable,
cost-effective software development, this guide covers a full
spectrum of topics from basic principles and underlying theory to
organizational and process issues in real-world application.
Table of Contents
Software Testing and Analysis-Process, Principles, and Techniques
Preface
Part I - Fundamentals of Test and Analysis
Chapter 1 - Software Test and Analysis in a Nutshell
Chapter 2 - A Framework for Test and Analysis
Chapter 3 - Basic Principles
Chapter 4 - Test and Analysis Activities Within a Software Process
Part II - Basic Techniques
Chapter 5 - Finite Models
Chapter 6 - Dependence and Data Flow Models
Chapter 7 - Symbolic Execution and Proof of Properties
Chapter 8 - Finite State Verification
Part III - Problems and Methods
Chapter 9 - Test Case Selection and Adequacy
Chapter 10 - Functional Testing
Chapter 11 - Combinatorial Testing
Chapter 12 - Structural Testing
Chapter 13 - Data Flow Testing
Chapter 14 - Model-Based Testing
Chapter 15 - Testing Object-Oriented Software
Chapter 16 - Fault-Based Testing
Chapter 17 - Test Execution
Chapter 18 - Inspection
Chapter 19 - Program Analysis
Part IV - Process
Chapter 20 - Planning and Monitoring the Process
Chapter 21 - Integration and Component-based Software Testing
Chapter 22 - System, Acceptance, and Regression Testing
Chapter 23 - Automating Analysis and Test
Chapter 24 - Documenting Analysis and Test
Bibliography
Index
List of Figures
List of Tables
List of Sidebars
Back Cover
You can't test quality into a software product, but neither can you
build a quality software product without test and analysis. Software
test and analysis is increasingly recognized, in research and in
industrial practice, as a core challenge in software engineering and
computer science. Software Testing and Analysis: Process,
Principles, and Techniques is the first book to present a range of
complementary software test and analysis techniques in an
integrated, coherent fashion. It covers a full spectrum of topics from
basic principles and underlying theory to organizational and process
issues in real-world application. The emphasis throughout is on
selecting a complementary set of practical techniques to achieve an
acceptable level of quality at an acceptable cost.
Highlights of the book include
Interplay among technical and non-technical issues in crafting
an approach to software quality, with chapters devoted to
planning and monitoring the software quality process.
A selection of practical techniques ranging from inspection to
automated program and design analyses to unit, integration,
system, and regression testing, with technical material set in
the context of real-world problems and constraints in software
development.
A coherent view of the state of the art and practice, with
technical and organizational approaches to push the state of
practice toward the state of the art.
Throughout, the text covers techniques that are suitable for near-term application, with sufficient technical background to help you
know how and when to apply them. Exercises reinforce the
instruction and ensure that you master each topic before
proceeding.
By incorporating software testing and analysis techniques into
modern practice, Software Testing and Analysis: Process, Principles,
and Techniques provides students and professionals with strategies
for reliable, cost-effective software development.
Preface
Overview
This book addresses software test and analysis in the context of an
overall effort to achieve quality. It is designed for use as a primary
textbook for a course in software test and analysis or as a
supplementary text in a software engineering course, and as a
resource for software developers.
The main characteristics of this book are:
It assumes that the reader's goal is to achieve a suitable balance of cost, schedule,
and quality. It is not oriented toward critical systems for which ultra-high reliability
must be obtained regardless of cost, nor will it be helpful if one's aim is to cut cost or
schedule regardless of consequence.
It presents a selection of techniques suitable for near-term application, with sufficient
technical background to understand their domain of applicability and to consider
variations to suit technical and organizational constraints. Techniques of only
historical interest and techniques that are unlikely to be practical in the near future
are omitted.
It promotes a vision of software testing and analysis as integral to modern software
engineering practice, equally as important and technically demanding as other
aspects of development. This vision is generally consistent with current thinking on
the subject, and is approached by some leading organizations, but is not universal.
It treats software testing and static analysis techniques together in a coherent
framework, as complementary approaches for achieving adequate quality at
acceptable cost.
Chapter List
Chapter 1: Software Test and Analysis in a Nutshell
Chapter 2: A Framework for Test and Analysis
Chapter 3: Basic Principles
Chapter 4: Test and Analysis Activities Within a Software Process
specialized tests and analyses designed particularly for the case at hand.
Software is among the most variable and complex of artifacts engineered on a regular
basis. Quality requirements of software used in one environment may be quite different
and incompatible with quality requirements of a different environment or application
domain, and its structure evolves and often deteriorates as the software system grows.
Moreover, the inherent nonlinearity of software systems and the uneven distribution
of faults complicate verification. If an elevator can safely carry a load of 1000 kg, it can also
safely carry any smaller load, but if a procedure correctly sorts a set of 256 elements, it
may fail on a set of 255 or 53 or 12 elements, as well as on 257 or 1023.
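The sorting half of this contrast can be made concrete. The routine below is a hypothetical illustration (not from the text): a bubble sort whose inner loop stops one comparison early, so it happens to sort any array whose largest element is already last, yet silently fails on other arrays of the same size. Success on one input says nothing about neighboring inputs.

```java
import java.util.Arrays;

// Hypothetical buggy sort illustrating nonlinearity: the inner loop bound
// should be a.length - 1 - i, so the last element is never compared. The
// routine is in effect a correct bubble sort of the first n-1 elements,
// and "works" exactly when the maximum is already in the last position.
public class NonlinearFailure {
    static void brokenSort(int[] a) {
        for (int i = 0; i < a.length - 1; i++)
            for (int j = 0; j < a.length - 2 - i; j++)  // bug: off by one
                if (a[j] > a[j + 1]) {
                    int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                }
    }

    public static void main(String[] args) {
        int[] lucky = {2, 1, 3};    // maximum already last: appears correct
        int[] unlucky = {3, 1, 2};  // same size and elements: silently wrong
        brokenSort(lucky);
        brokenSort(unlucky);
        System.out.println(Arrays.toString(lucky));    // [1, 2, 3]
        System.out.println(Arrays.toString(unlucky));  // [1, 3, 2]
    }
}
```

A test suite that happens to draw inputs whose maximum is last would pass every case, at every size.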
The cost of software verification often exceeds half the overall cost of software
development and maintenance. Advanced development technologies and powerful
supporting tools can reduce the frequency of some classes of errors, but we are far
from eliminating errors and producing fault-free software. In many cases new
development approaches introduce new subtle kinds of faults, which may be more
difficult to reveal and remove than classic faults. This is the case, for example, with
distributed software, which can present problems of deadlock or race conditions that are
not present in sequential programs. Likewise, object-oriented development introduces
new problems due to the use of polymorphism, dynamic binding, and private state that
are absent or less pronounced in procedural software.
The variety of problems and the richness of approaches make it challenging to choose
and schedule the right blend of techniques to reach the required level of quality within
cost constraints. There are no fixed recipes for attacking the problem of verifying a
software product. Even the most experienced specialists do not have pre-cooked
solutions, but need to design a solution that suits the problem, the requirements, and the
development environment.
If less than 90% of all statements are executed by the functional tests, this is taken as an indication
that either the interface specifications are incomplete (if the missing coverage
corresponds to visible differences in behavior), or else additional implementation
complexity hides behind the interface. Either way, additional test cases are devised
based on a more complete description of unit behavior.
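The threshold rule described here is simple enough to sketch (class and method names are invented for illustration; the text does not prescribe an implementation):

```java
// Sketch of the coverage trigger described above: if functional tests
// execute fewer than 90% of a unit's statements, flag the unit for
// additional test cases based on a fuller description of its behavior.
public class CoverageTrigger {
    static final double THRESHOLD = 0.90;  // threshold taken from the text

    static boolean needsMoreTests(int statementsExecuted, int totalStatements) {
        return (double) statementsExecuted / totalStatements < THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(needsMoreTests(85, 100));  // true
        System.out.println(needsMoreTests(95, 100));  // false
    }
}
```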
Integration and system tests are generated by the quality team, working from a catalog
of patterns and corresponding tests. The behavior of some subsystems or components
is modeled as finite state machines, so the quality team creates test suites that exercise
program paths corresponding to each state transition in the models.
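Deriving one test per state transition can be sketched as follows. The model format and all names are invented for illustration: an FSM is a map from state to (event → next state), and each test case is a shortest event sequence reaching a transition's source state, followed by that transition's own event.

```java
import java.util.*;

// Illustrative sketch of transition coverage for an FSM model: one test
// case (event sequence) per transition in the model.
public class FsmTransitionTests {
    static List<List<String>> transitionCoverage(
            Map<String, Map<String, String>> fsm, String start) {
        // Breadth-first search: shortest event sequence reaching each state.
        Map<String, List<String>> prefix = new HashMap<>();
        prefix.put(start, new ArrayList<>());
        Deque<String> work = new ArrayDeque<>();
        work.add(start);
        while (!work.isEmpty()) {
            String s = work.removeFirst();
            for (Map.Entry<String, String> t
                     : fsm.getOrDefault(s, Collections.emptyMap()).entrySet()) {
                if (!prefix.containsKey(t.getValue())) {
                    List<String> p = new ArrayList<>(prefix.get(s));
                    p.add(t.getKey());
                    prefix.put(t.getValue(), p);
                    work.addLast(t.getValue());
                }
            }
        }
        // One test case per transition of a reachable source state.
        List<List<String>> tests = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> s : fsm.entrySet()) {
            if (!prefix.containsKey(s.getKey())) continue;  // unreachable state
            for (String event : s.getValue().keySet()) {
                List<String> test = new ArrayList<>(prefix.get(s.getKey()));
                test.add(event);
                tests.add(test);
            }
        }
        return tests;
    }
}
```

For a three-transition model such as idle -add-> cart, cart -add-> cart, cart -pay-> done, this yields three test sequences, one exercising each transition.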
Scaffolding and oracles for integration testing are part of the overall system architecture.
Oracles for individual components and units are designed and implemented by
programmers using tools for annotating code with conditions and invariants. The
Chipmunk developers use a home-grown test organizer tool to bind scaffolding to code,
schedule test runs, track faults, and organize and update regression test suites.
The quality plan includes analysis and test activities for several properties distinct from
functional correctness, including performance, usability, and security. Although these are
an integral part of the quality plan, their design and execution are delegated in part or
whole to experts who may reside elsewhere in the organization. For example, Chipmunk
maintains a small team of human factors experts in its software division. The human
factors team will produce look-and-feel guidelines for the Web purchasing system, which
together with a larger body of Chipmunk interface design rules can be checked during
inspection and test. The human factors team also produces and executes a usability
testing plan.
Parts of the portfolio of verification and validation activities selected by Chipmunk are
illustrated in Figure 1.1. The quality of the final product and the costs of the quality
assurance activities depend on the choice of the techniques to accomplish each activity.
Most important is to construct a coherent plan that can be monitored. In addition to
monitoring schedule progress against the plan, Chipmunk records faults found during
each activity, using this as an indicator of potential trouble spots. For example, if the
number of faults found in a component during design inspections is high, additional
dynamic test time will be planned for that component.
Figure 1.1: Main analysis and testing activities through the software life
cycle.
in its organization, it is fruitful to begin measuring reliability when debug testing is yielding
less than one fault ("bug") per day of tester time. For some application domains, Chipmunk
has gathered a large amount of historical usage data from which to define an operational
profile, and these profiles can be used to generate large, statistically valid sets of randomly
generated tests. If the sample thus tested is a valid model of actual executions, then
projecting actual reliability from the failure rate of test cases is elementary. Unfortunately, in
many cases such an operational profile is not available.
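When a valid profile is available, the projection really is elementary. A sketch (names hypothetical), treating each randomly generated test as one demand drawn from the operational profile:

```java
// Sketch: estimate per-demand reliability from a statistically valid
// random sample. The estimate is meaningful only if the sample mirrors
// actual usage; without a valid operational profile it says little.
public class ReliabilityEstimate {
    static double estimate(int testsRun, int failures) {
        if (testsRun <= 0 || failures < 0 || failures > testsRun)
            throw new IllegalArgumentException("invalid counts");
        return 1.0 - (double) failures / testsRun;
    }

    public static void main(String[] args) {
        // 3 failures observed in 10,000 profile-driven tests
        System.out.println(estimate(10_000, 3));  // approximately 0.9997
    }
}
```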
Chipmunk has an idea of how the Web sales facility will be used, but it cannot construct and
validate a model with sufficient detail to obtain reliability estimates from a randomly
generated test suite. They decide, therefore, to use the second major approach to verifying
reliability, using a sample of real users. This is commonly known as alpha testing if the tests
are performed by users in a controlled environment, observed by the development
organization. If the tests consist of real users in their own environment, performing actual
tasks without interference or close monitoring, it is known as beta testing. The Chipmunk
team plans a very small alpha test, followed by a longer beta test period in which the
software is made available only in retail outlets. To accelerate reliability measurement after
subsequent revisions of the system, the beta test version will be extensively instrumented,
capturing many properties of a usage profile.
Summary
The quality process has three distinct goals: improving a software product (by
preventing, detecting, and removing faults), assessing the quality of the software
product (with respect to explicit quality goals), and improving the long-term quality and
cost-effectiveness of the quality process itself. Each goal requires weaving quality
assurance and improvement activities into an overall development process, from product
inception through deployment, evolution, and retirement.
Each organization must devise, evaluate, and refine an approach suited to that
organization and application domain. A well-designed approach will invariably combine
several test and analysis techniques, spread across stages of development. An array of
fault detection techniques is distributed across development stages so that faults are
removed as soon as possible. The overall cost and cost-effectiveness of techniques
depends to a large degree on the extent to which they can be incrementally re-applied
as the product evolves.
Further Reading
This book deals primarily with software analysis and testing to improve and assess the
dependability of software. That is not because qualities other than dependability are
unimportant, but rather because they require their own specialized approaches and
techniques. We offer here a few starting points for considering some other important
properties that interact with dependability. Norman's The Design of Everyday Things
[Nor90] is a classic introduction to design for usability, with basic principles that apply to
both hardware and software artifacts. A primary reference on usability for interactive
computer software, and particularly for Web applications, is Nielsen's Designing Web
Usability [Nie00]. Bishop's text Computer Security: Art and Science [Bis02] is a good
introduction to security issues. The most comprehensive introduction to software safety
is Leveson's Safeware [Lev95].
Exercises
1.1 Philip has studied "just-in-time" industrial production methods and is convinced that
they should be applied to every aspect of software development. He argues that
test case design should be performed just before the first opportunity to execute
the newly designed test cases, never earlier. What positive and negative
consequences do you foresee for this just-in-time test case design approach?
1.2 A newly hired project manager at Chipmunk questions why the quality manager is
involved in the feasibility study phase of the project, rather than joining the team
only when the project has been approved, as at the new manager's previous
company. What argument(s) might the quality manager offer in favor of her
involvement in the feasibility study?
1.3 Chipmunk procedures call for peer review not only of each source code module,
but also of test cases and scaffolding for testing that module. Anita argues that
inspecting test suites is a waste of time; any time spent on inspecting a test case
designed to detect a particular class of fault could more effectively be spent
inspecting the source code to detect that class of fault. Anita's project manager, on
the other hand, argues that inspecting test cases and scaffolding can be
cost-effective when considered over the whole lifetime of a software product. What
argument(s) might Anita's manager offer in favor of this conclusion?
1.4 The spiral model of software development prescribes sequencing incremental
prototyping phases for risk reduction, beginning with the most important project
risks. Architectural design for testability involves, in addition to defining testable
interface specifications for each major module, establishing a build order that
supports thorough testing after each stage of construction. How might spiral
development and design for test be complementary or in conflict?
1.5 You manage an online service that sells downloadable video recordings of classic
movies. A typical download takes one hour, and an interrupted download must be
restarted from the beginning. The number of customers engaged in a download at
any given time ranges from about 10 to about 150 during peak hours. On average,
your system goes down (dropping all connections) about two times per week, for
an average of three minutes each time. If you can double availability or double
mean time between failures, but not both, which will you choose? Why?
1.6 Having no a priori operational profile for reliability measurement, Chipmunk will
depend on alpha and beta testing to assess the readiness of its online purchase
functionality for public release. Beta testing will be carried out in retail outlets, by
retail store personnel, and then by customers with retail store personnel looking on.
How might this beta testing still be misleading with respect to reliability of the
software as it will be used at home and work by actual customers? What might
Chipmunk do to ameliorate potential problems from this reliability misestimation?
1.7 The junior test designers of Chipmunk Computers are annoyed by the procedures
for storing test cases together with scaffolding, test results, and related
documentation. They blame the extra effort needed to produce and store such data
for delays in test design and execution. They argue for reducing the data to store
to the minimum required for reexecuting test cases, eliminating details of test
documentation, and limiting test results to the information needed for generating
oracles. What argument(s) might the quality manager use to convince the junior
test designers of the usefulness of storing all this information?
Figure 2.1: Validation activities check work products against actual user
requirements, while verification activities check consistency of work
products.
Validation activities refer primarily to the overall system specification and the final code.
With respect to overall system specification, validation checks for discrepancies
between actual needs and the system specification as laid out by the analysts, to ensure
that the specification is an adequate guide to building a product that will fulfill its goals.
With respect to final code, validation aims at checking discrepancies between actual
need and the final product, to reveal possible failures of the development process and to
make sure the product meets end-user expectations. Validation checks between the
specification and final product are primarily checks of decisions that were left open in the
specification (e.g., details of the user interface or product features). Chapter 4 provides
a more thorough discussion of validation and verification activities in particular software
process models.
We have omitted one important set of verification checks from Figure 2.1 to avoid
clutter. In addition to checks that compare two or more artifacts, verification includes
checks for self-consistency and well-formedness. For example, while we cannot judge
that a program is "correct" except in reference to a specification of what it should do, we
can certainly determine that some programs are "incorrect" because they are ill-formed.
We may likewise determine that a specification itself is ill-formed because it is
inconsistent (requires two properties that cannot both be true) or ambiguous (can be
interpreted to require some property or not), or because it does not satisfy some other
well-formedness constraint that we impose, such as adherence to a standard imposed
by a regulatory agency.
Validation against actual requirements necessarily involves human judgment and the
potential for ambiguity, misunderstanding, and disagreement. In contrast, a specification
should be sufficiently precise and unambiguous that there can be no disagreement about
whether a particular system behavior is acceptable. While the term testing is often used
informally both for gauging usefulness and verifying the product, the activities differ in
both goals and approach. Our focus here is primarily on dependability, and thus primarily
on verification rather than validation, although techniques for validation and the relation
between the two are discussed further in Chapter 22.
Dependability properties include correctness, reliability, robustness, and safety.
Correctness is absolute consistency with a specification, always and in all
circumstances. Correctness with respect to nontrivial specifications is almost never
achieved. Reliability is a statistical approximation to correctness, expressed as the
likelihood of correct behavior in expected use. Robustness, unlike correctness and
reliability, weighs properties as more and less critical, and distinguishes which properties
should be maintained even under exceptional circumstances in which full functionality
cannot be maintained. Safety is a kind of robustness in which the critical property to be
maintained is avoidance of particular hazardous behaviors. Dependability properties are
discussed further in Chapter 4.
[1] A part of the diagram is a variant of the well-known "V model" of verification and
validation.
construct a logical proof. How long would this take? If we ignore implementation details
such as the size of the memory holding a program and its data, the answer is "forever."
That is, for most programs, exhaustive testing cannot be completed in any finite amount
of time.
Suppose we do make use of the fact that programs are executed on real machines with
finite representations of memory values. Consider the following trivial Java class:
class Trivial {
    static int sum(int a, int b) { return a + b; }
}
The Java language definition states that the representation of an int is 32 binary digits,
and thus there are only 2^32 × 2^32 = 2^64 ≈ 1.8 × 10^19 different inputs on which the method
Trivial.sum() need be tested to obtain a proof of its correctness. At one nanosecond
(10^-9 seconds) per test case, this will take approximately 1.8 × 10^10 seconds, or
nearly six centuries.
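Working the arithmetic mechanically (a sketch; constants as in the discussion above): 2^64 ≈ 1.8 × 10^19 input pairs at one nanosecond each comes to roughly 585 years. Whatever the exact figure, exhaustive testing of even this trivial method is impractical.

```java
// Back-of-the-envelope check: 2^32 * 2^32 = 2^64 input pairs, one
// nanosecond per test. double arithmetic avoids 64-bit integer overflow.
public class ExhaustiveCost {
    static double yearsToTestAllPairs(double secondsPerTest) {
        double inputs = Math.pow(2, 64);           // about 1.8e19 pairs
        double seconds = inputs * secondsPerTest;  // total testing time
        return seconds / (365.25 * 24 * 3600);     // seconds per year
    }

    public static void main(String[] args) {
        System.out.printf("%.0f years%n", yearsToTestAllPairs(1e-9));
    }
}
```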
A technique for verifying a property can be inaccurate in one of two directions (Figure
2.2). It may be pessimistic, meaning that it is not guaranteed to accept a program even
if the program does possess the property being analyzed, or it can be optimistic if it
may accept some programs that do not possess the property (i.e., it may not detect all
violations). Testing is the classic optimistic technique, because no finite number of tests
can guarantee correctness. Many automated program analysis techniques for properties
of program behaviors[3] are pessimistic with respect to the properties they are designed
to verify. Some analysis techniques may give a third possible answer, "don't know." We
can consider these techniques to be either optimistic or pessimistic depending on how
we interpret the "don't know" result. Perfection is unobtainable, but one can choose
techniques that err in only a particular direction.
    int i, sum;
    int first = 1;
    for (i = 0; i < 10; ++i) {
        if (first) {
            sum = 0; first = 0;
        }
        sum += i;
    }
It is impossible in general to determine whether each control flow path can be executed,
and while a human will quickly recognize that the variable sum is initialized on the first
iteration of the loop, a compiler or other static analysis tool will typically not be able to
rule out an execution in which the initialization is skipped on the first iteration. Java neatly
solves this problem by making code like this illegal; that is, the rule is that a variable
must be initialized on all program control paths, whether or not those paths can ever be
executed.
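The Java rule can be seen in a small artificial analogue of the C fragment above; the commented-out line is the one the compiler's definite-assignment analysis rejects.

```java
// Java's conservative definite-assignment rule applied to the same
// pattern: the compiler cannot rule out a path that skips the conditional
// initialization, so a use of 'sum' with no unconditional prior assignment
// is rejected -- whether or not that path can actually execute.
public class DefiniteAssignment {
    static int sumFirstTen() {
        int sum;                 // no initializer
        int first = 1;
        for (int i = 0; i < 10; ++i) {
            if (first != 0) { sum = 0; first = 0; }
            // sum += i;         // javac rejects this line: "variable sum
            //                   //  might not have been initialized"
        }
        sum = 0;                 // assigning on all paths satisfies the rule,
        for (int i = 0; i < 10; ++i)
            sum += i;            // at the cost of a small rewrite
        return sum;              // 0 + 1 + ... + 9 = 45
    }
}
```

The restriction is easy to obey and trivially checkable, which is exactly the trade the language designers made.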
Software developers are seldom at liberty to design new restrictions into the
programming languages and compilers they use, but the same principle can be applied
through external tools, not only for programs but also for other software artifacts.
Consider, for example, the following condition that we might wish to impose on
requirements documents:
1. Each significant domain term shall appear with a definition in the glossary of
the document.
This property is nearly impossible to check automatically, since determining whether a
particular word or phrase is a "significant domain term" is a matter of human judgment.
Moreover, human inspection of the requirements document to check this requirement will
be extremely tedious and error-prone. What can we do? One approach is to separate
the decision that requires human judgment (identifying words and phrases as
"significant") from the tedious check for presence in the glossary.
1.a Each significant domain term shall be set off in the requirements document by the
use of a standard style term. The default visual representation of the term style is a
single underline in printed documents and purple text in on-line displays.
1.b Each word or phrase in the term style shall appear with a definition in the glossary
of the document.
Property (1a) still requires human judgment, but it is now in a form that is much more
amenable to inspection. Property (1b) can be easily automated in a way that will be
completely precise (except that the task of determining whether definitions appearing in
the glossary are clear and correct must also be left to humans).
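Property (1b) lends itself to a few lines of code. The data format and names below are invented for illustration; extracting the styled terms from the document (the human-judgment step, 1a) is assumed already done.

```java
import java.util.*;

// Automated check for property (1b): every phrase marked with the "term"
// style must appear in the glossary. Judging which terms are significant
// (1a) and whether definitions are clear remain human tasks.
public class GlossaryCheck {
    static List<String> missingDefinitions(List<String> styledTerms,
                                           Set<String> glossaryEntries) {
        List<String> missing = new ArrayList<>();
        for (String term : styledTerms)
            if (!glossaryEntries.contains(term.toLowerCase()))
                missing.add(term);   // flagged for the document's authors
        return missing;
    }

    public static void main(String[] args) {
        List<String> flagged = missingDefinitions(
            List.of("Operational Profile", "Oracle"),
            Set.of("oracle"));
        System.out.println(flagged);  // [Operational Profile]
    }
}
```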
As a second example, consider a Web-based service in which user sessions need not
directly interact, but they do read and modify a shared collection of data on the server.
In this case a critical property is maintaining integrity of the shared data. Testing for this
property is notoriously difficult, because a "race condition" (interference between writing
data in one process and reading or writing related data in another process) may cause
an observable failure only very rarely.
Fortunately, there is a rich body of applicable research results on concurrency control
that can be exploited for this application. It would be foolish to rely primarily on direct
testing for the desired integrity properties. Instead, one would choose a (well-known,
formally verified) concurrency control protocol, such as the two-phase locking protocol,
and rely on some combination of static analysis and program testing to check
conformance to that protocol. Imposing a particular concurrency control protocol
substitutes a much simpler, sufficient property (two-phase locking) for the complex
property of interest (serializability), at some cost in generality; that is, there are
programs that violate two-phase locking and yet, by design or dumb luck, satisfy
serializability of data access.
It is a common practice to further impose a global order on lock accesses, which again
simplifies testing and analysis. Testing would identify execution sequences in which data
is accessed without proper locks, or in which locks are obtained and relinquished in an
order that does not respect the two-phase protocol or the global lock order, even if data
integrity is not violated on that particular execution, because the locking protocol failure
indicates the potential for a dangerous race condition in some other execution that might
occur only rarely or under extreme load.
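What such a check looks for can be sketched as a trace checker (the event format is invented for illustration): any acquire after the first release violates the two-phase rule, and acquires must respect the fixed global order.

```java
import java.util.*;

// Sketch of a two-phase locking / global-order trace checker. A trace is
// a list of {"acquire"|"release", lockName} events from one execution;
// globalOrder fixes the order in which locks may be acquired.
public class LockProtocolCheck {
    static boolean conforms(List<String[]> trace, List<String> globalOrder) {
        boolean shrinking = false;  // has any lock been released yet?
        int highest = -1;           // rank of highest-ordered lock held so far
        for (String[] ev : trace) {
            if (ev[0].equals("acquire")) {
                int rank = globalOrder.indexOf(ev[1]);
                if (shrinking) return false;        // acquire after a release
                if (rank <= highest) return false;  // violates global order
                highest = rank;
            } else {
                shrinking = true;                   // enter shrinking phase
            }
        }
        return true;
    }
}
```

Either violation is reported even on executions where data integrity happens to survive, matching the rationale above: the protocol failure signals a potential race in some other execution.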
With the adoption of coding conventions that make locking and unlocking actions easy to
recognize, it may be possible to rely primarily on flow analysis to determine
conformance with the locking protocol, with the role of dynamic testing reduced to a
"back-up" to raise confidence in the soundness of the static analysis. Note that the
critical decision to impose a particular locking protocol is not a post-hoc decision that
can be made in a testing "phase" at the end of development. Rather, the plan for
verification activities with a suitable balance of cost and assurance is part of system
design.
[3] Why
throughout development, but to the extent possible will also attempt to codify usability
guidelines in a form that permits verification. For example, if the usability group determines
that the fuel gauge should always be visible when the fuel level is below a quarter of a tank,
then this becomes a specified property that is subject to verification. The graphical interface
also poses a challenge in effectively checking output. This must be addressed partly in the
architectural design of the system, which can make automated testing feasible or not
depending on the interfaces between high-level operations (e.g., opening or closing a
window, checking visibility of a window) and low-level graphical operations and
representations.
Summary
Verification activities are comparisons to determine the consistency of two or more
software artifacts, or self-consistency, or consistency with an externally imposed criterion.
Verification is distinct from validation, which is consideration of whether software fulfills its
actual purpose. Software development always includes some validation and some
verification, although different development approaches may differ greatly in their relative
emphasis.
Precise answers to verification questions are sometimes difficult or impossible to obtain, in
theory as well as in practice. Verification is therefore an art of compromise, accepting some
degree of optimistic inaccuracy (as in testing) or pessimistic inaccuracy (as in many static
analysis techniques) or choosing to check a property that is only an approximation of what
we really wish to check. Often the best approach will not be exclusive reliance on one
technique, but careful choice of a portfolio of test and analysis techniques selected to obtain
acceptable results at acceptable cost, and addressing particular challenges posed by
characteristics of the application domain or software.
Further Reading
The "V" model of verification and validation (of which Figure 2.1 is a variant) appears in
many software engineering textbooks, and in some form can be traced at least as far back
as Myers' classic book [Mye79]. The distinction between validation and verification as given
here follows Boehm [Boe81], who has most memorably described validation as "building
the right system" and verification as "building the system right."
Exercises
2.1 The Chipmunk marketing division is worried about the start-up time of the new
version of the RodentOS operating system (an imaginary operating system of
Chipmunk). The marketing division representative suggests a software requirement
stating that the start-up time shall not be annoying to users. Explain why this
simple requirement is not verifiable, and try to reformulate the requirement to
make it verifiable.
Analysis and testing (A&T) has been common practice since the
earliest software projects. For a long time, A&T activities were
based on common sense and individual skills; A&T has emerged as
a distinct discipline only in the last three decades.
This chapter advocates six principles that characterize various
approaches and techniques for analysis and testing: sensitivity,
redundancy, restriction, partition, visibility, and feedback. Some of
these principles, such as partition, visibility, and feedback, are quite
general in engineering. Others, notably sensitivity, redundancy, and
restriction, are specific to A&T and contribute to characterizing A&T
as a discipline.
3.1 Sensitivity
Human developers make errors, producing faults in software. Faults
may lead to failures, but faulty software may not fail on every
execution. The sensitivity principle states that it is better to fail
every time than sometimes.
Consider the cost of detecting and repairing a software fault. If it is detected immediately
(e.g., by an on-the-fly syntactic check in a design editor), then the cost of correction is very
small, and in fact the line between fault prevention and fault detection is blurred. If a fault is
detected in inspection or unit testing, the cost is still relatively small. If a fault survives initial
detection efforts at the unit level, but triggers a failure detected in integration testing, the
cost of correction is much greater. If the first failure is detected in system or acceptance
testing, the cost is very high indeed, and the most costly faults are those detected by
customers in the field.
A fault that triggers a failure on every execution is unlikely to survive past unit testing. A
characteristic of faults that escape detection until much later is that they trigger failures only
rarely, or in combination with circumstances that seem unrelated or are difficult to control.
For example, a fault that results in a failure only for some unusual configurations of
customer equipment may be difficult and expensive to detect. A fault that results in a failure
randomly but very rarely - for example, a race condition that only occasionally causes data
corruption - may likewise escape detection until the software is in use by thousands of
customers, and even then be difficult to diagnose and correct.
The small C program in Figure 3.1 has three faulty calls to string
copy procedures. The calls to strcpy, strncpy, and stringCopy all pass
a source string "Muddled," which is too long to fit in the array
middle. The vulnerability of strcpy is well known, and is the culprit in
the by-now-standard buffer overflow attacks on many network
services. Unfortunately, the fault may or may not cause an
observable failure depending on the arrangement of memory (in this
case, it depends on what appears in the position that would be
middle[7], which will be overwritten with a newline character). The
standard recommendation is to use strncpy in place of strcpy. While
strncpy avoids overwriting other memory, it truncates the input
without warning, and sometimes without properly null-terminating
the output. The replacement function stringCopy, on the other hand,
uses an assertion to ensure that, if the target string is too long, the
program always fails in an observable manner.
/**
 * Worse than broken: Are you feeling lucky?
 */

#include <assert.h>
#include <stdio.h>   /* for printf */
#include <string.h>  /* for strcpy, strncpy, strlen */

char before[] = "=Before=";
char middle[] = "Middle";
char after[] = "=After=";

void show() {
    printf("%s\n%s\n%s\n", before, middle, after);
}

void stringCopy(char *target, const char *source, int howBig);

int main(int argc, char *argv[]) {
    show();
    strcpy(middle, "Muddled");                     /* Fault, but may not fail */
    show();
    strncpy(middle, "Muddled", sizeof(middle));    /* Fault, may not fail */
    show();
    stringCopy(middle, "Muddled", sizeof(middle)); /* Guaranteed to fail */
    show();
}

/* Sensitive version of strncpy; can be counted on to fail
 * in an observable way EVERY time the source is too large
 * for the target, unlike the standard strncpy or strcpy.
 */
void stringCopy(char *target, const char *source, int howBig) {
    assert(strlen(source) < howBig);
    strcpy(target, source);
}
Similarly, skilled test designers can derive excellent test suites, but
the quality of the test suites depends on the mood of the designers.
Systematic testing criteria may not do better than skilled test
designers, but they can reduce the influence of external factors,
such as the tester's mood.
3.2 Redundancy
Redundancy is the opposite of independence. If one part of a software artifact
(program, design document, etc.) constrains the content of another, then they are not
entirely independent, and it is possible to check them for consistency.
The concept and definition of redundancy are taken from information theory. In
communication, redundancy can be introduced into messages in the form of error-detecting and error-correcting codes to guard against transmission errors. In software
test and analysis, we wish to detect faults that could lead to differences between
intended behavior and actual behavior, so the most valuable form of redundancy is in the
form of an explicit, redundant statement of intent.
Where redundancy can be introduced or exploited with an automatic, algorithmic check
for consistency, it has the advantage of being much cheaper and more thorough than
dynamic testing or manual inspection. Static type checking is a classic application of this
principle: The type declaration is a statement of intent that is at least partly redundant
with the use of a variable in the source code. The type declaration constrains other parts
of the code, so a consistency check (type check) can be applied.
An important trend in the evolution of programming languages is introduction of additional
ways to declare intent and automatically check for consistency. For example, Java
enforces rules about explicitly declaring each exception that can be thrown by a method.
Checkable redundancy is not limited to program source code, nor is it something that
can be introduced only by programming language designers. For example, software
design tools typically provide ways to check consistency between different design views
or artifacts. One can also intentionally introduce redundancy in other software artifacts,
even those that are not entirely formal. For example, one might introduce rules quite
analogous to type declarations for semistructured requirements specification documents,
and thereby enable automatic checks for consistency and some limited kinds of
completeness.
When redundancy is already present - as between a software specification document
and source code - then the remaining challenge is to make sure the information is
represented in a way that facilitates cheap, thorough consistency checks. Checks that
can be implemented by automatic tools are usually preferable, but there is value even in
organizing information to make inconsistency easier to spot in manual inspection.
Of course, one cannot always obtain cheap, thorough checks of source code and other
documents. Sometimes redundancy is exploited instead with run-time checks. Defensive
programming, explicit run-time checks for conditions that should always be true if the
program is executing correctly, is another application of redundancy in programming.
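The principle can be made concrete with a small invented example (the class and its names are illustrative, not drawn from the text). The explicit run-time check in withdraw restates the programmer's intent redundantly with the code that maintains the balance, so a violation fails immediately and observably rather than silently corrupting state:

```java
// Hypothetical sketch of defensive programming as an application of
// redundancy. The check in withdraw() is an explicit, redundant statement
// of the intent "balance never goes negative".
public class Account {
    private int balance; // in cents; intended invariant: balance >= 0

    public Account(int initialBalance) {
        if (initialBalance < 0)
            throw new IllegalArgumentException("negative initial balance");
        balance = initialBalance;
    }

    public void withdraw(int amount) {
        // Defensive check: a condition that should always hold if callers
        // are correct; failing here is better than corrupting state.
        if (amount < 0 || amount > balance)
            throw new IllegalStateException("withdrawal violates invariant");
        balance -= amount;
    }

    public int getBalance() { return balance; }
}
```

As with the stringCopy assertion in Figure 3.1, the check adds nothing to normal functionality; its whole value is converting a rare, silent fault into an immediate, observable failure.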
3.3 Restriction
When there are no acceptably cheap and effective ways to check a property,
sometimes one can change the problem by checking a different, more restrictive
property or by limiting the check to a smaller, more restrictive class of programs.
Consider the problem of ensuring that each variable is initialized before it is used, on
every execution. Simple as the property is, it is not possible for a compiler or analysis
tool to precisely determine whether it holds. See the program in Figure 3.2 for an
illustration. Can the variable k ever be uninitialized the first time i is added to it? If
someCondition(0) always returns true, then k will be initialized to zero on the first time
through the loop, before k is incremented, so perhaps there is no potential for a run-time
error - but method someCondition could be arbitrarily complex and might even depend
on some condition in the environment. Java's solution to this problem is to enforce a
stricter, simpler condition: A program is not permitted to have any syntactic control paths
on which an uninitialized reference could occur, regardless of whether those paths could
actually be executed. The program in Figure 3.2 has such a path, so the Java compiler
rejects it.
/** A trivial method with a potentially uninitialized variable.
 *  Maybe someCondition(0) is always true, and so k is always
 *  initialized before use ... but a compiler cannot know.
 */
static void questionable() {
    int k;
    for (int i = 0; i < 10; ++i) {
        if (someCondition(i)) {
            k = 0;
        } else {
            k += i;
        }
    }
    System.out.println(k);
}
Figure 3.2: Can the variable k ever be uninitialized the first time i is added to it? The
property is undecidable, so Java enforces a simpler, stricter property.
Java's rule for initialization before use is a program source code restriction that enables
precise, efficient checking of a simple but important property by the compiler. Restriction
also applies at the architectural level: statelessness is the restriction underlying the
Hypertext Transport Protocol (HTTP) 1.0 of the World-Wide-Web, which made Web
servers not only much simpler and more robust but also much easier to test.
3.4 Partition
Partition, often also known as "divide and conquer," is a general engineering principle.
Dividing a complex problem into subproblems to be attacked and solved independently is
probably the most common human problem-solving strategy. Software engineering in
particular applies this principle in many different forms and at almost all development
levels, from early requirements specifications to code and maintenance. Analysis and
testing are no exception: the partition principle is widely used and exploited.
Partitioning can be applied both at process and technique levels. At the process level,
we divide complex activities into sets of simple activities that can be attacked
independently. For example, testing is usually divided into unit, integration, subsystem,
and system testing. In this way, we can focus on different sources of faults at different
steps, and at each step, we can take advantage of the results of the previous steps. For
instance, we can use units that have been tested as stubs for integration testing. Some
static analysis techniques likewise follow the modular structure of the software system
to divide an analysis problem into smaller steps.
Many static analysis techniques first construct a model of a system and then analyze the
model. In this way they divide the overall analysis into two subtasks: first simplify the
system to make the proof of the desired properties feasible and then prove the property
with respect to the simplified model. The question "Does this program have the desired
property?" is decomposed into two questions, "Does this model have the desired
property?" and "Is this an accurate model of the program?"
Since it is not possible to execute the program with every conceivable input, systematic
testing strategies must identify a finite number of classes of test cases to execute.
Whether the classes are derived from specifications (functional testing) or from program
structure (structural testing), the process of enumerating test obligations proceeds by
dividing the sources of information into significant elements (clauses or special values
identifiable in specifications, statements or paths in programs), and creating test cases
that cover each such element or certain combinations of elements.
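The enumeration of test obligations described above can be sketched with an invented example. Suppose a specification states that shipping is free up to 1000 grams, costs a flat fee up to 5000 grams, and that heavier or nonpositive weights are rejected; the specification clauses induce a partition of the input domain, and the test designer chooses a representative of each class and of the boundaries between classes. The function and its values are hypothetical:

```java
// Hypothetical illustration of the partition principle in functional test
// design: the specification clauses divide the input domain into classes,
// and we select representatives and boundary values rather than sampling
// the whole int domain.
public class Shipping {
    static int cost(int grams) {
        if (grams <= 0 || grams > 5000)      // clause: invalid weights rejected
            throw new IllegalArgumentException("invalid weight");
        return grams <= 1000 ? 0 : 5;        // clauses: free tier / flat fee
    }

    // One representative per specification-induced class, plus boundaries.
    static int[] testWeights() {
        return new int[] { 500, 1000, 1001, 5000 };
    }
}
```

Each chosen weight discharges one test obligation; together the four values cover every clause of the (invented) specification plus both sides of the free-tier boundary.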
3.5 Visibility
Visibility means the ability to measure progress or status against goals. In software
engineering, one encounters the visibility principle mainly in the form of process visibility,
and then mainly in the form of schedule visibility: ability to judge the state of development
against a project schedule. Quality process visibility also applies to measuring achieved
(or predicted) quality against quality goals. The principle of visibility involves setting goals
that can be assessed as well as devising methods to assess their realization.
Visibility is closely related to observability, the ability to extract useful information from a
software artifact. The architectural design and build plan of a system determines what
will be observable at each stage of development, which in turn largely determines the
visibility of progress against goals at that stage.
A variety of simple techniques can be used to improve observability. For example, it is
no accident that important Internet protocols like HTTP and SMTP (Simple Mail
Transport Protocol, used by Internet mail servers) are based on the exchange of simple
textual commands. The choice of simple, human-readable text rather than a more
compact binary encoding has a small cost in performance and a large payoff in
observability, including making construction of test drivers and oracles much simpler.
Use of human-readable and human-editable files is likewise advisable wherever the
performance cost is acceptable.
A variant of observability through direct use of simple text encodings is providing readers
and writers to convert between other data structures and simple, human- readable and
editable text. For example, when designing classes that implement a complex data
structure, designing and implementing also a translation from a simple text format to the
internal structure, and vice versa, will often pay back handsomely in both ad hoc and
systematic testing. For similar reasons it is often useful to design and implement an
equality check for objects, even when it is not necessary to the functionality of the
software product.
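The suggestions above - simple text readers and writers, plus an equality check - can be sketched as follows. The class is invented for illustration; the point is that toText, fromText, and equals together make test oracles cheap to write:

```java
// Hypothetical sketch of designing for observability: a structured value
// gets a human-readable text encoding (writer and reader) and an equality
// check, so expected and actual values can be compared directly in tests.
public class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    // Writer: simple, human-readable and human-editable encoding.
    public String toText() { return x + "," + y; }

    // Reader: inverse of toText; useful for both ad hoc and systematic testing.
    public static Point fromText(String s) {
        String[] parts = s.split(",");
        return new Point(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
    }

    @Override public boolean equals(Object o) {
        return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
    }
    @Override public int hashCode() { return 31 * x + y; }
}
```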
3.6 Feedback
Feedback is another classic engineering principle that applies to analysis and testing.
Feedback applies both to the process itself (process improvement) and to individual
techniques (e.g., using test histories to prioritize regression testing).
Systematic inspection and walkthrough derive part of their success from feedback.
Participants in inspection are guided by checklists, and checklists are revised and
refined based on experience. New checklist items may be derived from root cause
analysis, analyzing previously observed failures to identify the initial errors that lead to
them.
Summary
Principles constitute the core of a discipline. They form the basis of methods,
techniques, methodologies and tools. They permit understanding, comparing, evaluating
and extending different approaches, and they constitute the lasting basis of knowledge
of a discipline.
The six principles described in this chapter are
Sensitivity: better to fail every time than sometimes,
Redundancy: making intentions explicit,
Restriction: making the problem easier,
Partition: divide and conquer,
Visibility: making information accessible, and
Feedback: applying lessons from experience in process and techniques.
Principles are identified heuristically by searching for a common denominator of
techniques that apply to various problems and exploit different methods, sometimes
borrowing ideas from other disciplines, sometimes observing recurrent phenomena.
Potential principles are validated by finding existing and new techniques that exploit the
underlying ideas. Generality and usefulness of principles become evident only with time.
The initial list of principles proposed in this chapter is certainly incomplete. Readers are
invited to validate the proposed principles and identify additional principles.
Further Reading
Analysis and testing is a relatively new discipline. To our knowledge, the principles
underlying analysis and testing have not been discussed in the literature previously.
Some of the principles advocated in this chapter are shared with other software
engineering disciplines and are discussed in many books. A good introduction to
software engineering principles is the third chapter of Ghezzi, Jazayeri, and Mandrioli's
book on software engineering [GJM02].
Exercises
3.1 Indicate which principles guided the following choices:
1. Use an externally readable format also for internal files, when possible.
2. Collect and analyze data about faults revealed and removed from the code.
3. Separate test and debugging activities; that is, separate the design and
execution of test cases to reveal failures (test) from the localization and
removal of the corresponding faults (debugging).
4. Distinguish test case design from execution.
5. Produce complete fault reports.
6. Use information from test case design to improve requirements and
design specifications.
7. Provide interfaces for fully inspecting the internal state of a class.
Attention to testability in architectural design may suggest structures and interfaces that
not only facilitate testing earlier in development, but also make key interfaces simpler
and more precisely defined.
There is also another reason for carrying out quality activities at the earliest opportunity
and for preferring earlier to later activities when either could serve to detect the same
fault: The single best predictor of the cost of repairing a software defect is the time
between its introduction and its detection. A defect introduced in coding is far cheaper to
repair during unit test than later during integration or system test, and most expensive if
it is detected by a user of the fielded system. A defect introduced during requirements
engineering (e.g., an ambiguous requirement) is relatively cheap to repair at that stage,
but may be hugely expensive if it is only uncovered by a dispute about the results of a
system acceptance test.
It is quite possible to build systems that are very reliable, relatively free from
hazards, and completely useless. They may be unbearably slow, or have terrible user
interfaces and unfathomable documentation, or they may be missing several crucial
features. How should these properties be considered in software quality? One answer is
that they are not part of quality at all unless they have been explicitly specified, since
quality is the presence of specified properties. However, a company whose products are
rejected by its customers will take little comfort in knowing that, by some definitions,
they were high-quality products.
We can do better by considering quality as fulfillment of required and desired properties,
as distinguished from specified properties. For example, even if a client does not
explicitly specify the required performance of a system, there is always some level of
performance that is required to be useful.
One of the most critical tasks in software quality analysis is making desired properties
explicit, since properties that remain unspecified (even informally) are very likely to
surface unpleasantly when it is discovered that they are not met. In many cases these
implicit requirements can not only be made explicit, but also made sufficiently precise
that they can be made part of dependability or reliability. For example, while it is better
to explicitly recognize usability as a requirement than to leave it implicit, it is better yet to
augment usability requirements with specific interface standards, so that a deviation
from the standards is recognized as a defect.
Safety and robustness, discussed next, are properties that are less dependent on
specification and that do distinguish among failures depending on severity.
Software safety is an extension of the well-established field of system safety into
software. Safety is concerned with preventing certain undesirable behaviors, called
hazards. It is quite explicitly not concerned with achieving any useful behavior apart from
whatever functionality is needed to prevent hazards. Software safety is typically a
concern in "critical" systems such as avionics and medical systems, but the basic
principles apply to any system in which particularly undesirable behaviors can be
distinguished from run-of-the-mill failure. For example, while it is annoying when a word
processor crashes, it is much more annoying if it irrecoverably corrupts document files.
The developers of a word processor might consider safety with respect to the hazard of
file corruption separately from reliability with respect to the complete functional
requirements for the word processor.
Just as correctness is meaningless without a specification of allowed behaviors, safety
is meaningless without a specification of hazards to be prevented, and in practice the
first step of safety analysis is always finding and classifying hazards. Typically, hazards
are associated with some system in which the software is embedded (e.g., the medical
device), rather than the software alone. The distinguishing feature of safety is that it is
concerned only with these hazards, and not with other aspects of correct functioning.
The concept of safety is perhaps easier to grasp with familiar physical systems. For
example, lawn-mowers in the United States are equipped with an interlock device,
sometimes called a "dead-man switch." If this switch is not actively held by the operator,
the engine shuts off. The dead-man switch does not contribute in any way to cutting
grass; its sole purpose is to prevent the operator from reaching into the mower blades
while the engine runs.
One is tempted to say that safety is an aspect of correctness, because a good system
specification would rule out hazards. However, safety is best considered as a quality
distinct from correctness and reliability for two reasons. First, by focusing on a few
hazards and ignoring other functionality, a separate safety specification can be much
simpler than a complete system specification, and therefore easier to verify. To put it
another way, while a good system specification should rule out hazards, we cannot be
confident that either specifications or our attempts to verify systems are good enough to
provide the degree of assurance we require for hazard avoidance. Second, even if the
safety specification were redundant with regard to the full system specification, it is
important because (by definition) we regard avoidance of hazards as more crucial than
satisfying other parts of the system specification.
Correctness and reliability are contingent upon normal operating conditions. It is not
reasonable to expect a word processing program to save changes normally when the
file does not fit in storage, or to expect a database to continue to operate normally when
the computer loses power, or to expect a Web site to provide completely satisfactory
service to all visitors when the load is 100 times greater than the maximum for which it
was designed. Software that fails under these conditions, which violate the premises of
its design, may still be "correct" in the strict sense, yet the manner in which the software
fails is important. It is acceptable that the word processor fails to write the new file that
does not fit on disk, but unacceptable to also corrupt the previous version of the file in
the attempt. It is acceptable for the database system to cease to function when the
power is cut, but unacceptable for it to leave the database in a corrupt state. And it is
usually preferable for the Web system to turn away some arriving users rather than
becoming too slow for all, or crashing. Software that gracefully degrades or fails "softly"
outside its normal operating parameters is robust.
Software safety is a kind of robustness, but robustness is a more general notion that
concerns not only avoidance of hazards (e.g., data corruption) but also partial
functionality under unusual situations. Robustness, like safety, begins with explicit
consideration of unusual and undesirable situations, and should include augmenting
software specifications with appropriate responses to undesirable events.
Figure 4.1 illustrates the relation among dependability properties.
Quality analysis should be part of the feasibility study. The sidebar on page 47 shows an
excerpt of the feasibility study for the Chipmunk Web presence. The primary quality
requirements are stated in terms of dependability, usability, and security. Performance,
portability and interoperability are typically not primary concerns at this stage, but they
may come into play when dealing with other qualities.
4.5 Analysis
Analysis techniques that do not involve actual execution of program source code play a
prominent role in overall software quality processes. Manual inspection techniques and
automated analyses can be applied at any development stage. They are particularly well
suited to the early stages of specifications and design, where the lack of executability of
many intermediate artifacts reduces the efficacy of testing.
Inspection, in particular, can be applied to essentially any document including
requirements documents, architectural and more detailed design documents, test plans
and test cases, and of course program source code. Inspection may also have
secondary benefits, such as spreading good practices and instilling shared standards of
quality. On the other hand, inspection takes a considerable amount of time and requires
meetings, which can become a scheduling bottleneck. Moreover, re-inspecting a
changed component can be as expensive as the initial inspection. Despite the versatility
of inspection, therefore, it is used primarily where other techniques are either
inapplicable or where other techniques do not provide sufficient coverage of common
faults.
Automated static analyses are more limited in applicability (e.g., they can be applied to
some formal representations of requirements models but not to natural language
documents), but are selected when available because substituting machine cycles for
human effort makes them particularly cost-effective. The cost advantage of automated
static analyses is diminished by the substantial effort required to formalize and properly
structure a model for analysis, but their application can be further motivated by their
ability to thoroughly check for particular classes of faults for which checking with other
techniques is very difficult or expensive. For example, finite state verification techniques
for concurrent systems require construction and careful structuring of a formal design
model, and address only a particular family of faults (faulty synchronization structure).
Yet these techniques are rapidly gaining acceptance in some application domains because that family of
faults is difficult to detect in manual inspection and resists detection through dynamic
testing.
Excerpt of Web Presence Feasibility Study
Purpose of this document
This document was prepared for the Chipmunk IT management team. It describes the
results of a feasibility study undertaken to advise Chipmunk corporate management
whether to embark on a substantial redevelopment effort to add online shopping
functionality to the Chipmunk Computers' Web presence.
Goals
Sometimes the best aspects of manual inspection and automated static analysis can be
obtained by carefully decomposing properties to be checked. For example, suppose a
desired property of requirements documents is that each special term in the application
domain appear in a glossary of terms. This property is not directly amenable to an
automated static analysis, since current tools cannot distinguish meaningful domain
terms from other terms that have their ordinary meanings. The property can be checked
with manual inspection, but the process is tedious, expensive, and error-prone. A hybrid
approach can be applied if each domain term is marked in the text. Manually checking
that domain terms are marked is much faster and therefore less expensive than
manually looking each term up in the glossary, and marking the terms permits effective
automation of cross-checking with the glossary.
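The automated half of this hybrid check can be sketched as a small program. Assuming a marking convention in which domain terms are wrapped in double brackets (an invented convention, as are the names below), cross-checking against the glossary is straightforward to automate:

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical sketch of the hybrid glossary check: humans verify that domain
// terms are marked (here as [[term]]); a tool then verifies that every marked
// term appears in the glossary.
public class GlossaryCheck {
    static List<String> missingTerms(String document, Set<String> glossary) {
        List<String> missing = new ArrayList<>();
        // Match each marked term [[...]] and capture the term itself.
        Matcher m = Pattern.compile("\\[\\[([^\\]]+)\\]\\]").matcher(document);
        while (m.find()) {
            String term = m.group(1);
            if (!glossary.contains(term)) missing.add(term);
        }
        return missing;
    }
}
```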
4.6 Testing
Despite the attractiveness of automated static analyses when they
are applicable, and despite the usefulness of manual inspections for
a variety of documents including but not limited to program source
code, dynamic testing remains a dominant technique. A closer look,
though, shows that dynamic testing is really divided into several
distinct activities that may occur at different points in a project.
Tests are executed when the corresponding code is available, but
testing activities start earlier, as soon as the artifacts required for
designing test case specifications are available. Thus, acceptance
and system test suites should be generated before integration and
unit test suites, even if executed in the opposite order.
Early test design has several advantages. Tests are specified
independently from code and when the corresponding software
specifications are fresh in the mind of analysts and developers,
facilitating review of test design. Moreover, test cases may highlight
inconsistencies and incompleteness in the corresponding software
specifications. Early design of test cases also allows for early repair
of software specifications, preventing specification faults from
propagating to later stages in development. Finally, programmers
may use test cases to illustrate and clarify the software
specifications, especially for errors and unexpected conditions.
No engineer would build a complex structure from parts that have not themselves been
subjected to quality control. Just as the "earlier is better" rule dictates using inspection to
reveal flaws in requirements and design before they are propagated to program code, the
same rule dictates module testing to uncover as many program faults as possible before
they are incorporated in larger subsystems of the product. At Chipmunk, developers are
expected to perform functional and structural module testing before a work assignment is
considered complete and added to the project baseline. The test driver and auxiliary files
are part of the work product and are expected to make reexecution of test cases, including
result checking, as simple and automatic as possible, since the same test cases will be
used over and over again as the product evolves.
For example, the persistence of buffer overflow faults might be attributed to the use of
programming languages with unconstrained pointers and without array bounds checking,
which may in turn be attributed to performance concerns and a requirement for
interoperability with a large body of legacy code. The countermeasure could involve
differences in programming methods (e.g., requiring use of certified "safe" libraries for
buffer management), or improvements to quality assurance activities (e.g., additions to
inspection checklists), or sometimes changes in management practices.
Summary
Test and analysis activities are not a late phase of the development process, but rather
a wide set of activities that pervade the whole process. Designing a quality process with
a suitable blend of test and analysis activities for the specific application domain,
development environment, and quality goals is a challenge that requires skill and
experience.
A well-defined quality process must fulfill three main goals: improving the software
product during and after development, assessing its quality before delivery, and
improving the process within and across projects. These challenging goals can be
achieved by increasing visibility, scheduling activities as early as practical, and
monitoring results to adjust the process. Process visibility - that is, measuring and
comparing progress to objectives - is a key property of the overall development
process. Performing A&T activities early produces several benefits: It increases control
over the process, it hastens fault identification and reduces the costs of fault removal, it
provides data for incrementally tuning the development process, and it accelerates
product delivery. Feedback is the key to improving the process by identifying and
removing persistent errors and faults.
Further Reading
Qualities of software are discussed in many software engineering textbooks; the
discussion in Chapter 2 of Ghezzi, Jazayeri, and Mandrioli [GJM02] is particularly useful.
Process visibility is likewise described in software engineering textbooks, usually with an
emphasis on schedule. Musa [Mus04] describes a quality process oriented particularly
to establishing a quantifiable level of reliability based on models and testing before
release. Chillarege et al. [CBC+92] present principles for gathering and analyzing fault
data, with an emphasis on feedback within a single process but applicable also to quality
process improvement.
Exercises
4.1 We have stated that 100% reliability is indistinguishable from correctness, but they
are not quite identical. Under what circumstance might an incorrect program be
100% reliable? Hint: Recall that a program may be more or less reliable depending
on how it is used, but a program is either correct or incorrect regardless of usage.
4.2 We might measure the reliability of a network router as the fraction of all packets
that are correctly routed, or as the fraction of total service time in which packets
are correctly routed. When might these two measures be different?
4.3 If I am downloading a very large file over a slow modem, do I care more about the
availability of my internet service provider or its mean time between failures?
Chapter List
Chapter 5: Finite Models
Chapter 6: Dependence and Data Flow Models
Chapter 7: Symbolic Execution and Proof of Properties
Chapter 8: Finite State Verification
5.1 Overview
A model is a representation that is simpler than the artifact it represents but preserves (or
at least approximates) some important attributes of the actual artifact. Our concern in this
chapter is with models of program execution, and not with models of other (equally
important) attributes such as the effort required to develop the software or its usability. A
good model (or, more precisely, a good class of models) must typically be compact,
predictive, semantically meaningful, and sufficiently general. Generality is judged relative
to how we use models: a restriction we can design around is tolerable, and we may choose
conventional concurrency control protocols over novel approaches for exactly that reason.
However, if a program analysis technique for C programs is applicable only to programs
without pointer variables, we are unlikely to find much use for it.
Since design models are intended partly to aid in making and evaluating design decisions,
they should share these characteristics with models constructed primarily for analysis.
However, some kinds of models - notably the widely used UML design notations - are
designed primarily for human communication, with less attention to semantic meaning and
prediction.
Models are often used indirectly in evaluating an artifact. For example, some models are
not themselves analyzed, but are used to guide test case selection. In such cases, the
qualities of being predictive and semantically meaningful apply to the model together with
the analysis or testing technique applied to another artifact, typically the actual program or
system.
Graph Representations
We often use directed graphs to represent models of programs.
Usually we draw them as "box and arrow" diagrams, but to reason
about them it is important to understand that they have a well-defined mathematical meaning, which we review here.
A directed graph is composed of a set of nodes N and a relation E on the set (that is, a set
of ordered pairs), called the edges. It is conventional to draw the nodes as points or
shapes and to draw the edges as arrows. For example:
Typically, the nodes represent entities of some kind, such as procedures or classes or
regions of source code. The edges represent some relation among the entities. For
example, if we represent program control flow using a directed graph model, an edge (a,b)
would be interpreted as the statement "program region a can be directly followed by
program region b in program execution."
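The definitions above are easy to make concrete. The following sketch (node names are invented for illustration) represents a directed graph as a node set and an edge relation, and derives the successor and predecessor functions directly from the relation:

```python
# A directed graph: a set of nodes N and an edge relation E, a set of
# ordered pairs over N. Node names are invented for illustration.
N = {"a", "b", "c"}
E = {("a", "b"), ("b", "c"), ("a", "c")}

def successors(n):
    """Nodes m with (n, m) in E: regions that may directly follow n."""
    return {m for (src, m) in E if src == n}

def predecessors(n):
    """Nodes m with (m, n) in E: regions that may directly precede n."""
    return {m for (m, dst) in E if dst == n}

print(sorted(successors("a")))    # ['b', 'c']
print(sorted(predecessors("c")))  # ['a', 'b']
```

However a graph is drawn, it is only this pair of sets that matters for reasoning about the model.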
We can label nodes with the names or descriptions of the entities they represent. If nodes a
and b represent program regions containing assignment statements, we might draw the two
nodes and an edge (a,b) connecting them in this way:
Drawings of graphs can be refined in many ways, for example, depicting some relations as
attributes rather than directed edges. Important as these presentation choices may be for
clear communication, only the underlying sets and relations matter for reasoning about
models.
Figure 5.1 (caption excerpt): The relation between parts 2(a) and 2(b) illustrates the
introduction of nondeterminism: program execution states with different successor states
have been merged, since the first and third program execution steps modify only an
attribute omitted from the model.
Finite models of program execution are inevitably imperfect. Collapsing the potentially
infinite states of actual execution into a finite number of representative model states
necessarily involves omitting some information. While one might hope that the omitted
information is irrelevant to the property one wishes to verify, this is seldom completely true.
In Figure 5.1, parts 2(a) and 2(b) illustrate how abstraction can cause a set of deterministic
transitions to be modeled by a nondeterministic choice among transitions, thus making the
analysis imprecise. This in turn can lead to "false alarms" in analysis of models.
[1] We put aside, for the moment, the possibility of parallel or concurrent execution.
Figure 5.2: Building blocks for constructing intraprocedural control flow graphs. Other
control constructs are represented analogously. For example, the for construct of C,
C++, and Java is represented as if the initialization part appeared before a while
loop, with the increment part at the end of the while loop body.
In terms of program execution, we can say that a control flow graph model retains some
information about the program counter (the address of the next instruction to be
executed), and elides other information about program execution (e.g., the values of
variables). Since information that determines the outcome of conditional branches is
elided, the control flow graph represents not only possible program paths but also some
paths that cannot be executed. This corresponds to the introduction of nondeterminism
illustrated in Figure 5.1.
The nodes in a control flow graph could represent individual program statements, or
even individual machine operations, but it is desirable to make the graph model as
compact and simple as possible. Usually, therefore, nodes in a control flow graph model
of a program represent not a single point but rather a basic block, a maximal program
region with a single entry and a single exit point.
/**
 * Remove/collapse multiple newline characters.
 *
 * @param String string to collapse newlines in.
 * @return String
 */
public static String collapseNewlines(String argStr)
{
    char last = argStr.charAt(0);
    StringBuffer argBuf = new StringBuffer();
    for (int cIdx = 0; cIdx < argStr.length(); cIdx++)
    {
        char ch = argStr.charAt(cIdx);
        if (ch != '\n' || last != '\n')
        {
            argBuf.append(ch);
            last = ch;
        }
    }
    return argBuf.toString();
}
Figure 5.3: A Java method to collapse adjacent newline characters, from the
StringUtilities class of the Velocity project of the open source Apache project. (c)
2001 Apache Software Foundation, used with permission.
Figure 5.4: A control flow graph corresponding to the Java method in Figure 5.3. The
for statement and the predicate of the if statement have internal control flow
branches, so those statements are broken across basic blocks.
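Partitioning code into basic blocks, as was done to obtain the blocks of Figure 5.4, is mechanical. A sketch (the `(op, target)` instruction encoding here is invented for illustration): every jump target and every instruction following a jump begins a new block, and each block runs from one such "leader" to just before the next.

```python
# Partition an instruction list into basic blocks by finding "leaders":
# instruction 0, every jump target, and every instruction following a
# jump. The (op, target) encoding is invented for illustration.
def basic_blocks(instrs):
    leaders = {0}
    for i, (op, target) in enumerate(instrs):
        if op in ("jump", "branch"):
            leaders.add(target)          # the target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)       # the fall-through starts a block
    starts = sorted(leaders)
    return [list(range(s, e)) for s, e in zip(starts, starts[1:] + [len(instrs)])]

# Example: a tiny loop; the branch at index 2 jumps back to index 1.
prog = [("op", None), ("op", None), ("branch", 1), ("op", None)]
print(basic_blocks(prog))  # [[0], [1, 2], [3]]
```

Each resulting block has a single entry (its leader) and a single exit (its last instruction), as required by the definition above.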
Some analysis algorithms are simplified by introducing a distinguished node to represent
procedure entry and another to represent procedure exit. When these distinguished start
and end nodes are used in a CFG, a directed edge leads from the start node to the
node representing the first executable block, and a directed edge from each procedure
exit (e.g., each return statement and the last sequential block in the program) to the
distinguished end node. Our practice will be to draw a start node identified with the
procedure or method signature, and to leave the end node implicit.
The intraprocedural control flow graph may be used directly to define thoroughness
criteria for testing (see Chapters 9 and 12). Often the control flow graph is used to
define another model, which in turn is used to define a thoroughness criterion. For
example, some criteria are defined by reference to linear code sequences and jumps
(LCSAJs), which are essentially subpaths of the control flow graph from one branch to
another. Figure 5.5 shows the LCSAJs derived from the control flow graph of Figure 5.4.
From     Sequence of basic blocks     To
entry    b1 b2 b3                     jX
entry    b1 b2 b3 b4                  jT
entry    b1 b2 b3 b4 b5               jE
entry    b1 b2 b3 b4 b5 b6 b7         jL
jX       b8                           return
jL       b3 b4                        jT
jL       b3 b4 b5                     jE
jL       b3 b4 b5 b6 b7               jL
Figure 5.5: Linear code sequences and jumps (LCSAJs) corresponding to the Java
method in Figure 5.3 and the control flow graph in Figure 5.4. Note that proceeding to
the next sequential basic block is not considered a "jump" for purposes of identifying
LCSAJs.
For use in analysis, the control flow graph is usually augmented with other information.
For example, the data flow models described in the next chapter are constructed using a
CFG model augmented with information about the variables accessed and modified by
each program statement.
Not all control flow is represented explicitly in program text. For example, if an empty
string is passed to the collapseNewlines method of Figure 5.3, the exception
java.lang.StringIndexOutOfBoundsException will be thrown by String.charAt, and
execution of the method will be terminated. This could be represented in the CFG as a
directed edge to an exit node. However, if one includes such implicit control flow edges
for every possible exception (for example, an edge from each reference that might lead
to a null pointer exception), the CFG becomes rather unwieldy.
More fundamentally, it may not be simple or even possible to determine which of the
implicit control flow edges can actually be executed. We can reason about the call to
argStr.charAt(cIdx) within the body of the for loop and determine that cIdx must always
be within bounds, but we cannot reasonably expect an automated tool for extracting
control flow graphs to perform such inferences. Whether to include some or all implicit
control flow edges in a CFG representation therefore involves a trade-off between
possibly omitting some execution paths or representing many spurious paths. Which is
preferable depends on the uses to which the CFG representation will be put.
Even the representation of explicit control flow may differ depending on the uses to
which a model is put. In Figure 5.3, the for statement has been broken into its
constituent parts (initialization, comparison, and increment for next iteration), each of
which appears at a different point in the control flow. For some kinds of analysis, this
breakdown would serve no useful purpose. Similarly, a complex conditional expression in
Java or C is executed by "short-circuit" evaluation, so the single expression i > 0 && i < 10
can be broken across two basic blocks (the second test is not executed if the first
evaluates to false). If this fine level of execution detail is not relevant to an analysis, we
may choose to ignore short-circuit evaluation and treat the entire conditional expression
as if it were fully evaluated.
Figure 5.6: Overapproximation in a call graph. Although the method A.check() can
never actually call C.foo(), a typical call graph construction will include it as a possible
call.
If a call graph model represents different behaviors of a procedure depending on where
the procedure is called, we call it context-sensitive. For example, a context-sensitive
model of collapseNewlines might distinguish between one call in which the argument
string cannot possibly be empty, and another in which it could be. Context-sensitive
analyses can be more precise than context-insensitive analyses when the model includes
some additional information that is shared or passed among procedures. Information not
only about the immediate calling context, but about the entire chain of procedure calls
may be needed, as illustrated in Figure 5.7. In that case the cost of context-sensitive
analysis depends on the number of paths from the root (main program) to each lowest
level procedure. The number of paths can be exponentially larger than the number of
procedures, as illustrated in Figure 5.8.
Figure 5.7: The Java code above can be represented by the context-insensitive call
graph at left. However, to capture the fact that method depends never attempts to
store into a nonexistent array element, it is necessary to represent parameter values
that differ depending on the context in which depends is called, as in the context-sensitive call graph on the right.
Figure 5.8: The number of paths in a call graph - and therefore the number of calling
contexts in a context-sensitive analysis - can be exponentially larger than the number
of procedures, even without recursion.
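The blow-up illustrated in Figure 5.8 is easy to reproduce. The sketch below (the call graph shape is ours, mirroring the figure: a chain of procedures in which each calls the next from two distinct call sites) counts calling contexts in an acyclic call graph:

```python
# Counting calling contexts (call paths from the entry procedure) in an
# acyclic call graph. Shape mirrors Figure 5.8: each of a chain of
# procedures calls the next one from two distinct call sites.
calls = {f"p{i}": [f"p{i+1}", f"p{i+1}"] for i in range(1, 10)}
calls["p10"] = []  # the lowest-level procedure

def contexts(proc):
    """Number of distinct call paths from proc to the bottom of the chain."""
    if not calls[proc]:
        return 1
    return sum(contexts(callee) for callee in calls[proc])

# Ten procedures, but 2**9 = 512 calling contexts for the deepest one.
print(contexts("p1"))  # 512
```

Doubling the length of the chain squares the number of contexts, which is why fully context-sensitive analyses are often impractical.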
public class C {

    public static C cFactory(String kind) {
        if (kind == "C") return new C();
        if (kind == "S") return new S();
        return null;
    }

    void foo() {
        System.out.println("You called the parent's method");
    }

    public static void main(String args[]) {
        (new A()).check();
    }
}

class S extends C {
    void foo() {
        System.out.println("You called the child's method");
    }
}

class A {
    void check() {
        C myC = C.cFactory("S");
        myC.foo();
    }
}
The Java compiler uses a typical call graph model to enforce the language rule that all
checked exceptions are either handled or declared in each method. The throws clauses
in a method declaration are provided by the programmer, but if they were not, they
would correspond exactly to the information that a context-insensitive analysis of
exception propagation would associate with each procedure (which is why the compiler
can check for completeness and complain if the programmer omits an exception that can
be thrown).
Figure 5.9: Finite state machine (Mealy machine) description of line-end conversion
procedure, depicted as a state transition diagram (top) and as a state transition table
(bottom). An omission is obvious in the tabular representation, but easy to overlook in
the state transition diagram.
There are three kinds of correctness relations that we may reason about with respect to
finite state machine models, illustrated in Figure 5.10. The first is internal properties,
such as completeness and determinism. Second, the possible executions of a model,
described by paths through the FSM, may satisfy (or not) some desired property. Third,
the finite state machine model should accurately represent possible behaviors of the
program. Equivalently, the program should be a correct implementation of the finite state
machine model. We will consider each of the three kinds of correctness relation in turn
with respect to the FSM model of Figure 5.9.
Figure 5.10: Correctness relations for a finite state machine model. Consistency and
completeness are internal properties, independent of the program or a higher-level
specification. If, in addition to these internal properties, a model accurately represents
a program and satisfies a higher-level specification, then by definition the program
itself satisfies the higher-level specification.
The graph on the right is called the dual of the graph on the left. Taking the dual of
the graph on the right, one obtains again the graph on the left.
The choice between associating nodes or edges with computations performed by a
program is only a matter of convention and convenience, and is not an important
difference between CFG and FSM models. In fact, aside from this minor difference in
customary presentation, the control flow graph is a particular kind of finite state
machine model in which the abstract states preserve some information about control
flow (program regions and their execution order) and elide all other information about
program state.
Many details are purposely omitted from the FSM model depicted in Figure 5.9, but it is
also incomplete in an undesirable way. Normally, we require a finite state machine
specification to be complete in the sense that it prescribes the allowed behavior(s) for
any possible sequence of inputs or events. For the line-end conversion specification, the
state transition diagram does not include a transition from state l on carriage return; that
is, it does not specify what the program should do if it encounters a carriage return
immediately after a line feed.
An alternative representation of finite state machines, including Mealy machines, is the
state transition table, also illustrated in Figure 5.9. There is one row in the transition
table for each state node and one column for each event or input. If the FSM is
complete and deterministic, there should be exactly one transition in each table entry.
Since this table is for a Mealy machine, the transition in each table entry indicates both
the next state and the response (e.g., d / emit means "emit and then proceed to state
d"). The omission of a transition from state l on a carriage return is glaringly obvious
when the state transition diagram is written in tabular form.
[Table relating concrete program states - identified by source line and the values of
pos and atCR - to the abstract states of the FSM, including w (Within line) and d (Done).]
Figure 5.11: Procedure to convert among Dos, Unix, and Macintosh line
ends.
         LF          CR          EOF         other char
e        e / emit    l / emit    d / -       w / append
w        e / emit    l / emit    d / emit    w / append
l        e / -       l / emit    d / -       w / append
Figure 5.12: Completed finite state machine (Mealy machine) description of line-end
conversion procedure, depicted as a state-transition table (bottom). The omitted
transition in Figure 5.9 has been added.
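The completeness and determinism checks described above are entirely mechanical: in tabular form, every (state, input) cell must hold exactly one transition. A sketch (the transition entries follow the line-end conversion machine as reconstructed here, so treat their details as illustrative):

```python
# A Mealy machine as a dict mapping (state, input) -> (next_state, action).
# States: e (empty), w (within line), l (after carriage return); d (done)
# is entered on EOF. Entries follow the line-end conversion example and
# are illustrative; "-" means no output action.
INPUTS = ["LF", "CR", "EOF", "other"]
STATES = ["e", "w", "l"]
table = {
    ("e", "LF"): ("e", "emit"),  ("e", "CR"): ("l", "emit"),
    ("e", "EOF"): ("d", "-"),    ("e", "other"): ("w", "append"),
    ("w", "LF"): ("e", "emit"),  ("w", "CR"): ("l", "emit"),
    ("w", "EOF"): ("d", "emit"), ("w", "other"): ("w", "append"),
    ("l", "LF"): ("e", "-"),     ("l", "CR"): ("l", "emit"),
    ("l", "EOF"): ("d", "-"),    ("l", "other"): ("w", "append"),
}

def missing_transitions(table):
    """Cells with no transition: the machine is incomplete if any exist."""
    return [(s, i) for s in STATES for i in INPUTS if (s, i) not in table]

print(missing_transitions(table))  # [] -- the machine is complete
# Deleting the (l, CR) entry reproduces the omission of Figure 5.9:
incomplete = {k: v for k, v in table.items() if k != ("l", "CR")}
print(missing_transitions(incomplete))  # [('l', 'CR')]
```

Because a dict cannot hold two values for one key, this representation is deterministic by construction; a machine with sets of transitions per cell would also need a check that each set has exactly one element.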
With this state abstraction function, we can check conformance between the source
code and each transition in the FSM. For example, the transition from state e to state l
is interpreted to mean that, if execution is at the head of the loop with pos equal to zero
and atCR also zero (corresponding to state e), and the next character encountered is a
carriage return, then the program should perform operations corresponding to the emit
action and then enter a state in which pos is zero and atCR is 1 (corresponding to state
l). It is easy to verify that this transition is implemented correctly. However, if we
examine the transition from state l to state w, we will discover that the code does not
correspond because the variable atCR is not reset to zero, as it should be. If the
program encounters a carriage return, then some text, and then a line feed, the line feed
will be discarded - a program fault.
The fault in the conversion program was actually detected by the authors through
testing, and not through manual verification of correspondence between each transition
and program source code. Making the abstraction function explicit was nonetheless
important to understanding the nature of the error and how to repair it.
Summary
Models play many of the same roles in software development as in engineering of other
kinds of artifacts. Models must be much simpler than the artifacts they describe, but
must preserve enough essential detail to be useful in making choices. For models of
software execution, this means that a model must abstract away enough detail to
represent the potentially infinite set of program execution states by a finite and suitably
compact set of model states.
Some models, such as control flow graphs and call graphs, can be extracted from
programs. The key trade-off for these extracted models is precision (retaining enough
information to be predictive) versus the cost of producing and storing the model. Other
models, including many finite state machine models, may be constructed before the
program they describe, and serve as a kind of intermediate-level specification of
intended behavior. These models can be related to both a higher-level specification of
intended behavior and the actual program they are intended to describe.
The relation between finite state models and programs is elaborated in Chapter 6.
Analysis of models, particularly those involving concurrent execution, is described further
in Chapter 8.
Further Reading
Finite state models of computation have been studied at least since the neural models of
McCulloch and Pitts [MP43], and modern finite state models of programs remain close
to those introduced by Mealy [Mea55] and Moore [Moo56]. Lamport [Lam89] provides
the clearest and most accessible introduction the authors know regarding what a finite
state machine model "means" and what it means for a program to conform to it. Guttag
[Gut77] presents an early explication of the abstraction relation between a model and a
program, and why the abstraction function goes from concrete to abstract and not vice
versa. Finite state models have been particularly important in development of reasoning
and tools for concurrent (multi-threaded, parallel, and distributed) systems; Pezzè,
Taylor, and Young [PTY95] overview finite models of concurrent programs.
Exercises
5.1 We construct large, complex software systems by breaking them into manageable
pieces. Likewise, models of software systems may be decomposed into more
manageable pieces. Briefly describe how the requirements of model compactness,
predictiveness, semantic meaningfulness, and sufficient generality apply to
approaches for modularizing models of programs. Give examples where possible.

5.2 Models are used in analysis, but construction of models from programs often
requires some form of analysis. Why bother, then? If one is performing an initial
analysis to construct a model to perform a subsequent analysis, why not just
merge the initial and subsequent analysis and dispense with defining and
constructing the model? For example, if one is analyzing Java code to construct a
call graph and class hierarchy that will be used to detect overriding of inherited
methods, why not just analyze the source code directly for method overriding?

5.3 Linear code sequence and jump (LCSAJ) makes a distinction between "sequential"
control flow and other control flow. Control flow graphs, on the other hand, make
no distinction between sequential and nonsequential control flow. Considering the
criterion of model predictiveness, is there a justification for this distinction?

5.4 What upper bound can you place on the number of basic blocks in a program,
relative to program size?

5.5 A directed graph is a set of nodes and a set of directed edges. A mathematical
relation is a set of ordered pairs.
    1. If we consider a directed graph as a representation of a relation, can we
       ever have two distinct edges from one node to another?
As with other abstraction functions used in reasoning about programs, the mapping is
from concrete representation to abstract representation, and not from abstract to
concrete. This is because the mapping from concrete to abstract is many-to-one, and its
inverse is therefore not a mathematical function (which by definition maps each object in
the domain set into a single object in the range).
1  public int gcd(int x, int y) {  /* A: def x, y */
2      int tmp;                    /* def tmp */
3      while (y != 0) {            /* B: use y */
4          tmp = x % y;            /* C: def tmp; use x, y */
5          x = y;                  /* D: def x; use y */
6          y = tmp;                /* E: def y; use tmp */
7      }
8      return x;                   /* F: use x */
9  }

Figure 6.1: Java implementation of Euclid's algorithm for calculating the greatest
common divisor of two positive integers. The labels A-F are provided to relate
statements in the source code to graph nodes in subsequent figures.
Each definition-use pair associates a definition of a variable (e.g., the assignment to y in
line 6) with a use of the same variable (e.g., the expression y!=0 in line 3). A single
definition can be paired with more than one use, and vice versa. For example, the
definition of variable y in line 6 is paired with a use in line 3 (in the loop test), as well as
additional uses in lines 4 and 5. The definition of x in line 5 is associated with uses in
lines 4 and 8.
A definition-use pair is formed only if there is a program path on which the value
assigned in the definition can reach the point of use without being overwritten by another
value. If there is another assignment to the same value on the path, we say that the first
definition is killed by the second. For example, the declaration of tmp in line 2 is not
paired with the use of tmp in line 6 because the definition at line 2 is killed by the
definition at line 4. A definition-clear path is a path from definition to use on which the
definition is not killed by another definition of the same variable. For example, with
reference to the node labels in Figure 6.2, path E,B,C,D is a definition-clear path from
the definition of y in line 6 (node E of the control flow graph) to the use of y in line 5
(node D). Path A,B,C,D,E is not a definition-clear path with respect to tmp because of
the intervening definition at node C.
Figure 6.3: Data dependence graph of GCD method in Figure 6.1, with nodes for
statements corresponding to the control flow graph in Figure 6.2. Each directed edge
represents a direct data dependence, and the edge label indicates the variable that
transmits a value from the definition at the head of the edge to the use at the tail of
the edge.
The data dependence graph in Figure 6.3 captures only dependence through flow of
data. Dependence of the body of the loop on the predicate governing the loop is not
represented by data dependence alone. Control dependence can also be represented
with a graph, as in Figure 6.5, which shows the control dependencies for the GCD
method. The control dependence graph shows direct control dependencies, that is,
where execution of one statement controls whether another is executed. For example,
execution of the body of a loop or if statement depends on the result of a predicate.
Control dependence differs from the sequencing information captured in the control flow
graph. The control flow graph imposes a definite order on execution even when two
statements are logically independent and could be executed in either order with the
same results. If a statement is control- or data-dependent on another, then their order of
execution is not arbitrary. Program dependence representations typically include both
data dependence and control dependence information in a single graph with the two
kinds of information appearing as different kinds of edges among the same set of nodes.
A node in the control flow graph that is reached on every execution path from entry point
to exit is control dependent only on the entry point. For any other node N, reached on
some but not all execution paths, there is some branch that controls execution of N in the
sense that, depending on which way execution proceeds from the branch, execution of N
either does or does not become inevitable. It is this notion of control that control
dependence captures.
The notion of dominators in a rooted, directed graph can be used to make this intuitive
notion of "controlling decision" precise. Node M dominates node N if every path from the
root of the graph to N passes through M. A node will typically have many dominators,
but except for the root, there is a unique immediate dominator of node N, which is
closest to N on any path from the root and which is in turn dominated by all the other
dominators of N. Because each node (except the root) has a unique immediate
dominator, the immediate dominator relation forms a tree.
The point at which execution of a node becomes inevitable is related to paths from a
node to the end of execution - that is, to dominators that are calculated in the reverse of
the control flow graph, using a special "exit" node as the root. Dominators in this
direction are called post-dominators, and dominators in the normal direction of execution
can be called pre-dominators for clarity.
We can use post-dominators to give a more precise definition of control dependence.
Consider again a node N that is reached on some but not all execution paths. There
must be some node C with the following property: C has at least two successors in the
control flow graph (i.e., it represents a control flow decision); C is not post-dominated by
N (N is not already inevitable when C is reached); and there is a successor of C in the
control flow graph that is post-dominated by N. When these conditions are true, we say
node N is control-dependent on node C. Figure 6.4 illustrates the control dependence
calculation for one node in the GCD example, and Figure 6.5 shows the control
dependence relation for the method as a whole.
Figure 6.4: Calculating control dependence for node E in the control flow graph of the
GCD method. Nodes C, D, and E in the gray region are post-dominated by E; that is,
execution of E is inevitable in that region. Node B has successors both within and
outside the gray region, so it controls whether E is executed; thus E is
control-dependent on B.
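The condition just stated can be checked mechanically once post-dominators are known. A sketch (Python; the edge structure of the GCD graph is assumed from the text, with the loop test B branching to the loop body C or to the return F):

```python
# Post-dominators and control dependence for the GCD control flow graph
# (nodes A-F as in Figure 6.2; edges assumed from the text).
succ = {"A": ["B"], "B": ["C", "F"], "C": ["D"],
        "D": ["E"], "E": ["B"], "F": []}
nodes = set(succ)
EXIT = "F"

# postdom[n]: nodes appearing on every path from n to the exit.
# Greatest fixpoint: start from "everything" and shrink to stability.
postdom = {n: set(nodes) for n in nodes}
postdom[EXIT] = {EXIT}
changed = True
while changed:
    changed = False
    for n in nodes - {EXIT}:
        new = {n} | set.intersection(*(postdom[s] for s in succ[n]))
        if new != postdom[n]:
            postdom[n], changed = new, True

def control_dependent(n, c):
    """n is control-dependent on c: c has a choice of successors, c is not
    post-dominated by n, but some successor of c is post-dominated by n."""
    return (len(succ[c]) > 1
            and n not in postdom[c]
            and any(n in postdom[s] for s in succ[c]))

print(control_dependent("E", "B"))  # True, as in Figure 6.4
print(control_dependent("F", "B"))  # False: F is reached on every path
```

The gray region of Figure 6.4 corresponds here to the nodes whose post-dominator sets contain E, namely C, D, and E itself.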
Figure 6.5: Control dependence tree of the GCD method. The loop test and the
return statement are reached on every possible execution path, so they are
control-dependent only on the entry point. The statements within the loop are
control-dependent on the loop test.
This rule can be broken down into two parts to make it a little more intuitive and more
efficient to implement. The first part describes how node E receives values from its
predecessor D, and the second describes how it modifies those values for its
successors:
In this form, we can easily express what should happen at the head of the while loop
(node B in Figure 6.2), where values may be transmitted both from the beginning of the
procedure (node A) and through the end of the body of the loop (node E). The beginning
of the procedure (node A) is treated as an initial definition of parameters and local
variables. (If a local variable is declared but not initialized, it is treated as a definition to
the special value "uninitialized.")
Remarkably, the reaching definitions can be calculated simply and efficiently, first
initializing the reaching definitions at each node in the control flow graph to the empty
set, and then applying these equations repeatedly until the results stabilize. The
algorithm is given as pseudocode in Figure 6.6.
Algorithm Reaching definitions

Input:   A control flow graph with nodes and directed edges;
         pred(n) and succ(n), the predecessors and successors of node n;
         gen(n) and kill(n), the definitions generated and killed at node n.
Output:  Reach(n), the set of definitions reaching each node n.

for n in nodes loop
    ReachOut(n) = {} ;
end loop;
workList = nodes ;
while (workList ≠ {}) loop
    // Take a node off the work list
    n = any node in workList ;
    workList = workList \ {n} ;
    oldVal = ReachOut(n) ;
    // Apply flow equations, propagating values from predecessors
    Reach(n) = ⋃ over m in pred(n) of ReachOut(m) ;
    ReachOut(n) = (Reach(n) \ kill(n)) ∪ gen(n) ;
    if ( ReachOut(n) ≠ oldVal ) then
        // Propagate changed value to successor nodes
        workList = workList ∪ succ(n)
    end if;
end loop;
Figure 6.6: An iterative work-list algorithm to compute reaching definitions by
applying each flow equation until the solution stabilizes.
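The algorithm of Figure 6.6 can be transcribed almost directly. The sketch below (Python; the node and definition encodings are ours) runs it on the GCD graph of Figure 6.2 and confirms that the line-2 definition of tmp is killed at node C before it can reach the use at node E:

```python
# Iterative work-list computation of reaching definitions on the GCD
# control flow graph (nodes A-F of Figure 6.2; defs as annotated there).
from collections import deque

pred = {"A": [], "B": ["A", "E"], "C": ["B"], "D": ["C"],
        "E": ["D"], "F": ["B"]}
succ = {"A": ["B"], "B": ["C", "F"], "C": ["D"], "D": ["E"],
        "E": ["B"], "F": []}
# A definition is encoded as a (variable, defining-node) pair.
gen = {"A": {("x", "A"), ("y", "A"), ("tmp", "A")},
       "B": set(), "C": {("tmp", "C")}, "D": {("x", "D")},
       "E": {("y", "E")}, "F": set()}
all_defs = set().union(*gen.values())
# kill(n): every definition of a variable that n (re)defines.
kill = {n: {d for d in all_defs if any(d[0] == g[0] for g in gen[n])}
        for n in gen}

reach_out = {n: set() for n in pred}
work = deque(pred)
while work:
    n = work.popleft()
    reach = set().union(*(reach_out[m] for m in pred[n])) if pred[n] else set()
    new_out = (reach - kill[n]) | gen[n]
    if new_out != reach_out[n]:
        reach_out[n] = new_out
        work.extend(succ[n])          # propagate the changed value

reach_E = set().union(*(reach_out[m] for m in pred["E"]))
print(("tmp", "A") in reach_E)  # False: the line-2 def of tmp is killed at C
print(("tmp", "C") in reach_E)  # True: this def reaches the use at E
```

Because values only grow and the universe of definitions is finite, the work list is guaranteed to empty and the results to stabilize.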
The similarity to the set equations for reaching definitions is striking. Both propagate
sets of values along the control flow graph in the direction of program execution (they
are forward analyses), and both combine sets propagated along different control flow
paths. However, reaching definitions combines propagated sets using set union, since a
definition can reach a use along any execution path. Available expressions combines
propagated sets using set intersection, since an expression is considered available at a
node only if it reaches that node along all possible execution paths. Thus we say that,
while reaching definitions is a forward, any-path analysis, available expressions is a
forward, all-paths analysis. A work-list algorithm to implement available expressions
analysis is nearly identical to that for reaching definitions, except for initialization and the
flow equations, as shown in Figure 6.7.
G, K, and U can be any events we care to check, so long as we can mark their
occurrences in a control flow graph.
An example problem of this kind is variable initialization. We noted in Chapter 3 that Java
requires a variable to be initialized before use on all execution paths. The analysis that
enforces this rule is an instance of Avail. The tokens propagated through the control flow
graph record which variables have been assigned initial values. Since there is no way to
"uninitialize" a variable in Java, the kill sets are empty. Figure 6.8 repeats the source
code of an example program from Chapter 3. The corresponding control flow graph is
shown with definitions and uses in Figure 6.9 and annotated with gen and kill sets for the
initialized variable check in Figure 6.10.
Figure 6.9: Control flow graph of the source code in Figure 6.8, annotated with
variable definitions and uses.
Figure 6.10: Control flow graph of the source code in Figure 6.8, annotated with gen
and kill sets for checking variable initialization using a forward, all-paths Avail
analysis. (Empty gen and kill sets are omitted.) The Avail set flowing from node G to
node C will be {i, k}, but the Avail set flowing from node B to node C is {i}. The
all-paths analysis intersects these values, so the resulting Avail(C) is {i}. This
value propagates through nodes C and D to node F, which has a use of k as well as a
definition. Since k ∉ Avail(F), a possible use of an uninitialized variable is
detected.
Reaching definitions and available expressions are forward analyses; that is, they
propagate values in the direction of program execution. Given a control flow graph
model, it is just as easy to propagate values in the opposite direction, backward from
nodes that represent the next steps in computation. Backward analyses are useful for
determining what happens after an event of interest. Live variables is a backward
analysis that determines whether the value held in a variable may be subsequently used.
Because a variable is considered live if there is any possible execution path on which it is
used, a backward, any-path analysis is used.
A variable is live at a point in the control flow graph if, on some execution path, its
current value may be used before it is changed. Live variables analysis can be
expressed as set equations as before. Where Reach and Avail propagate values to a
node from its predecessors, Live propagates values from the successors of a node. The
gen sets are variables used at a node, and the kill sets are variables whose values are
replaced. Set union is used to combine values from adjacent nodes, since a variable is
live at a node if it is live at any of the succeeding nodes.
These set equations can be implemented using a work-list algorithm analogous to those
already shown for reaching definitions and available expressions, except that successor
edges are followed in place of predecessors and vice versa.
Like available expressions analysis, live variables analysis is of interest in testing and
analysis primarily as a pattern for recognizing properties of a certain form. A backward,
any-paths analysis allows us to check properties of the following form:
"After D occurs, there is at least one execution path on which G occurs with no
intervening occurrence of K."
Again we choose tokens that represent properties, using gen sets to mark occurrences
of G events (where a property becomes true) and kill sets to mark occurrences of K
events (where a property ceases to be true).
One application of live variables analysis is to recognize useless definitions, that is,
assigning a value that can never be used. A useless definition is not necessarily a
program error, but is often symptomatic of an error. In scripting languages like Perl and
Python, which do not require variables to be declared before use, a useless definition
typically indicates that a variable name has been misspelled, as in the common gateway
interface (CGI) script of Figure 6.11.
class SampleForm(FormData):
    """ Used with Python cgi module
        to hold and validate data
    """
    [remainder of the Figure 6.11 listing not shown]
Figure 6.11: Part of a CGI program (Web form processing) in Python. The
misspelled variable name in the data validation method will be implicitly declared and
will not be rejected by the Python compiler or interpreter, which could allow invalid
data to be treated as valid. The classic live variables data flow analysis can show that
the assignment to valid is a useless definition, suggesting that the programmer
probably intended to assign the value to a different variable.
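A minimal, hypothetical example of the kind of misspelling involved (this is not the code of Figure 6.11): live variables analysis would flag the assignment to the local variable as a useless definition, since the value is never used.

```python
# Hypothetical illustration of a useless definition caused by a
# misspelling; the class and method names are invented.
class FormData:
    def __init__(self):
        self.valid = False

class SampleForm(FormData):
    def validate(self, name):
        # Useless definition: local 'valid' is assigned but never used.
        # The programmer probably intended: self.valid = bool(name.strip())
        valid = bool(name.strip())
        return self.valid

form = SampleForm()
print(form.validate("Alice"))  # False: the validation check had no effect
```

Python accepts the misspelled assignment silently; only an analysis like live variables reveals that the definition can never reach a use.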
We have so far seen a forward, any-path analysis (reaching definitions), a forward, all-paths analysis (available expressions), and a backward, any-path analysis (live variables).
One might expect, therefore, to round out the repertoire of patterns with a backward,
all-paths analysis, and this is indeed possible. Since there is no classical name for this
combination, we will call it "inevitability" and use it for properties of the form
"After D occurs, G always occurs with no intervening occurrence of K"
or, informally,
"D inevitably leads to G before K"
Examples of inevitability checks might include ensuring that interrupts are reenabled after
executing an interrupt-handling routine in low-level code, files are closed after opening
them, and so on.
Consider, for example, an analysis to track "tainted" data, that is, values derived from untrusted input. The gen set at a program point would include any variable that is assigned a tainted value at that point. Sets of tainted variables would be propagated forward to a node from its predecessors, with set union where a node in the control flow graph has more than one predecessor (e.g., the head of a loop).
There is one fundamental difference between such an analysis and the classic data flow
analyses we have seen so far: The gen and kill sets associated with a program point are
not constants. Whether or not the value assigned to a variable is tainted (and thus
whether the variable belongs in the gen set or in the kill set) depends on the set of
tainted variables at that program point, which will vary during the course of the analysis.
There is a kind of circularity here - the gen set and kill set depend on the set of tainted
variables, and the set of tainted variables may in turn depend on the gen and kill set.
Such circularities are common in defining flow analyses, and there is a standard
approach to determining whether they will make the analysis unsound. To convince
ourselves that the analysis is sound, we must show that the output values computed by
each flow equation are monotonically increasing functions of the input values. We will
say more precisely what "increasing" means below.
The determination of whether a computed value is tainted will be a simple function of the
set of tainted variables at a program point. For most operations of one or more
arguments, the output is tainted if any of the inputs are tainted. As in Perl, we may
designate one or a few operations (operations used to check an input value for validity)
as taint removers. These special operations always return an untainted value regardless
of their inputs.
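The taintedness evaluation just described can be sketched as follows. The expression encoding and the "sanitize" taint-remover are assumptions made for illustration, not features of any particular language.

```python
# Sketch of taint evaluation with state-dependent results.
# An expression is a variable name (str), a constant, or a tuple
# (op, arg, ...). 'sanitize' is a designated taint remover.
def is_tainted(expr, tainted_vars):
    if isinstance(expr, str):
        return expr in tainted_vars
    if isinstance(expr, tuple):
        op, *args = expr
        if op == "sanitize":      # always returns an untainted value,
            return False          # regardless of its inputs
        # for ordinary operations, output is tainted if any input is
        return any(is_tainted(a, tainted_vars) for a in args)
    return False                  # constants are untainted

e = ("+", "a", ("*", "c", 2))
print(is_tainted(e, {"a", "b"}))               # True: a is tainted
print(is_tainted(("sanitize", "a"), {"a"}))    # False: taint removed
# Monotonicity: enlarging the tainted set can never turn a tainted
# result into an untainted one.
print(is_tainted(e, {"a", "b", "c"}))          # still True
```

Note that the result for a fixed expression is a monotone function of the input set of tainted variables, which is exactly the property needed for the soundness argument above.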
Suppose we evaluate the taintedness of an expression with the input set of tainted
variables being {a,b}, and again with the input set of tainted variables being {a,b,c}. Even
without knowing what the expression is, we can say with certainty that if the expression
is tainted in the first evaluation, it must also be tainted in the second evaluation, in which
the set of tainted input variables is larger. This also means that adding elements to the
input tainted set can only add elements to the gen set for that point, or leave it the
same, and conversely the kill set can only grow smaller or stay the same. We say that
the computation of tainted variables at a point increases monotonically.
To be more precise, the monotonicity argument is made by arranging the possible
values in a lattice. In the sorts of flow analysis framework considered here, the lattice is
almost always made up of subsets of some set (the set of definitions, or the set of
tainted variables, etc.); this is called a powerset lattice because the powerset of set A is
the set of all subsets of A. The bottom element of the lattice is the empty set, the top is
the full set, and lattice elements are ordered by inclusion as in Figure 6.12. If we can
follow the arrows in a lattice from element x to element y (e.g., from {a} to {a,b,c}), then
we say y > x. A function f is monotonically increasing if x ≤ y implies f(x) ≤ f(y).
Figure 6.12: The powerset lattice of set {a,b,c}. The powerset contains all subsets of
the set and is ordered by set inclusion.
Not only are all of the individual flow equations for taintedness monotonic in this
sense, but in addition the function applied to merge values where control flow paths
come together is also monotonic: if A ⊆ A′ and B ⊆ B′, then A ∪ B ⊆ A′ ∪ B′.
If we have a set of data flow equations that is monotonic in this sense, and if we begin
by initializing all values to the bottom element of the lattice (the empty set in this case),
then we are assured that an iterative data flow analysis will converge on a unique
minimum solution to the flow equations.
The standard data flow analyses for reaching definitions, live variables, and available
expressions can all be justified in terms of powerset lattices. In the case of available
expressions, though, and also in the case of other all-paths analyses such as the one we
have called "inevitability," the lattice must be flipped over, with the empty set at the top
and the set of all variables or propositions at the bottom. (This is why we used the set of
all tokens, rather than the empty set, to initialize the Avail sets in Figure 6.7.)
The analyses described so far have dealt with simple scalar variables. Arrays and pointers complicate matters. Consider, for example, the statements
a[i] = 13;
k = a[j];
Are these two lines a definition-use pair? They are if the values of i and j are equal,
which might be true on some executions and not on others. A static analysis cannot, in
general, determine whether they are always, sometimes, or never equal, so a source of
imprecision is necessarily introduced into data flow analysis.
Pointers and object references introduce the same issue, often in less obvious ways.
Consider the following snippet:
1 a[2] = 42;
2 i = b[2];
It seems that there cannot possibly be a definition-use pair involving these two lines,
since they involve none of the same variables. However, arrays in Java are dynamically
allocated objects accessed through pointers. Pointers of any kind introduce the
possibility of aliasing, that is, of two different names referring to the same storage
location. For example, the two lines above might have been part of the following
program fragment:
1 int[] a = new int[3];
2 int[] b = a;
3 a[2] = 42;
4 i = b[2];
Here a and b are aliases, two different names for the same dynamically allocated array
object, and an assignment to part of a is also an assignment to part of b.
The same phenomenon, and worse, appears in languages with lower-level pointer
manipulation. Perhaps the most egregious example is pointer arithmetic in C:
1 p = &b;
2 *(p + i) = k;
It is impossible to know which variable is defined by the second line. Even if we know
the value of i, the result is dependent on how a particular compiler arranges variables in
memory.
Dynamic references and the potential for aliasing introduce uncertainty into data flow
analysis. In place of a definition or use of a single variable, we may have a potential
definition or use of a whole set of variables or locations that could be aliases of each
other. The proper treatment of this uncertainty depends on the use to which the analysis
will be put. For example, if we seek strong assurance that v is always initialized before it
is used, we may not wish to treat an assignment to a potential alias of v as initialization,
but we may wish to treat a use of a potential alias of v as a use of v.
A useful mental trick for thinking about treatment of aliases is to translate the uncertainty
introduced by aliasing into uncertainty introduced by control flow. After all, data flow
analysis already copes with uncertainty about which potential execution paths will
actually be taken; an infeasible path in the control flow graph may add elements to an
any-paths analysis or remove results from an all-paths analysis. It is usually appropriate
to treat uncertainty about aliasing consistently with uncertainty about control flow. For
example, considering again the first example of an ambiguous reference:
1 a[i] = 13;
2 k = a[j];
we can treat the uncertainty about whether i and j are equal as if the code had been written with an explicit test:
a[i] = 13;
if (i == j) {
    k = a[i];
} else {
    k = a[j];
}
In the (imaginary) transformed code, we could treat all array references as distinct,
because the possibility of aliasing is fully expressed in control flow. Now, if we are using
an any-path analysis like reaching definitions, the potential aliasing will result in creating
a definition-use pair. On the other hand, an assignment to a[j] would not kill a previous
assignment to a[i]. This suggests that, for an any-path analysis, gen sets should include
everything that might be referenced, but kill sets should include only what is definitely
referenced.
If we were using an all-paths analysis, like available expressions, we would obtain a
different result. Because the sets of available expressions are intersected where control
flow merges, a definition of a[i] would make only that expression, and none of its
potential aliases, available. On the other hand, an assignment to a[j] would kill a[i]. This
suggests that, for an all-paths analysis, gen sets should include only what is definitely
referenced, but kill sets should include all the possible aliases.
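The gen/kill choices for may- and must-referenced locations can be summarized in a small sketch; the alias sets below are invented for illustration.

```python
# Sketch: choosing gen and kill sets for an assignment under
# may-alias and must-alias information.
# may_refs[t]:  everything the target t might reference
# must_refs[t]: what the target t definitely references
def gen_kill_for_assignment(target, may_refs, must_refs, all_paths):
    if all_paths:
        # all-paths analysis (e.g., available expressions):
        # gen only what is definitely referenced,
        # kill every possible alias
        return must_refs[target], may_refs[target]
    # any-path analysis (e.g., reaching definitions):
    # gen everything possibly referenced,
    # kill only what is definitely referenced
    return may_refs[target], must_refs[target]

may_refs  = {"a[i]": {"a[i]", "a[j]"}}   # i may equal j
must_refs = {"a[i]": {"a[i]"}}
gen, kill = gen_kill_for_assignment("a[i]", may_refs, must_refs,
                                    all_paths=False)
# any-path: gen = {'a[i]', 'a[j]'}, kill = {'a[i]'}
```

The asymmetry follows directly from the transformed-code view above: possible aliasing adds paths, which enlarges any-path results and shrinks all-paths results.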
Even in analysis of a single procedure, the effect of other procedures must be
considered at least with respect to potential aliases. Consider, for example, this
fragment of a Java method:
[listing not shown: a method with CustInfo parameters fromCust and toCust, from which PhoneNum references fromHome, fromWork, toHome, and toWork are obtained]
We cannot determine whether the two arguments fromCust and toCust are references
to the same object without looking at the context in which this method is called.
Moreover, we cannot determine whether fromHome and fromWork are (or could be)
references to the same object without more information about how CustInfo objects are
treated elsewhere in the program.
Sometimes it is sufficient to treat all nonlocal information as unknown. For example, we
could treat the two CustInfo objects as potential aliases of each other, and similarly treat
the four PhoneNum objects as potential aliases. Sometimes, though, large sets of
aliases will result in analysis results that are so imprecise as to be useless. Therefore
data flow analysis is often preceded by an interprocedural analysis to calculate sets of
aliases or the locations that each pointer or reference can refer to.
Figure 6.13: Spurious execution paths result when procedure calls and returns are
treated as normal edges in the control flow graph. The path (A,X,Y,D) appears in the
combined graph, but it does not correspond to an actual execution
order.
It is possible to represent procedure calls and returns precisely, for example by making
a copy of the called procedure for each point at which it is called. This would result in a
context-sensitive analysis. The shortcoming of context-sensitive analysis was already
mentioned in the previous chapter: The number of different contexts in which a
procedure must be considered could be exponentially larger than the number of
procedures. In practice, a context-sensitive analysis can be practical for a small group of
closely related procedures (e.g., a single Java class), but is almost never a practical
option for a whole program.
Some interprocedural properties are quite independent of context and lend themselves
naturally to analysis in a hierarchical, piecemeal fashion. Such a hierarchical analysis can
be both precise and efficient. The analyses that are provided as part of normal
compilation are often of this sort. The unhandled exception analysis of Java is a good
example: Each procedure (method) is required to declare the exceptions that it may
throw without handling. If method M calls method N in the same or another class, and if
N can throw some exception, then M must either handle that exception or declare that it,
too, can throw the exception. This analysis is simple and efficient because, when
analyzing method M, the internal structure of N is irrelevant; only the results of the
analysis at N (which, in Java, is also part of the signature of N) are needed.
Two conditions are necessary to obtain an efficient, hierarchical analysis like the
exception analysis routinely carried out by Java compilers. First, the information needed
to analyze a calling procedure must be small: It must not be proportional either to the
size of the called procedure, or to the number of procedures that are directly or
indirectly called. Second, it is essential that information about the called procedure be
independent of the caller; that is, it must be context-independent. When these two
conditions are true, it is straightforward to develop an efficient analysis that works
upward from leaves of the call graph. (When there are cycles in the call graph from
recursive or mutually recursive procedures, an iterative approach similar to data flow
analysis algorithms can usually be devised.)
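A bottom-up analysis in the spirit of Java's unhandled-exception checking might be sketched as follows. The method and exception names are invented; real compilers work from declared throws clauses, and the iteration below handles cycles in the call graph.

```python
# Sketch of a hierarchical "may throw" analysis over a call graph.
# may_throw(m) = (local_throws(m) | union of may_throw over callees)
#                minus the exceptions m handles.
def may_throw(calls, local_throws, handles):
    result = {m: set(local_throws[m]) for m in calls}
    changed = True
    while changed:                # iterate: handles mutual recursion
        changed = False
        for m in calls:
            new = set(local_throws[m])
            for callee in calls[m]:
                new |= result[callee]
            new -= handles[m]
            if new != result[m]:
                result[m] = new
                changed = True
    return result

# M calls N; N may throw IOException and neither handles it.
calls = {"M": ["N"], "N": []}
local_throws = {"M": set(), "N": {"IOException"}}
handles = {"M": set(), "N": set()}
print(may_throw(calls, local_throws, handles)["M"])  # {'IOException'}
```

Note that analyzing M needs only the summary set for N, never N's internal structure, which is what makes the hierarchical scheme efficient.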
Unfortunately, not all important properties are amenable to hierarchical analysis.
Potential aliasing information, which is essential to data flow analysis even within
individual procedures, is one of those that are not. We have seen that potential aliasing
can depend in part on the arguments passed to a procedure, so it does not have the
context-independence property required for an efficient hierarchical analysis. For such
an analysis, additional sacrifices of precision must be made for the sake of efficiency.
Even when a property is context-dependent, an analysis for that property may be
context-insensitive, although the context-insensitive analysis will necessarily be less
precise as a consequence of discarding context information. At the extreme, a linear
time analysis can be obtained by discarding both context and control flow information.
Context- and flow-insensitive algorithms for pointer analysis typically treat each
statement of a program as a constraint. For example, on encountering an assignment
1 x = y;
where y is a pointer, such an algorithm simply notes that x may refer to any of the same
objects that y may refer to. References(x) ⊇ References(y) is a constraint that is
completely independent of the order in which statements are executed. A procedure call,
in such an analysis, is just an assignment of values to arguments. Using efficient data
structures for merging sets, some analyzers can process hundreds of thousands of lines
of source code in a few seconds. The results are imprecise, but still much better than
the worst-case assumption that any two compatible pointers might refer to the same
object.
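A context- and flow-insensitive points-to analysis of this kind can be sketched as constraint propagation to a fixpoint. The statement encoding below is a simplification invented for illustration.

```python
# Sketch of flow-insensitive points-to analysis as constraint solving.
# allocs: (x, obj) pairs for "x = new obj"
# copies: (x, y) pairs for "x = y", read as
#         References(x) is a superset of References(y);
#         statement order is deliberately ignored.
def points_to(allocs, copies):
    refs = {}
    for x, obj in allocs:
        refs.setdefault(x, set()).add(obj)
    changed = True
    while changed:                       # propagate to a fixpoint
        changed = False
        for x, y in copies:
            before = set(refs.setdefault(x, set()))
            refs[x] |= refs.get(y, set())
            if refs[x] != before:
                changed = True
    return refs

allocs = [("y", "o1"), ("z", "o2")]
copies = [("x", "y"), ("y", "z")]
refs = points_to(allocs, copies)
print(sorted(refs["x"]))                 # ['o1', 'o2']
```

The result is imprecise (x is reported as possibly referring to o2 regardless of execution order), but it is computed without any control flow or calling-context information.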
The best approach to interprocedural pointer analysis will often lie somewhere between
the astronomical expense of a precise, context- and flow-sensitive pointer analysis and
the imprecision of the fastest context- and flow-insensitive analyses. Unfortunately, there
is not one best algorithm or tool for all uses. In addition to context and flow sensitivity,
important design trade-offs include the granularity of modeling references (e.g., whether
individual fields of an object are distinguished) and the granularity of modeling the
program heap (that is, which allocated objects are distinguished from each other).
Summary
Data flow models are used widely in testing and analysis, and the data flow analysis
algorithms used for deriving data flow information can be adapted to additional uses.
The most fundamental model, complementary to models of control flow, represents the
ways values can flow from the points where they are defined (computed and stored) to
points where they are used.
Data flow analysis algorithms efficiently detect the presence of certain patterns in the
control flow graph. Each pattern involves some nodes that initiate the pattern, some
that conclude it, and some that may interrupt it. The name "data flow analysis"
reflects the historical development of these analyses for compilers, but the same
algorithms can be used to detect patterns involving other kinds of program events.
An any-path analysis determines whether there is any control flow path from the initiation
to the conclusion of a pattern without passing through an interruption. An all-paths
analysis determines whether every path from the initiation necessarily reaches a
concluding node without first passing through an interruption. Forward analyses check
for paths in the direction of execution, and backward analyses check for paths in the
opposite direction. The classic data flow algorithms can all be implemented using simple
work-list algorithms.
A limitation of data flow analysis, whether for the conventional purpose or to check other
properties, is that it cannot distinguish between a path that can actually be executed and
a path in the control flow graph that cannot be followed in any execution. A related
limitation is that it cannot always determine whether two names or expressions refer to
the same object.
Fully detailed data flow analysis is usually limited to individual procedures or a few
closely related procedures (e.g., a single class in an object-oriented program). Analyses
that span whole programs must resort to techniques that discard or summarize some
information about calling context, control flow, or both. If a property is independent of
calling context, a hierarchical analysis can be both precise and efficient. Potential
aliasing is a property for which calling context is significant. There is therefore a tradeoff between very fast but imprecise alias analysis techniques and more precise but much
more expensive techniques.
Further Reading
Data flow analysis techniques were originally developed for compilers, as a systematic
way to detect opportunities for code-improving transformations and to ensure that those
transformations would not introduce errors into programs (an all-too-common experience
with early optimizing compilers). The compiler construction literature remains an
important source of reference information for data flow analysis, and the classic "Dragon
Book" text [ASU86] is a good starting point.
Fosdick and Osterweil recognized the potential of data flow analysis to detect program
errors and anomalies that suggested the presence of errors more than two decades ago
[FO76]. While the classes of data flow anomaly detected by Fosdick and Osterweil's
system have largely been obviated by modern strongly typed programming languages,
such anomalies are still quite common in modern scripting and prototyping languages. Olender and
Osterweil later recognized that the power of data flow analysis algorithms for
recognizing execution patterns is not limited to properties of data flow, and developed a
system for specifying and checking general sequencing properties [OO90, OO92].
Interprocedural pointer analyses - either directly determining potential aliasing relations,
or deriving a "points-to" relation from which aliasing relations can be derived - remains
an area of active research. At one extreme of the cost-versus-precision spectrum of
analyses are completely context- and flow-insensitive analyses like those described by
Steensgaard [Ste96]. Many researchers have proposed refinements that obtain
significant gains in precision at small costs in efficiency. An important direction for future
work is obtaining acceptably precise analyses of a portion of a large program, either
because a whole program analysis cannot obtain sufficient precision at acceptable cost
or because modern software development practices (e.g., incorporating externally
developed components) mean that the whole program is never available in any case.
Rountev et al. present initial steps toward such analyses [RRL99]. A very readable
overview of the state of the art and current research directions (circa 2001) is provided
by Hind [Hin01].
Exercises
For a graph G = (N,V) with a root r ∈ N, node m dominates node n if every path
from r to n passes through m. The root node is dominated only by itself.
The relation can be restated using flow equations.
1. When dominance is restated using flow equations, will it be stated in the
form of an any-path problem or an all-paths problem? Forward or
backward? What are the tokens to be propagated, and what are the gen
and kill sets?
2. Give a flow equation for Dom(n).
3. If the flow equation is solved using an iterative data flow analysis, what
should the set Dom(n) be initialized to at each node n?
4. Write a program that reads a directed graph and computes its Dom
relation using iterative data flow analysis. The first line of input
to your program is an integer between 1 and 100 indicating the
number k of nodes in the graph. Each subsequent line of input will
consist of two integers, m and n, representing an edge from node m
to node n. Node 0 designates the root, and all other nodes are
designated by integers between 0 and k−1. The end of the input is
signaled by the pseudo-edge (−1,−1).
The output of your program should be a sequence of lines, each
containing two integers separated by blanks. Each line represents one
edge of the Dom relation of the input graph.
5. The Dom relation itself is not a tree. The immediate dominators relation is
a tree. Write flow equations to calculate immediate dominators, and then
modify the program from part (4) to compute the immediate dominance
relation.
Overview
Symbolic execution builds predicates that characterize the conditions
under which execution paths can be taken and the effect of the
execution on program state. Extracting predicates through symbolic
execution is the essential bridge from the complexity of program
behavior to the simpler and more orderly world of logic. It finds
important applications in program analysis, in generating test data,
and in formal verification[1] (proofs) of program correctness.
Conditions under which a particular control flow path is taken can be
determined through symbolic execution. This is useful for identifying
infeasible program paths (those that can never be taken) and paths
that could be taken when they should not. It is fundamental to
generating test data to execute particular parts and paths in a
program.
Deriving a logical representation of the effect of execution is essential in methods that
compare a program's possible behavior to a formal specification. We have noted in earlier
chapters that proving the correctness of a program is seldom an achievable or useful goal.
Nonetheless the basic methods of formal verification, including symbolic execution, underpin
practical techniques in software analysis and testing. Symbolic execution and the techniques
of formal verification find use in several domains:
Rigorous proofs of properties of (small) critical subsystems, such as a safety kernel
of a medical device;
 1
 2 /** Binary search for key in sorted array dictKeys, returning
 3  *  corresponding value from dictValues or null if key does
 4  *  not appear in dictKeys. Standard binary search algorithm,
 5  *  as described in any elementary text on data structures and algorithms.
 6  **/
 7
 8 char * binarySearch( char *key, char *dictKeys[ ], char *dictValues[ ],
 9                      int dictSize) {
10
11     int low = 0;
12     int high = dictSize - 1;
13     int mid;
14     int comparison;
15
16     while (high >= low) {
17         mid = (high + low) / 2;
18         comparison = strcmp( dictKeys[mid], key );
19         if (comparison < 0) {
20             /* dictKeys[mid] too small; look higher */
21             low = mid + 1;
22         } else if ( comparison > 0) {
23             /* dictKeys[mid] too large; look lower */
24             high = mid - 1;
25         } else {
26             /* found */
27             return dictValues[mid];
28         }
29     }
30     return 0; /* null means not found */
31 }
32
32
Figure 7.2: Hand-tracing an execution step with concrete values (left) and symbolic
values (right).
When tracing execution with concrete values, it is clear enough what to do with a branch
statement, for example, an if or while test: The test predicate is evaluated with the
current values, and the appropriate branch is taken. If the values bound to variables are
symbolic expressions, however, both the True and False outcomes of the decision may
be possible. Execution can be traced through the branch in either direction, and
execution of the test is interpreted as adding a constraint to record the outcome. For
example, consider
If we trace execution of the test assuming a True outcome (leading to a second iteration
of the loop), the loop condition becomes a constraint in the symbolic state immediately
after the while test:
Later, when we consider the branch assuming a False outcome of the test, the new
constraint is negated and becomes
or, equivalently,
.
Execution can proceed in this way down any path in the program. One can think of
"satisfying" the predicate by finding concrete values for the symbolic variables that make
it evaluate to True; this corresponds to finding data values that would force execution of
that program path. If no such satisfying values are possible, then that execution path
cannot be executed with any data values; we say it is an infeasible path.
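One can sketch the feasibility check as a search for a satisfying assignment. Real tools use constraint solvers; the exhaustive search over a tiny integer domain below is only for illustration, and the constraint encoding is an assumption.

```python
# Sketch: a path is feasible iff its path condition is satisfiable.
# Constraints are predicates over an environment of symbolic variables;
# we brute-force a small domain in place of a real constraint solver.
from itertools import product

def satisfiable(constraints, variables, domain=range(-3, 4)):
    for values in product(domain, repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(c(env) for c in constraints):
            return env        # concrete inputs forcing this path
    return None               # no witness: the path is infeasible

# A path that takes the True branch of (H >= L):
feasible = satisfiable([lambda e: e["H"] >= e["L"]], ["H", "L"])
# A path whose condition contains a contradiction:
infeasible = satisfiable([lambda e: e["H"] >= e["L"],
                          lambda e: e["H"] < e["L"]], ["H", "L"])
print(feasible is not None)   # True
print(infeasible)             # None
```

A returned environment is exactly a set of concrete data values that would force execution down the traced path.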
A weakened path condition W is necessary to execute the path, but it may not be
sufficient. Showing that W cannot be satisfied is still tantamount to showing that the
execution path is infeasible.
Here we interpret s ≤ t for strings as indicating lexical order consistent with the C library
strcmp; that is, we assume that s ≤ t whenever strcmp(s,t) ≤ 0. For convenience we
will abbreviate the predicate above as sorted.
We can associate the following assertion with the while statement at line 16: for all i
with 0 ≤ i < size, if dictKeys[i] = key then low ≤ i ≤ high.
In other words, we assert that the key can appear only between low and high, if it
appears anywhere in the array. We will abbreviate this condition as inrange.
Inrange must be true when we first reach the loop, because at that point the range
low..high is the same as 0..size−1. For each path through the body of the loop, the
symbolic executor would begin with the invariant assertion above, and determine that it
is true again after following that path. We say the invariant is preserved.
While the inrange predicate should be true on each iteration, it is not the complete loop
invariant. The sorted predicate remains true and will be used in reasoning. In principle it
is also part of the invariant, although in informal reasoning we may not bother to write it
down repeatedly. The full invariant is therefore sorted ∧ inrange.
Let us consider the path from line 16 through line 21 and back to the loop test. We begin
by assuming that the loop invariant assertion holds at the beginning of the segment.
Where expressions in the invariant refer to program variables whose values may
change, they are replaced by symbols representing the initial values of those variables.
The variable bindings will be low = L and high = H, where L and H are symbols
standing for the values of low and high at the start of the path. We need not
introduce symbols to represent the values of dictKeys, dictVals, key, or size. Since
those variables are not changed in the procedure, we can use the variable names
directly. The condition, instantiated with symbolic values, will be the inrange
assertion with low and high replaced by the symbols L and H: for all i with
0 ≤ i < size, if dictKeys[i] = key then L ≤ i ≤ H.
Passing through the while test into the body of the loop adds the clause H ≥ L to this
condition. Execution of line 17 adds a binding of ⌊(H + L)/2⌋ to variable mid, where ⌊x⌋
is the integer obtained by rounding x toward zero. As we have discussed, this can be
simplified with an assertion, binding mid to a fresh symbol M and adding the
condition L ≤ M ≤ H.
Tracing the execution path into the first branch of the if statement to line 21, we add the
constraint that strcmp(dictKeys[mid], key) returns a negative value, which we interpret
as meaning the probed entry is lexically less than the string value of the key. Thus we
arrive at the symbolic constraint dictKeys[M] < key, conjoined with the condition above.
The assignment in line 21 then modifies a variable binding without otherwise disturbing
the conditions, giving us the new binding low = M + 1.
Finally, we trace execution back to the while test at line 16. Now our obligation is to
show that the invariant still holds when instantiated with the changed set of variable
bindings. The sorted condition has not changed, and showing that it is still true is trivial.
The interesting part is the inrange predicate, which is instantiated with the new value
M + 1 for low and thus becomes: if dictKeys[i] = key for any i, then M + 1 ≤ i ≤ H.
Now the verification step is to show that this predicate is a logical consequence of the
predicate describing the program state. This step requires purely logical and
mathematical reasoning, and might be carried out either by a human or by a theorem-proving tool. It no longer depends in any way upon the program. The task performed by
the symbolic executor is essentially to transform a question about a program (is the
invariant preserved on a particular path?) into a question of logic alone.
The path through the loop on which the probed key is too large, rather than too small,
proceeds similarly. The path on which the probed key matches the sought key returns
from the procedure, and our obligation there (trivial in this case) is to verify that the
contract of the procedure has been met.
The other exit from the procedure occurs when the loop terminates without locating a
matching key. The contract of the procedure is that it should return the null pointer
(represented in the C language by 0) only if the key appears nowhere in
dictKeys[0..size-1]. Since the null pointer is returned whenever the loop terminates, the
postcondition of the loop is that key is not present in dictKeys.
The loop invariant is used to show that the postcondition holds when the loop terminates.
What symbolic execution can verify immediately after a loop is that the invariant is true
but the loop test is false. Thus we have the instantiated invariant sorted ∧ inrange
conjoined with ¬(high ≥ low), that is, L > H.
Knowing that presence of the key in the array implies L ≤ H, and that in fact L > H, we
can conclude that the key is not present. Thus the postcondition is established, and the
procedure fulfills its contract by returning the null pointer in this case.
Finding and verifying a complete set of assertions, including an invariant assertion for
each loop, is difficult in practice. Even the small example above is rather tedious to verify
by hand. More realistic examples can be quite demanding even with the aid of symbolic
execution tools. If it were easy or could be fully automated, we might routinely use this
method to prove the correctness of programs. Writing down a full set of assertions
formally, and rigorously verifying them, is usually reserved for small and extremely
critical modules, but the basic approach we describe here can also be applied in a much
less formal manner and is quite useful in finding holes in an informal correctness
argument.
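The same invariant can also be checked dynamically, which is one informal way to probe for holes in a correctness argument. The Python transliteration of the binary search below is an illustration, not the book's C code.

```python
# Runtime check of the 'inrange' loop invariant: if the key occurs
# anywhere in keys, it occurs within keys[low..high]. A dynamic check
# is far weaker than a proof, but cheap to try on many inputs.
def binary_search(key, keys, values):
    low, high = 0, len(keys) - 1
    while high >= low:
        # the invariant asserted at the top of each iteration
        assert all(low <= i <= high
                   for i in range(len(keys)) if keys[i] == key)
        mid = (high + low) // 2
        if keys[mid] < key:
            low = mid + 1       # probed entry too small; look higher
        elif keys[mid] > key:
            high = mid - 1      # probed entry too large; look lower
        else:
            return values[mid]  # found
    return None                 # key not present

keys = ["ant", "bee", "cat", "dog"]
vals = [1, 2, 3, 4]
print(binary_search("cat", keys, vals))  # 3
print(binary_search("eel", keys, vals))  # None
```

If the invariant were wrong (for example, updating low to mid rather than mid + 1 in a variant of the code), some execution would trip the assertion, pointing directly at the flawed reasoning step.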
A Hoare triple has the form {pre} block {post}. The meaning of this triple is that if the
program is in a state satisfying the precondition pre at entry to the block, then after
execution of the block it will be in a state satisfying the postcondition post.
There are standard templates, or schemata, for reasoning with triples. In the previous
section we were following this schema for reasoning about while loops:

    {I ∧ C} S {I}
    ─────────────────────────
    {I} while(C) S {I ∧ ¬C}

The formula above the line is the premise of an inference, and the formula below the line
is the conclusion. An inference rule states that if we can verify the premise, then we can
infer the conclusion. The premise of this inference rule says that the loop body preserves
invariant I: If the invariant I is true before the loop, and if the condition C governing the
loop is also true, then the invariant is established again after executing the loop body S.
The conclusion says that the loop as a whole takes the program from a state in which
the invariant is true to a state satisfying a postcondition composed of the invariant and
the negation of the loop condition.
The important characteristic of these rules is that they allow us to compose proofs about
small parts of the program into proofs about larger parts. The inference rule for while
allows us to take a triple about the body of a loop and infer a triple about the whole
loop. There are similar rules for building up triples describing other kinds of program constructs.
This style of reasoning essentially lets us summarize the effect of a block of program
code by a precondition and a postcondition. Most importantly, we can summarize the
effect of a whole procedure in the same way. The contract of the procedure is a
precondition (what the calling client is required to provide) and a postcondition (what the
called procedure promises to establish or return). Once we have characterized the
contract of a procedure in this way, we can use that contract wherever the procedure is
called. For example, we might summarize the effect of the binary search procedure this way: if the table of keys is sorted on entry (precondition), then the procedure returns the value associated with the given key, or null if the key is not present (postcondition).
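For instance, a client-side wrapper can check such a contract at run time. This Java sketch is ours: Arrays.binarySearch stands in for the book's binary search procedure, keys are assumed sorted and distinct, and the wrapper checks the precondition (sortedness) and the postcondition (a null result exactly when the key is absent).

```java
import java.util.Arrays;

public class ContractDemo {
    // Hypothetical contract-checking wrapper around a binary search.
    // Precondition: keys is sorted (distinct keys assumed).
    // Postcondition: result is the value paired with key, or null if absent.
    static Integer lookup(int[] keys, Integer[] values, int key) {
        for (int k = 1; k < keys.length; k++)            // precondition check
            if (keys[k - 1] > keys[k]) throw new AssertionError("unsorted");
        int j = Arrays.binarySearch(keys, key);
        Integer result = (j >= 0) ? values[j] : null;
        boolean present = false;                          // postcondition check
        for (int k : keys) if (k == key) present = true;
        if (present == (result == null)) throw new AssertionError("contract");
        return result;
    }
    public static void main(String[] args) {
        int[] keys = {2, 5, 9};
        Integer[] vals = {20, 50, 90};
        System.out.println(lookup(keys, vals, 5)); // 50
        System.out.println(lookup(keys, vals, 7)); // null
    }
}
```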
Explicit consideration of the abstract model, abstraction function, and structural invariant
of a class or other data structure model is the basis not only of formal or informal
reasoning about correctness, but also of designing test cases and test oracles.
Summary
Symbolic execution is a bridge from an operational view of program execution to logical
and mathematical statements. The basic symbolic execution technique is like hand
execution using symbols rather than concrete values. To use symbolic execution for
loops, procedure calls, and data structures encapsulated in modules (e.g., classes), it is
necessary to proceed hierarchically, composing facts about small parts into facts about
larger parts. Compositional reasoning is closely tied to strategies for specifying intended
behavior.
Symbolic execution is a fundamental technique that finds many different applications.
Test data generators use symbolic execution to derive constraints on input data. Formal
verification systems combine symbolic execution to derive logical predicates with
theorem provers to prove them. Many development tools use symbolic execution
techniques to perform or check program transformations, for example, unrolling a loop
for performance or refactoring source code.
Human software developers can seldom carry out symbolic execution of program code
in detail, but often use it (albeit informally) for reasoning about algorithms and data
structure designs. The approach to specifying preconditions, postconditions, and
invariants is also widely used in programming, and is at least partially supported by tools
for run-time checking of assertions.
Further Reading
The techniques underlying symbolic execution were developed by Floyd [Flo67] and
Hoare [Hoa69], although the fundamental ideas can be traced all the way back to Turing
and the beginnings of modern computer science. Hantler and King [HK76] provide an
excellent clear introduction to symbolic execution in program verification. Kemmerer and
Eckman [KE85] describe the design of an actual symbolic execution system, with
discussion of many pragmatic details that are usually glossed over in theoretical
descriptions.
Generation of test data using symbolic execution was pioneered by Clarke [Cla76], and
Howden [How77, How78] described an early use of symbolic execution to test
programs. The PREfix tool described by Bush, Pincus, and Sielaff [BPS00] is a modern
application of symbolic testing techniques with several refinements and simplifications for
adequate performance on large programs.
Exercises
7.1 We introduce symbols to represent variables whose value may change, but we do not bother to introduce symbols for variables whose value remains unchanged in the code we are symbolically executing. Why are new symbols necessary in the former case but not in the latter?

7.2 Demonstrate that the statement return dictValues[mid] at line 27 of the binary search program of Figure 7.1 always returns the value of the input key.

7.3 Compute an upper bound to the number of iterations through the while loop of the binary search program of Figure 7.1.

7.4 The body of the loop of the binary search program of Figure 7.1 can be modified as follows:

       if (comparison < 0) {
           /* dictKeys[mid] too small; look higher */
           low = mid + 1;
       }
       if (comparison > 0) {
           /* dictKeys[mid] too large; look lower */
           high = mid - 1;
       }
       if (comparison == 0) {
           /* found */
           return dictValues[mid];
       }

    Demonstrate that the path that traverses the false branch of all three statements is infeasible.

7.5 Write the pre- and postconditions for a program that finds the index of the maximum element in a nonempty set of integers.
8.1 Overview
Most important properties of program execution are undecidable in general, but finite
state verification can automatically prove some significant properties of a finite model of
the infinite execution space. Of course, there is no magic: We must carefully reconcile
and balance trade-offs among the generality of the properties to be checked, the class
of programs or models that can be checked, computational effort, and human effort to
use the techniques.
Symbolic execution and formal reasoning can prove many properties of program
behavior, but the power to prove complex properties is obtained at the cost of devising
complex conditions and invariants and expending potentially unbounded computational
effort. Construction of control and data flow models, on the other hand, can be fully and
efficiently automated, but is typically limited to very simple program properties. Finite
state verification borrows techniques from symbolic execution and formal verification, but
like control and data flow analysis, applies them to models that abstract the potentially
infinite state space of program behavior into finite representations. Finite state
verification techniques fall between basic flow analyses and full-blown formal verification
in the richness of properties they can address and in the human guidance and
computational effort they require.
Since even simple properties of programs are undecidable in general, one cannot expect
an algorithmic technique to provide precise answers in all cases. Often finite state
verification is used to augment or substitute for testing when the optimistic inaccuracy of
testing (due to examining only a sample of the program state space) is unacceptable.
Techniques are therefore often designed to provide results that are tantamount to formal
proofs of program properties. In trade for this assurance, both the programs and
properties that can be checked are severely restricted. Restrictions on program
constructs typically appear in procedures for deriving a finite state model from a
program, generating program code from a design model, or verifying consistency
between a program and a separately constructed model.
Finite state verification techniques include algorithmic checks, but it is misleading to
characterize them as completely automated. Human effort and considerable skill are
usually required to prepare a finite state model and a suitable specification for the
automated analysis step. Very often there is an iterative process in which the first
several attempts at verification produce reports of impossible or unimportant faults,
which are addressed by repeatedly refining the specification or the model.
The automated step can be computationally costly, and the computational cost can
impact the cost of preparing the model and specification. A considerable amount of
manual effort may be expended just in obtaining a model that can be analyzed within
available time and memory, and tuning a model or specification to avoid combinatorial
explosion is itself a demanding task. The manually prepared model must also be shown consistent with the actual program, typically by conformance testing, treating the model as a kind of specification. The combination of finite state
verification and conformance testing is often more effective than directly testing for the
property of interest, because a discrepancy that is easily discovered in conformance
testing may very rarely lead to a run-time violation of the property (e.g., it is much easier
to detect that a particular lock is not held during access to a shared data structure than
to catch the occasional data race that the lock protects against).
A property to be checked can be implicit in a finite state verification tool (e.g., a tool
specialized just for detecting potential null pointer references), or it may be expressed in
a specification formalism that is purposely limited to a class of properties that can be
effectively verified using a particular checking technique. Often the real property of
interest is not amenable to efficient automated checking, but a simpler and more
restrictive property is. That is, the property checked by a finite state verification tool may
be sufficient but not necessary for the property of interest. For example, verifying
freedom from race conditions on a shared data structure is much more difficult than
verifying that some lock is always held by threads accessing that structure; the latter is
a sufficient but not necessary condition for the former. This means that we may exclude
correct software that we are not able to verify, but we can be sure that the accepted
software satisfies the property of interest.
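In Java, the sufficient condition "some lock is always held by threads accessing the structure" can even be asserted directly, since Thread.holdsLock reports whether the current thread owns a given monitor. A small sketch (the class and method names are ours):

```java
public class LockDiscipline {
    private final Object lock = new Object();
    private int shared = 0;

    // Sufficient (but not necessary) condition for race freedom:
    // the monitor must be held on every access to the shared field.
    int increment() {
        if (!Thread.holdsLock(lock))
            throw new IllegalStateException("lock not held");
        return ++shared;
    }

    int safeIncrement() {
        synchronized (lock) { return increment(); }
    }

    public static void main(String[] args) {
        LockDiscipline d = new LockDiscipline();
        System.out.println(d.safeIncrement()); // 1
        try {
            d.increment();  // violates the lock discipline
        } catch (IllegalStateException e) {
            System.out.println("violation detected");
        }
    }
}
```

A violation of the discipline is detected on every unlocked access, whereas the data race it guards against might surface only under rare schedules.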
[1] Note that one may independently derive several different models from one program, but deriving one program from several different models is much more difficult.
[Excerpt of the Java code of Figure 8.2 (lines 43-61), ending: return theValues[i].getX() + theValues[i].getY();]
Figure 8.3: Finite state models of individual threads executing the lookup and reInit
methods from Figure 8.2. Each state machine may be replicated to represent
concurrent threads executing the same method.
Java threading rules ensure that in a system state in which one thread has obtained a
monitor lock, the other thread cannot make a transition to obtain the same lock. We can
observe that the locking prevents both threads from concurrently calling the initialize
method. However, another race condition is possible, between two concurrent threads
each executing the lookup method.
Tracing possible executions by hand - "desk checking" multi-threaded execution - is
capable in principle of finding the race condition between two concurrent threads
executing the lookup method, but it is at best tedious and in general completely
impractical. Fortunately, it can be automated, and many state space analysis tools can
explore millions of states in a short time. For example, a model of the faulty code from
Figure 8.2 was coded in the Promela modeling language and submitted to the Spin
verification tool. In a few seconds, Spin systematically explored the state space and
reported a race condition, as shown in Figure 8.5.
Depth=10 States=51 Transitions=92 Memory=2.30
pan: assertion violated !(modifying) (at depth 17)
pan: wrote pan_in.trail
(Spin Version 4.2.5 -- 2 April 2005)
...
0.16 real 0.00 user 0.03 sys
Figure 8.5: Excerpts of Spin verification tool transcript. Spin has performed a depth-first search of possible executions of the model, exploring 10 states and 51 state
transitions in 0.16 seconds before finding a sequence of 17 transitions from the initial
state of the model to a state in which one of the assertions in the model evaluates to
False.
A few seconds of automated analysis to find a critical fault that can elude extensive
testing seems a very attractive option. Indeed, finite state verification should be a key
component of strategies for eliminating faults in multi-threaded and distributed programs,
as well as some kinds of security problems (which are similarly resistant to systematic
sampling in conventional program testing) and some other domains. On the other hand,
we have so far glossed over several limitations and problems of state space exploration,
each of which also appears in other forms of finite state verification. We will consider
two fundamental and related issues in the following sections: the size of the state space
to be explored, and the challenge of obtaining a model that is sufficiently precise without
making the state space explosion worse.
A guarded command like the one in Figure 8.4 can be used to represent acquiring a monitor lock, because execution
blocks at this point until locked has the value False. The guard is enclosed in an
atomic block to prevent another process taking the lock between evaluation of the
guard condition and execution of the statement.
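A rough Java analog of such an atomic guard (this correspondence is ours, not the text's) is compareAndSet on an AtomicBoolean, which tests the guard condition and performs the assignment in one indivisible step:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GuardDemo {
    static final AtomicBoolean locked = new AtomicBoolean(false);

    // Promela-style atomic guard: atomic { !locked -> locked = true }
    // compareAndSet evaluates the condition and assigns in one indivisible
    // step, so no other thread can take the lock in between.
    static void acquire() {
        while (!locked.compareAndSet(false, true)) {
            Thread.yield();  // busy-wait; a real lock would block the thread
        }
    }

    static void release() {
        locked.set(false);
    }

    public static void main(String[] args) {
        acquire();
        System.out.println("held: " + locked.get());  // held: true
        release();
        System.out.println("held: " + locked.get());  // held: false
    }
}
```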
The concept of enabling or blocking in guarded commands is used in conditional and
looping constructs. Alternatives in an if … fi construct, marked syntactically with ::,
begin with guarded commands. If none of the alternatives is enabled (all of the guards
evaluate to False), then the whole if construct blocks. If more than one of the guarded
alternatives is enabled, the if construct does not necessarily choose the first among
them, as a programmer might expect from analogous if else if constructs in
conventional programming languages. Any of the enabled alternatives can be
nondeterministically chosen for execution; in fact the Spin tool will consider the
possible consequences of each choice. The do … od construct similarly chooses
nondeterministically among enabled alternatives, but repeats until a break or goto is
evaluated in one of the guarded commands.
The simplest way to check properties of a Promela model is with assertions, like the
two assert statements in Figure 8.4. Spin searches for any possible execution
sequence in which an assertion can be violated. Sequencing properties can also be
specified in the form of temporal logic formulas, or encoded as state machines.
Figure 8.6: A Spin guided simulation trace describes each of the 17 steps from the
initial model state to the state in which the assertion !(modifying) is violated. For
example, in step 8, one of the two processes (threads) simulating execution of the
Lookup method sets the global variable modifying to True, represented as the integer
value 1. A graphical representation of this trace is presented in Figure 8.7.
Figure 8.7: A graphical interpretation of Spin guided simulation output (Figure 8.6) in
terms of Java source code (Figure 8.2) and state machines (Figure
8.3).
Safety and Liveness Properties
Properties of concurrent systems can be divided into simple safety properties,
sequencing safety properties, and liveness properties.
Simple safety properties divide states of the system into "good" (satisfying the
property) and "bad" (violating the property). They are easiest to specify, and least
expensive to check, because we can simply provide a predicate to be evaluated at
each state. Often simple safety properties relate the local state of one process to
local states of other processes. For example, the assertion assert(! modifying) in the
Promela code of Figure 8.4 states a mutual exclusion property between two
instances of the lookup process. When simple safety properties are expressed in temporal logic, they have the form □p ("always p"), where p is a simple predicate with no temporal modalities.
Safety properties about sequences of events are similar, but treat the history of
events preceding a state as an attribute of that state. For example, an assertion that
two operations a and b strictly alternate is a safety property about the history of
those events; a "bad" state is one in which a or b is about to be performed out of
order. Sequencing properties can be specified in temporal logic, but do not require it:
They are always equivalent to simple safety properties embedded in an "observer"
process. Checking a sequencing property adds the same degree of complexity to the
verification process as adding an explicit observer process, whether there is a real
observer (which is straightforward to encode for some kinds of model, and nearly
impossible for others) or whether the observer is implicit in the checking algorithm (as
it would be using a temporal logic predicate with the Spin tool).
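Such an observer can be sketched as a small state machine that enters an absorbing "violation" state when a or b occurs out of order (the Java encoding and event names are ours):

```java
public class AlternationObserver {
    // Observer for the safety property "a and b strictly alternate,
    // starting with a". Any out-of-order event drives the observer
    // into an absorbing "violation" state.
    private int expect = 0;          // 0: expecting 'a', 1: expecting 'b'
    private boolean violated = false;

    void event(char e) {
        if (violated) return;                     // trap state is absorbing
        if (e == 'a' && expect == 0) expect = 1;
        else if (e == 'b' && expect == 1) expect = 0;
        else violated = true;                     // "bad" state reached
    }

    boolean ok() { return !violated; }

    public static void main(String[] args) {
        AlternationObserver good = new AlternationObserver();
        for (char e : "abab".toCharArray()) good.event(e);
        System.out.println(good.ok());  // true

        AlternationObserver bad = new AlternationObserver();
        for (char e : "abb".toCharArray()) bad.event(e);
        System.out.println(bad.ok());   // false
    }
}
```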
True liveness properties, sometimes called "eventuality" properties, are those that
can only be violated by an infinite length execution. For example, if we assert that p
must eventually be true (◇p), the assertion is violated only by an execution that runs
forever with p continuously false. Liveness properties are useful primarily as a way of
abstracting over sequences of unknown length. For example, fairness properties are
an important class of liveness properties. When we say, for example, that a mutual
exclusion protocol must be fair, we do not generally mean that all processes have an
equal chance to obtain a resource; we merely assert that no process can be starved
forever. Liveness properties (including fairness properties) must generally be stated in
temporal logic, or encoded directly in a Büchi automaton that appears similar to a
deterministic finite state acceptor but has different rules for acceptance. A finite state
verification tool finds violations of liveness properties by searching for execution loops
in which the predicate that should eventually be true remains false; this adds
considerably to the computational cost of verification.
A common mnemonic for safety and liveness is that safety properties say "nothing
bad happens," while liveness properties say "something good eventually happens."
Properties involving real time (e.g., "the stop signal is sent within 5 seconds of
receiving the damage signal") are technically safety properties in which the "bad thing"
is expiration of a timer. However, naive models involving time are so expensive that it
is seldom practical to simply add a clock to a model and use simple safety properties.
Usually it is best to keep reasoning about time separate from verifying untimed
properties with finite state verification.
[2] In fact even a correctly implemented double-check pattern can fail in Java due to properties of the Java memory model, as discussed below.
18    rightFork!Down;
19    leftFork!Down;
20    myState = Thinking;
21    od;
22 }
23
24 #define NumSeats 10
25 chan forkInterface[NumSeats] = [0] of {mtype} ;
26
27 init {
28    int i=0;
29    do :: i < NumSeats ->
30          run fork( forkInterface[i] );
31          i=i+1;
32       :: i >= NumSeats -> break;
33    od;
34    i=0;
35    do :: i < NumSeats ->
36          run philosopher( forkInterface[i], forkInterface[
37          i=i+1;
38       :: i >= NumSeats-1 -> break;
39    od;
40 }
Figure 8.8: The classic dining philosophers problem in Promela. The number of
unique states explored before finding the potential deadlock (with default settings)
grows from 145 with 5 philosophers, to 18,313 with 10 philosophers, to 148,897 with
15 philosophers.
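The growth in states can be reproduced with a naive explicit-state search. The following Java sketch is our much-simplified encoding, not the Promela model: each philosopher takes the left fork, then the right, eats once, and finishes. The search enumerates reachable states and reports whether a deadlock (a non-final state with no enabled move) is reachable:

```java
import java.util.*;

public class Philosophers {
    // Explicit-state search over a simplified dining-philosophers model.
    // Phases per philosopher: 0 = no forks, 1 = holds left fork,
    // 2 = holds both forks, 3 = done (forks released).
    static boolean deadlockFound;

    static int explore(int n) {
        deadlockFound = false;
        Set<String> seen = new HashSet<>();
        Deque<int[]> work = new ArrayDeque<>();
        work.push(new int[n]);                          // all in phase 0
        while (!work.isEmpty()) {
            int[] s = work.pop();
            if (!seen.add(Arrays.toString(s))) continue;
            boolean moved = false, allDone = true;
            for (int p = 0; p < n; p++) {
                if (s[p] != 3) allDone = false;
                int[] t = s.clone();
                if (s[p] == 0 && free(s, p, n)) t[p] = 1;              // take left
                else if (s[p] == 1 && free(s, (p + 1) % n, n)) t[p] = 2; // take right
                else if (s[p] == 2) t[p] = 3;                          // done
                else continue;
                moved = true;
                work.push(t);
            }
            if (!moved && !allDone) deadlockFound = true;  // deadlock state
        }
        return seen.size();
    }

    // Fork f is held by philosopher f (phases 1-2, as left fork) or by
    // philosopher f-1 (phase 2, as right fork).
    static boolean free(int[] s, int f, int n) {
        return s[f] != 1 && s[f] != 2 && s[(f - 1 + n) % n] != 2;
    }

    public static void main(String[] args) {
        for (int n : new int[]{3, 5, 7}) {
            int states = explore(n);
            System.out.println(n + " philosophers: " + states
                    + " states, deadlock=" + deadlockFound);
        }
    }
}
```

The classic all-hold-the-left-fork deadlock is found for every n, and the number of distinct states grows rapidly with n, mirroring the growth reported in the figure caption.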
The known complexity results strongly imply that, in the worst case, no finite state
verification technique can be practical. Worst case complexity results, however, say
nothing about the performance of verification techniques on typical problems. Experience
with a variety of automated techniques tells us a fair amount about what to expect: Many
techniques work very well when applied on well-designed models, within a limited
domain, but no single finite state verification technique gives satisfactory results on all
problems. Moreover, crafting a model that accurately and succinctly captures the
essential structure of a system, and that can be analyzed with reasonable performance
by a given verification tool, requires creativity and insight as well as understanding of the
verification approach used by that tool.
An Illustration of State Space Explosion
[3] It is a useful exercise to try this, because even though the number of reachable states
is quite small, it is remarkably difficult to enumerate them by hand without making
mistakes. Programmers who attempt to devise clever protocols for concurrent operation
face the same difficulty, and if they do not use some kind of automated formal
verification, it is not an exaggeration to say they almost never get it right.
Finite state verification combined with conformance testing can be cheaper and more effective than testing alone. Consider again our simple example of
misapplication of the double-check pattern in Figure 8.2. Tens of thousands of test
executions can fail to reveal the race condition in this code, depending on the way
threads are scheduled on a particular hardware platform and Java virtual machine
implementation. Testing for a discrepancy between model and program, on the other
hand, is fairly straightforward because the model of each individual state machine can be
checked independently (in fact all but one are trivial). The complexity that stymies testing
comes from nondeterministic interleaving of their execution, but this interleaving is
completely irrelevant to conformance testing.
Figure 8.9: A simple data race in Java. The possible ending values of i depend on
how the statement i = i+1 in one thread is interleaved with the same sequence in the
other thread.
Figure 8.10: Coarse and fine-grain models of the same program from Figure 8.9. In
the coarse-grain model, i will be increased by 2, but other outcomes are possible in
the finer grain model in which the shared variable i is loaded into temporary variable
or register, updated locally, and then stored.
Figure 8.11: The lost update problem, in which only one of the two increments
affects the final value of i. The illustrated sequence of operations from the program of
Figure 8.9 can be found using the finer grain model of Figure 8.10, but is not revealed
by the coarser grain model.
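The lost update can be found mechanically by enumerating every interleaving of the fine-grain operations (load, add, store), much as a state space analysis tool does. A compact Java sketch (ours):

```java
import java.util.*;

public class Interleavings {
    // Explore every interleaving of two threads, each executing the
    // fine-grain sequence: tmp = i; tmp = tmp + 1; i = tmp.
    // Collects the possible final values of the shared variable i.
    static Set<Integer> finalValues() {
        Set<Integer> out = new TreeSet<>();
        run(0, 0, 0, 0, 0, out);
        return out;
    }

    // pcA/pcB: next step of each thread (0 = load, 1 = add, 2 = store, 3 = done).
    static void run(int i, int pcA, int tmpA, int pcB, int tmpB, Set<Integer> out) {
        if (pcA == 3 && pcB == 3) { out.add(i); return; }
        if (pcA < 3) {                     // thread A takes the next step
            int ni = i, nt = tmpA;
            if (pcA == 0) nt = i; else if (pcA == 1) nt = nt + 1; else ni = nt;
            run(ni, pcA + 1, nt, pcB, tmpB, out);
        }
        if (pcB < 3) {                     // thread B takes the next step
            int ni = i, nt = tmpB;
            if (pcB == 0) nt = i; else if (pcB == 1) nt = nt + 1; else ni = nt;
            run(ni, pcA, tmpA, pcB + 1, nt, out);
        }
    }

    public static void main(String[] args) {
        // The coarse-grain model predicts only 2; the fine-grain model
        // also reveals the lost update, with final value 1.
        System.out.println(finalValues()); // [1, 2]
    }
}
```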
Even representing each memory access as an individual action is not always sufficient.
Programming language definitions usually allow compilers to perform some
rearrangements in the order of instructions. What appears to be a simple store of a
value into a memory cell may be compiled into a store into a local register, with the
actual store to memory appearing later (or not at all, if the value is replaced first). Two
loads or stores to different memory locations may also be reordered for reasons of
efficiency. Moreover, when a machine instruction to store a value into memory is
executed by a parallel or distributed computer, the value may initially be placed in the
cache memory of a local processor, and only later written into a memory area accessed
by other processors. These reorderings are not under programmer control, nor are they
directly visible, but they can lead to subtle and unpredictable failures in multi-threaded
programs.
As an example, consider once again the flawed program of Figure 8.2. Suppose we
corrected it to use the double-check idiom only for lazy initialization and not for updates
of the data structure. It would still be wrong, and unfortunately it is unlikely we would
discover the flaw through finite state verification. Our model in Promela assumes that
memory accesses occur in the order given in the Java program, but Java does not
guarantee that they will be executed in that order. In particular, while the programmer
may assume that initialization invoked in line 18 of the Java program is completed before
field ref is set in line 19, Java makes no such guarantee.
Breaking sequences of operations into finer pieces exacerbates the state explosion
problem, but as we have seen, making a model too coarse risks failure to detect some
possible errors. Moreover, conformance testing may not be much help in determining
whether a model depends on unjustified assumptions of atomicity. Interruptions in a
sequence of program operations that are mistakenly modeled as an atomic action may
not only be extremely rare and dependent on uncontrolled features of the execution
environment, such as system load or the activity of connected devices, but may also
depend on details of a particular language compiler.
Conformance testing is not generally effective in detecting that a finite state model of a
program relies on unwarranted assumptions of atomicity and ordering of memory
accesses, particularly when those assumptions may be satisfied by one compiler or
machine (say, in the test environment) and not by another (as in the field). Tools for
extracting models, or for generating code from models, have a potential advantage in
that they can be constructed to assume no more than is actually guaranteed by the
programming language.
Many state space analysis tools will attempt to dynamically determine when a sequence
of operations in one process can be treated as if it were atomic without affecting the
results of analysis. For example, the Spin verification tool uses a technique called partial
order reduction to recognize when the next event from one process can be freely
reordered with the next event from another, so only one of the orders need be checked.
Many finite state verification tools provide analogous facilities, and though they cannot
completely compensate for the complexity of a model that is more fine-grained than
necessary, they reduce the penalty imposed on the cautious model-builder.
The extensional representation, given above, lists the elements of the set. The same set can be represented intensionally as

   { x ∈ N | x mod 2 = 0 ∧ 0 < x < 20 }
The predicate x mod 2 = 0 ∧ 0 < x < 20, which is true for elements included in the set
and false for excluded elements, is called a characteristic function. The length of the
representation of the characteristic function does not necessarily grow with the size of
the set it describes. For example, the set

   { x ∈ N | x mod 2 = 0 ∧ 0 < x < 80 }
contains four times as many elements as the one above, and yet the length of the
representation is the same.
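In a program, a characteristic function is simply a predicate, and its textual size is independent of how many elements it admits; a Java illustration (ours):

```java
import java.util.function.IntPredicate;

public class CharacteristicFunction {
    // Intensional representation of "even numbers strictly between 0 and
    // bound": the predicate's text does not grow with the set's size.
    static IntPredicate evensBelow(int bound) {
        return x -> x % 2 == 0 && 0 < x && x < bound;
    }

    // Count the elements the characteristic function admits (extensional size).
    static int count(IntPredicate p, int limit) {
        int n = 0;
        for (int x = 0; x < limit; x++) if (p.test(x)) n++;
        return n;
    }

    public static void main(String[] args) {
        // A larger bound admits many more elements at no cost in
        // representation size.
        System.out.println(count(evensBelow(20), 100)); // 9
        System.out.println(count(evensBelow(80), 100)); // 39
    }
}
```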
It could be advantageous to use similarly compact representations for sets of reachable
states and transitions among them. For example, ordered binary decision diagrams
(OBDDs) are a representation of Boolean functions that can be used to describe the
characteristic function of a transition relation. Transitions in the model state space are
pairs of states (the state before and the state after executing the transition), and the
Boolean function represented by the OBDD takes a pair of state descriptions and
returns True exactly if there is a transition between such a pair of states. The OBDD is
built by an iterative procedure that corresponds to a breadth-first expansion of the state
space (i.e., creating a representation of the whole set of states reachable in k + 1 steps
from the set of states reachable in k steps). If the OBDD representation does not grow
too large to be manipulated in memory, it stabilizes when all the transitions that can
occur in the next step are already represented in the OBDD form.
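The breadth-first fixpoint construction can be illustrated with an explicit bit-set representation of the state set; an OBDD plays the same role far more compactly. A Java sketch (ours):

```java
import java.util.BitSet;

public class Reachability {
    // Compute the states reachable from 'init' by iterating the successor
    // relation until the set stabilizes (a least-fixpoint computation,
    // analogous to the iterative OBDD construction).
    static BitSet reachable(int[][] successors, int init) {
        BitSet reached = new BitSet();
        reached.set(init);
        boolean changed = true;
        while (changed) {                  // stops when no new state appears
            changed = false;
            BitSet next = (BitSet) reached.clone();
            for (int s = reached.nextSetBit(0); s >= 0; s = reached.nextSetBit(s + 1))
                for (int t : successors[s])
                    if (!next.get(t)) { next.set(t); changed = true; }
            reached = next;
        }
        return reached;
    }

    public static void main(String[] args) {
        // Tiny transition relation: 0->1, 1->2, 2->0, 3->4; states 3 and 4
        // are unreachable from 0.
        int[][] succ = {{1}, {2}, {0}, {4}, {}};
        System.out.println(reachable(succ, 0)); // {0, 1, 2}
    }
}
```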
Finding a compact intensional representation of the model state space is not, by itself,
enough. In addition we must have an algorithm for determining whether that set satisfies
the property we are checking. For example, an OBDD can be used to represent not only
the transition relation of a set of communicating state machines, but also a class of
Figure 8.12: Ordered binary decision diagram (OBDD) encoding of the Boolean proposition a ∨ b ∧ c, which is equivalent to a ∨ (b ∧ c). The formula and OBDD structure can be thought of as a function from the Boolean values of a, b, and c to a single Boolean value True or False.
However, M1 is only an approximation of the real system, and we find that verification reports a violation of P because of some execution sequences that are possible in M1 but not in the real system. In the first approach, we examine the counter-example (an execution trace of M1 that violates P but is impossible in the real system) and create a new model M2 that is more precise in a way that will eliminate that particular execution trace (and many similar traces). We attempt verification again with the refined model:

   M2 ⊨ P

If verification fails again, we repeat the process to obtain a new model M3, and so on, until verification succeeds with some "good enough" model Mk or we obtain a counter-example that corresponds to an execution of the actual program.
One kind of model that can be iteratively refined in this way is Boolean programs. The
initial Boolean program model of an (ordinary) program omits all variables; branches (if,
while, etc.) refer to a dummy Boolean variable whose value is unknown. Boolean
programs are refined by adding variables, with assignments and tests - but only Boolean
variables. For instance, if a counter-example produced by trying to verify a property of a
pump controller shows that the waterLevel variable cannot be ignored, a Boolean
program might be refined by adding a Boolean variable corresponding to a predicate in
which waterLevel is tested (say, waterLevel < highLimit), rather than adding the variable
waterLevel itself. For some kinds of interprocedural control flow analysis, it is possible
to completely automate the step of choosing additional Boolean variables to refine Mi
into Mi+1 and eliminate some spurious executions.
In the second approach, M remains fixed,[4] but premises that constrain executions to be
checked are added to the property P. When bogus behaviors of M violate P, we add a constraint C1 to rule them out and try the modified verification problem:

   M ⊨ C1 ⇒ P

If the modified verification problem fails because of additional bogus behaviors, we try again with new constraints C2:

   M ⊨ (C1 ∧ C2) ⇒ P
[4]In
- Exclude self loops from "links" relations; that is, specify that a page should not be directly linked to itself.
- Allow at most one type of link between two pages. Note that relations need not be symmetric; that is, the relation between A and B is distinct from the relation between B and A, so there can be a link of type private from A to B and a link of type public from B back to A.
- Require the Web site to be connected; that is, require that there be at least one way of following links from the home page to each other page of the site.
A data model can be visualized as a diagram with nodes corresponding to sets and
edges representing relations, as in Figure 8.14.
A ∪ B = B ∪ A                            commutative law
A ∩ B = B ∩ A                                " "
(A ∪ B) ∪ C = A ∪ (B ∪ C)                associative law
(A ∩ B) ∩ C = A ∩ (B ∩ C)                    " "
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)          distributive law etc.
These and many other laws together make up relational algebra, which is used
extensively in database processing and has many other uses.
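The laws are easy to spot-check on concrete sets, with union and intersection implemented via addAll and retainAll (a Java illustration, ours):

```java
import java.util.*;

public class SetLaws {
    // Union and intersection over integer sets, used to spot-check the
    // commutative, associative, and distributive laws on examples.
    static Set<Integer> union(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a); r.addAll(b); return r;
    }
    static Set<Integer> inter(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a); r.retainAll(b); return r;
    }
    public static void main(String[] args) {
        Set<Integer> a = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> b = new TreeSet<>(Arrays.asList(2, 3, 4));
        Set<Integer> c = new TreeSet<>(Arrays.asList(3, 4, 5));
        System.out.println(union(a, b).equals(union(b, a)));        // commutative: true
        System.out.println(union(union(a, b), c)
                .equals(union(a, union(b, c))));                    // associative: true
        System.out.println(inter(a, union(b, c))
                .equals(union(inter(a, b), inter(a, c))));          // distributive: true
    }
}
```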
It would be inconvenient to write down a data model directly as a collection of
mathematical formulas. Instead, we use some notation whose meaning is the same as
the mathematical formulas, but is easier to write, maintain, and comprehend. Alloy is
one such modeling notation, with the additional advantage that it can be processed by a
finite state verification tool.
The definition of the data model as sets and relations can be formalized and verified with
relational algebra by specifying signatures and constraints. Figure 8.15 presents a
formalization of the data model of the Web site in Alloy. Keyword sig (signature)
identifies three sets: Page, User, and Site. The definition of set Page also defines
linksMain (maintenance links). The definition of User also defines a relation between
users and pages. User is partitioned into three disjoint sets (Administrator, Registered,
and Unregistered). The definition of Site aggregates pages into the site and identifies the
home page. Site is defined static since it is a fixed classification of objects.
 1 module WebSite
 2
 3 // Pages include three disjoint sets of links
 4 sig Page{ disj linksPriv, linksPub, linksMain: set Page }
 5 // Each type of link points to a particular class of page
 6 fact connPub{ all p: Page, s: Site | p.linksPub in s.unres }
 7 fact connPriv{ all p: Page, s: Site | p.linksPriv in s.res }
 8 fact connMain{ all p: Page, s: Site | p.linksMain in s.main }
 9 // Self loops are not allowed
10 fact noSelfLoop{ no p: Page| p in p.linksPriv+p.linksPub+p.linksMain }
11
12 // Users are characterized by the set of pages that they can access
13 sig User{ pages: set Page }
14 // Users are partitioned into three sets
15 part sig Administrator, Registered, Unregistered extends User{}
16 // Unregistered users can access only the home page, and unrestricted pages
17 fact accUnregistered{
18    all u: Unregistered, s: Site| u.pages = (s.home+s.unres)}
19 // Registered users can access home, restricted and unrestricted pages
20 fact accRegistered{
21    all u: Registered, s: Site|
22       u.pages = (s.home+s.res+s.unres)
23 }
24 // Administrators can access all pages
25 fact accAdministrator{
26    all u: Administrator, s: Site|
27       u.pages = (s.home+s.res+s.unres+s.main)
28 }
29
30 // A web site includes one home page and three disjoint sets
31 // of pages: restricted, unrestricted and maintenance
[Lines 32-39: the definition of the Site signature (home page plus disjoint page sets res, unres, and main) is illegible in this copy.]
Figure 8.15: Alloy model of a Web site with different kinds of pages, users, and
access rights (data model part). Continued in Figure 8.16.
module WebSite
...

// We consider one Web site that includes one home page
// and some other pages
fun initSite() {
    one s: Site | one s.home and
        some s.res and
        some s.unres and
        some s.main
}

// We consider one administrator and some registered and unregistered users
fun initUsers() { one Administrator and
    some Registered and
    some Unregistered }

fun init() {
    initSite() and initUsers()
}

// ANALYSIS

// Verify if there exists a solution
// with sets of cardinality at most 5
run init for 5

// Check if unregistered users can visit all unrestricted pages,
// i.e., all unrestricted pages are connected to the home page by
// at least a path of public links.
assert browsePub{
    all p: Page, s: Site | p in s.unres implies s.home in p.*~linksPub
}
check browsePub for 3
Figure 8.16: Alloy model of a Web site with different kinds of pages, users, and
access rights, continued from Figure 8.15.
The keyword fact introduces constraints.[6] The constraints connPub, connPriv, and
connMain restrict the targets of the links relations, while noSelfLoop excludes links from a
page to itself. The constraints accAdministrator, accRegistered, and accUnregistered
map users to pages. The constraint that follows the definition of Site forces the Web site
to be connected by requiring each page to belong to the transitive closure of links
starting from the home page (operator ^).
A relational algebra specification may be over- or underconstrained. Overconstrained
specifications are not satisfiable by any implementation, while underconstrained
specifications allow undesirable implementations; that is, implementations that violate
important properties.
In general, specifications identify infinite sets of solutions, each characterized by a
different set of objects and relations (e.g., the infinite set of Web sites with different sets
of pages, users and correct relations among them). Thus in general, properties of a
relational specification are undecidable because proving them would require examining
an infinite set of possible solutions. While attempting to prove absence of a solution may
be inconclusive, often a (counter) example that invalidates a property can be found within
a finite set of small models.
We can verify a specification over a finite set of solutions by limiting the cardinality of the
sets. In the example, we first verify that the model admits solutions for sets with at most
five elements (run init for 5 issued after an initialization of the system.) A positive
outcome indicates that the specification is not overconstrained - there are no logical
contradictions. A negative outcome would not allow us to conclude that no solution
exists, but tells us that no "reasonably small" solution exists.
We then verify that the example is not underconstrained with respect to property
browsePub that states that unregistered users must be able to visit all unrestricted
pages by accessing the site from the home page. The property is asserted by requiring
that all unrestricted pages belong to the reflexive transitive closure of the linksPub relation
from the home page (here we use operator * instead of ^ because the home page is
included in the closure). If we check whether the property holds for sets with at most
three elements (check browsePub for 3) we obtain a counter-example like the one
shown in Figure 8.17, which shows how the property can be violated.
Figure 8.17: A Web site that violates the "browsability" property, because public
page Page_2 is not reachable from the home page using only unrestricted links. This
diagram was generated by the Alloy tool.
The simple Web site in the example consists of two unrestricted pages (Page_1, the
home page, and Page_2), one restricted page (Page_0), and one unregistered user
(User_2). User_2 cannot visit one of the unrestricted pages (Page_2) because the only
path from the home page to Page_2 goes through the restricted page Page_0. The
property is violated because unrestricted browsing paths can be "interrupted" by
restricted pages or pages under maintenance, for example, when a previously
unrestricted page is reserved or disabled for maintenance by the administrator.
The problem appears only when there are public links from maintenance or reserved
pages, as we can check by excluding them:
fact descendant{
    all p: Page, s: Site | p in s.main+s.res implies no p.linksPub
}
With this constraint added, the check finds no counterexample in a space of cardinality
3. We cannot conclude that no larger counterexample exists, but we may be satisfied that
there is no reason to expect this property to be violated only in larger models.
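The violated browsability property is, at bottom, a reachability question over public links only. The following sketch (our own construction, not the Alloy tool's; the page names match Figure 8.17) computes the set of pages an unregistered user can browse with a breadth-first search:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BrowsabilityCheck {
    // Pages reachable from home following only public links (BFS).
    public static Set<String> publicReach(Map<String, List<String>> linksPub, String home) {
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        seen.add(home);
        todo.add(home);
        while (!todo.isEmpty()) {
            String p = todo.remove();
            for (String q : linksPub.getOrDefault(p, List.of()))
                if (seen.add(q)) todo.add(q);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Only the restricted page Page_0 has a public link to Page_2,
        // so Page_2 is not publicly reachable from the home page Page_1.
        Map<String, List<String>> linksPub = new HashMap<>();
        linksPub.put("Page_0", List.of("Page_2"));
        Set<String> reach = publicReach(linksPub, "Page_1");
        System.out.println(reach.contains("Page_2"));   // prints "false"
    }
}
```

Here the unrestricted page Page_2 lies outside the public-reach set, which is exactly the situation the Alloy analysis reports as a counterexample.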
Summary
Finite state verification techniques fill an important niche in verifying critical properties of
programs. They are particularly crucial where nondeterminism makes program testing
ineffective, as in concurrent execution. In principle, finite state verification of concurrent
execution and of data models can be seen as systematically exploring an enormous
space of possible program states. From a user's perspective, the challenge is to
construct a suitable model of the software that can be analyzed with reasonable
expenditure of human and computational resources, captures enough significant detail
for verification to succeed, and can be shown to be consistent with the actual software.
Further Reading
There is a large literature on finite state verification techniques reaching back at least to
the 1960s, when Bartlett et al. [BSW69] employed what is recognizably a manual
version of state space exploration to justify the correctness of a communication
protocol. A number of early state space verification tools were developed initially for
communication protocol verification, including the Spin tool. Holzmann's journal
description of Spin's design and use [Hol97], though now somewhat out of date, remains
an adequate introduction to the approach, and a full primer and reference manual
[Hol03] is available in book form.
The ordered binary decision diagram representation of Boolean functions, used in the
first symbolic model checkers, was introduced by Randal Bryant [Bry86]. The
representation of transition relations as OBDDs in this chapter is meant to illustrate
basic ideas but is simplified and far from complete; Bryant's survey paper [Bry92] is a
good source for understanding applications of OBDDs, and Huth and Ryan [HR00]
provide a thorough and clear step-by-step description of how OBDDs are used in the
SMV symbolic model checker.
Model refinement based on iterative refinements of an initial coarse model was
introduced by Ball and Rajamani in the tools Slam [BR01a] and Bebop [BR01b], and by
Henzinger and his colleagues in Blast [HJMS03]. The complementary refinement
approach of FLAVERS was introduced by Dwyer and colleagues [DCCN04].
Automated analysis of relational algebra for data modeling was introduced by Daniel
Jackson and his students with the Alloy notation and associated tools [Jac02].
Exercises
8.1 We stated, on the one hand, that finite state verification falls between basic flow
analysis and formal verification in power and cost, but we also stated that finite
state verification techniques are often designed to provide results that are
tantamount to formal proofs of program properties. Are these two statements
contradictory? If not, how can a technique that is less powerful than formal
verification produce results that are tantamount to formal proofs?
8.3
8.4 A property like "if the button is pressed, then eventually the elevator will come" is
classified as a liveness property. However, the stronger real-time version "if the
button is pressed, then the elevator will arrive within 30 seconds" is technically a
safety property rather than a liveness property. Why?
[6] The order in which relations and constraints are given is irrelevant. We list constraints
after the relations they refer to.
Chapter List
Chapter 9: Test Case Selection and Adequacy
Chapter 10: Functional Testing
Chapter 11: Combinatorial Testing
Chapter 12: Structural Testing
Chapter 13: Data Flow Testing
Chapter 14: Model-Based Testing
Chapter 15: Testing Object-Oriented Software
Chapter 16: Fault-Based Testing
Chapter 17: Test Execution
Chapter 18: Inspection
Chapter 19: Program Analysis
Required Background
Chapter 2
The fundamental problems and limitations of test case selection are a
consequence of the undecidability of program properties. A grasp of the basic
problem is useful in understanding Section 9.3.
9.1 Overview
Experience suggests that software that has passed a thorough set of
systematic tests is likely to be more dependable than software that
has been only superficially or haphazardly tested. Surely we should
require that each software module or subsystem undergo thorough,
systematic testing before being incorporated into the main product.
But what do we mean by thorough testing? What is the criterion by
which we can judge the adequacy of a suite of tests that a software
artifact has passed?
Ideally, we should like an "adequate" test suite to be one that
ensures correctness of the product. Unfortunately, that goal is not
attainable. The difficulty of proving that some set of test cases is
adequate in this sense is equivalent to the difficulty of proving that
the program is correct. In other words, we could have "adequate"
testing in this sense only if we could establish correctness without
any testing at all.
In practice we settle for criteria that identify inadequacies in test
suites. For example, if the specification describes different treatment
in two cases, but the test suite does not check that the two cases
are in fact treated differently, then we may conclude that the test
suite is inadequate to guard against faults in the program logic. If no
test in the test suite executes a particular program statement, we
might similarly conclude that the test suite is inadequate to guard
against faults in that statement. We may use a whole set of
(in)adequacy criteria, each of which draws on some source of
information about the program and imposes a set of obligations that
an adequate set of test cases ought to satisfy. If a test suite fails to
satisfy some criterion, the obligation that has not been satisfied may
provide some useful information about improving the test suite. If a
set of test cases satisfies all the obligations by all the criteria, we still
do not know definitively that it is a well-designed and effective test
suite, but we have at least some evidence of its thoroughness.
Throughout this book we use a small set of testing terms in a consistent manner.
Unfortunately, the terms we will need are not always used consistently in the literature,
despite the existence of an IEEE standard that defines several of them. The terms we
will use are defined as follows.
Test case A test case is a set of inputs, execution conditions, and a pass/fail
criterion. (This usage follows the IEEE standard.)
Test case specification A test case specification is a requirement to be satisfied by
one or more actual test cases. (This usage follows the IEEE standard.)
Test obligation A test obligation is a partial test case specification, requiring some
property deemed important to thorough testing. We use the term obligation to
distinguish the requirements imposed by a test adequacy criterion from more
complete test case specifications.
Test suite A test suite is a set of test cases. Typically, a method for functional
testing is concerned with creating a test suite. A test suite for a program, system, or
individual unit may be made up of several test suites for individual modules,
subsystems, or features. (This usage follows the IEEE standard.)
Test or test execution We use the term test or test execution to refer to the activity
of executing test cases and evaluating their results. When we refer to "a test," we
mean execution of a single test case, except where context makes it clear that the
reference is to execution of a whole test suite. (The IEEE standard allows this and
other definitions.)
Adequacy criterion A test adequacy criterion is a predicate that is true (satisfied) or
false (not satisfied) of a ⟨program, test suite⟩ pair. Usually a test adequacy
criterion is expressed in the form of a rule for deriving a set of test obligations from
another artifact, such as a program or specification. The adequacy criterion is then
satisfied if every test obligation is satisfied by at least one test case in the suite.
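The definition above can be sketched as a predicate over obligations and test cases. The code below is our own illustrative construction (none of the names come from any particular tool): an adequacy criterion holds when every obligation is satisfied by at least one test case in the suite.

```java
import java.util.Collection;
import java.util.List;
import java.util.function.BiPredicate;

public class Adequacy {
    // True iff every obligation is satisfied by at least one test case.
    public static <O, T> boolean adequate(Collection<O> obligations,
                                          Collection<T> suite,
                                          BiPredicate<O, T> satisfies) {
        for (O ob : obligations) {
            boolean met = false;
            for (T tc : suite)
                if (satisfies.test(ob, tc)) { met = true; break; }
            if (!met) return false;     // an unsatisfied obligation: inadequate
        }
        return true;
    }

    public static void main(String[] args) {
        // Toy criterion: the suite must contain inputs of lengths 0, 1, and 5.
        List<Integer> obligations = List.of(0, 1, 5);
        List<String> suite = List.of("", "x", "hello");
        System.out.println(adequate(obligations, suite,
                (len, tc) -> tc.length() == len));   // prints "true"
    }
}
```

An unsatisfied obligation, as the text notes, is itself useful information: it points at a concrete test case still missing from the suite.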
Test specifications drawn from program source code require coverage of particular
elements in the source code or some model derived from it. For example, we might
require a test case that traverses a loop one or more times. The general term for testing
based on program structure is structural testing, although the term white-box testing or
glass-box testing is sometimes used.
Previously encountered faults can be an important source of information regarding useful
test cases. For example, if previous products have encountered failures or security
breaches due to buffer overflows, we may formulate test requirements specifically to
check handling of inputs that are too large to fit in provided buffers. These fault-based
test specifications usually draw also from interface specifications, design models, or
source code, but add test requirements that might not have been otherwise considered.
A common form of fault-based testing is fault-seeding, purposely inserting faults in
source code and then measuring the effectiveness of a test suite in finding the seeded
faults, on the theory that a test suite that finds seeded faults is likely also to find other
faults.
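Fault seeding can be shown in miniature. This toy sketch (our own construction) seeds a fault as a mutated operation and asks whether any test case distinguishes the mutant from the original, i.e., whether the suite "kills" it:

```java
import java.util.function.IntBinaryOperator;

public class FaultSeeding {
    // True iff some test case produces different results on original and mutant.
    public static boolean killsMutant(IntBinaryOperator original,
                                      IntBinaryOperator mutant,
                                      int[][] suite) {
        for (int[] tc : suite)
            if (original.applyAsInt(tc[0], tc[1]) != mutant.applyAsInt(tc[0], tc[1]))
                return true;    // this test case reveals the seeded fault
        return false;
    }

    public static void main(String[] args) {
        IntBinaryOperator max = Math::max;
        IntBinaryOperator mutant = (a, b) -> a;   // seeded fault: always returns a
        int[][] weakSuite = {{5, 3}, {7, 2}};     // a > b everywhere: mutant survives
        int[][] betterSuite = {{5, 3}, {2, 7}};
        System.out.println(killsMutant(max, mutant, weakSuite));    // prints "false"
        System.out.println(killsMutant(max, mutant, betterSuite));  // prints "true"
    }
}
```

A suite that lets seeded faults survive, on the theory stated above, is also suspect with respect to real faults.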
Test specifications need not fall cleanly into just one of the categories. For example, test
specifications drawn from a model of a program might be considered specification-based
if the model is produced during program design, or structural if it is derived from
the program source code.
Consider the Java method of Figure 9.1. We might apply a general rule that requires
using an empty sequence wherever a sequence appears as an input; we would thus
create a test case specification (a test obligation) that requires the empty string as
input.[1] If we are selecting test cases structurally, we might create a test obligation that
requires the first clause of the if statement on line 15 to evaluate to true and the second
clause to evaluate to false, and another test obligation on which it is the second clause
that must evaluate to true and the first that must evaluate to false.
 1  /**
 2   * Remove/collapse multiple spaces.
 3   *
 4   * @param String string to remove multiple spaces from.
 5   * @return String
 6   */
 7  public static String collapseSpaces(String argStr)
 8  {
 9      char last = argStr.charAt(0);
10      StringBuffer argBuf = new StringBuffer();
11
12      for (int cIdx=0; cIdx < argStr.length(); cIdx++)
13      {
14          char ch = argStr.charAt(cIdx);
15          if (ch != ' ' || last != ' ')
16          {
17              argBuf.append(ch);
18              last = ch;
19          }
20      }
21
22      return argBuf.toString();
23  }
Figure 9.1: A Java method for collapsing sequences of blanks, excerpted from the
StringUtils class of Velocity version 1.3.1, an Apache Jakarta project. Apache
Group, used by permission.
[1] Constructing and using catalogs of general rules like this is described in Chapter 10.
A test specification requiring only that every statement of the method be executed can be
met by a test suite containing a single test case, with the input value (value of argStr)
being "doesn'tEvenHaveSpaces." Requiring both the true and false branches of each
test to be taken subsumes the previous criterion and forces us to at least provide an
input with a space that is not copied to the output, but it can still be satisfied by a suite
with just one test case. We might add a requirement that the loop be iterated zero
times, once, and several times, thus requiring a test suite with at least three test cases.
The obligation to execute the loop body zero times would force us to add a test case
with the empty string as input, and like the specification-based obligation to consider an
empty sequence, this would reveal a fault in the code.
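A small harness (our own construction; only the method itself is transcribed from Figure 9.1) makes these loop-iteration obligations concrete. The zero-iteration obligation, the empty string, triggers the fault at charAt(0):

```java
public class CollapseSpacesTest {

    // Method under test, transcribed from Figure 9.1.
    static String collapseSpaces(String argStr) {
        char last = argStr.charAt(0);              // fails on the empty string
        StringBuffer argBuf = new StringBuffer();
        for (int cIdx = 0; cIdx < argStr.length(); cIdx++) {
            char ch = argStr.charAt(cIdx);
            if (ch != ' ' || last != ' ') {
                argBuf.append(ch);
                last = ch;
            }
        }
        return argBuf.toString();
    }

    public static void main(String[] args) {
        // Many iterations: runs of blanks collapse to single blanks.
        assert collapseSpaces("a  b   c").equals("a b c");
        // One iteration: a single character passes through unchanged.
        assert collapseSpaces("x").equals("x");
        // Zero iterations: the empty string reveals the fault.
        boolean faultRevealed = false;
        try {
            collapseSpaces("");
        } catch (StringIndexOutOfBoundsException e) {
            faultRevealed = true;
        }
        System.out.println(faultRevealed);   // prints "true"
    }
}
```

(Run with `java -ea` so the assertions are checked.)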
Should we consider a more demanding adequacy criterion, as indicated by the
subsumes relation among criteria, to be a better criterion? The answer would be "yes" if
we were comparing the guarantees provided by test adequacy criteria: If criterion A
subsumes criterion B, and if any test suite satisfying B in some program is guaranteed to
find a particular fault, then any test suite satisfying A is guaranteed to find the same fault
in the program. This is not as good as it sounds, though. Twice nothing is nothing.
Adequacy criteria do not provide useful guarantees for fault detection, so comparing
guarantees is not a useful way to compare criteria.
A better statistical measure of test effectiveness is whether the probability of finding at
least one program fault is greater when using one test coverage criterion than another.
Of course, such statistical measures can be misleading if some test coverage criteria
require much larger numbers of test cases than others. It is hardly surprising if a
criterion that requires at least 300 test cases for program P is more effective, on
average, than a criterion that requires at least 50 test cases for the same program. It
would be better to know, if we have 50 test cases that satisfy criterion B, is there any
value in finding 250 test cases to finish satisfying the "stronger" criterion A, or would it
be just as profitable to choose the additional 250 test cases at random?
Although theory does not provide much guidance, empirical studies of particular test
adequacy criteria do suggest that there is value in pursuing stronger criteria, particularly
when the level of coverage attained is very high. Whether the extra value of pursuing a
stronger adequacy criterion is commensurate with the cost almost certainly depends on
a plethora of particulars, and can only be determined by monitoring results in individual
organizations.
It is likely that future theoretical progress must begin with a quite different conception of
the fundamental goals of a theory of test adequacy.
The trend in research is toward empirical, rather than theoretical, comparison of the
effectiveness of particular test selection techniques and test adequacy criteria. Empirical
approaches to measuring and comparing effectiveness are still at an early stage. A
major open problem is to determine when, and to what extent, the results of an empirical
assessment can be expected to generalize beyond the particular programs and test
suites used in the investigation. While empirical studies have to a large extent displaced
theoretical investigation of test effectiveness, in the longer term useful empirical
investigation will require its own theoretical framework.
Further Reading
Goodenough and Gerhart made the original attempt to formulate a theory of "adequate"
testing [GG75]; Weyuker and Ostrand extended this theory to consider when a set of
test obligations is adequate to ensure that a program fault is revealed [WO80].
Gourlay's exposition of a mathematical framework for adequacy criteria is among the
most lucid developments of purely analytic characterizations [Gou83]. Hamlet and Taylor
show that, if one takes statistical confidence in (absolute) program correctness as the
goal, none of the standard coverage testing techniques improve on random testing
[HT90], from which an appropriate conclusion is that confidence in absolute correctness
is not a reasonable goal of systematic testing. Frankl and Iakounenko's study of test
effectiveness [FI98] is a good example of the development of empirical methods for
assessing the practical effectiveness of test adequacy criteria.
Related Topics
Test adequacy criteria and test selection techniques can be categorized by the sources
of information they draw from. Functional testing draws from program and system
specifications, and is described in Chapters 10, 11, and 14. Structural testing draws
from the structure of the program or system, and is described in Chapters 12 and 13.
The techniques for testing object-oriented software described in Chapter 15 draw on
both functional and structural approaches. Selection and adequacy criteria based on
consideration of hypothetical program faults are described in Chapter 16.
Exercises
9.1 Deterministic finite state machines (FSMs), with states representing classes of
program states and transitions representing external inputs and observable
program actions or outputs, are sometimes used in modeling system requirements.
We can design test cases consisting of sequences of program inputs that trigger
FSM transitions and the predicted program actions expected in response. We can
also define test coverage criteria relative to such a model. Which of the following
coverage criteria subsume which others?
State coverage For each state in the FSM model, there is a test case that visits
that state.
Transition coverage For each transition in the FSM model, there is a test case
that traverses that transition.
Path coverage For all finite-length subpaths from a distinguished start state in the
FSM model, there is at least one test case that includes a corresponding subpath.
State-pair coverage For each state r in the FSM model, for each state s
reachable from r along some sequence of transitions, there is at least one test
case that passes through state r and then reaches state s.
9.2 Adequacy criteria may be derived from specifications (functional criteria) or code
(structural criteria). The presence of infeasible elements in a program may make it
impossible to obtain 100% coverage. Since we cannot possibly cover infeasible
elements, we might define a coverage criterion to require 100% coverage of
feasible elements (e.g., execution of all program statements that can actually be
reached in program execution). We have noted that feasibility of program elements
is undecidable in general. Suppose we instead are using a functional test adequacy
criterion, based on logical conditions describing inputs and outputs. It is still
possible to have infeasible elements (logical condition A might be inconsistent with
logical condition B, making the conjunction A ∧ B infeasible). Would you expect
distinguishing feasible from infeasible elements to be easier or harder for functional
criteria, compared to structural criteria? Why?
9.3 Suppose test suite A satisfies adequacy criterion C1. Test suite B satisfies
adequacy criterion C2, and C2 subsumes C1. Can we be certain that faults revealed
by A will also be revealed by B?
10.1 Overview
In testing and analysis aimed at verification[2] - that is, at finding any discrepancies
between what a program does and what it is intended to do - one must obviously refer
to requirements as expressed by users and specified by software engineers. A
functional specification, that is, a description of the expected behavior of the program, is
the primary source of information for test case specification.
Functional testing, also known as black-box or specification-based testing, denotes
techniques that derive test cases from functional specifications. Usually functional testing
techniques produce test case specifications that identify classes of test cases and are
instantiated to produce individual test cases.
The core of functional test case design is partitioning[3] the possible behaviors of the
program into a finite number of homogeneous classes, where each such class can
reasonably be expected to be consistently correct or incorrect. In practice, the test case
designer often must also complete the job of formalizing the specification far enough to
serve as the basis for identifying classes of behaviors. An important side benefit of test
design is highlighting the weaknesses and incompleteness of program specifications.
Deriving functional test cases is an analytical process that decomposes specifications
into test cases. The myriad aspects that must be taken into account during functional
test case specification make the process error prone. Even expert test designers can
miss important test cases. A methodology for functional test design helps by
decomposing the functional test design process into elementary steps. In this way, it is
possible to control the complexity of the process and to separate human intensive
activities from activities that can be automated.
Sometimes, functional testing can be fully automated. This is possible, for example,
when specifications are given in terms of some formal model, such as a grammar or an
extended state machine specification. In these (exceptional) cases, the creative work is
performed during specification and design of the software. The test designer's job is
then limited to the choice of the test selection criteria, which defines the strategy for
generating test case specifications. In most cases, however, functional test design is a
human intensive activity. For example, when test designers must work from informal
specifications written in natural language, much of the work is in structuring the
specification adequately for identifying test cases.
[1] We use the term program generically for the artifact under test, whether that artifact is
a complete application or an individual unit together with a test harness. This is
consistent with usage in the testing research literature.
[2] Here we focus on verification as opposed to validation (introduced in Chapter 2).
Validation, which assesses software behavior and its specifications with respect to the
users' expectations, is treated in Chapter 22.
[3] We are using the term partition in a common but rather sloppy sense. A true partition
would form disjoint classes, the union of which is the entire space. Partition testing
separates the behaviors or input space into classes whose union is the entire space, but
the classes may not be disjoint.
potential input even if "missing case" were included in the catalog of potential faults.
Functional specifications often address semantically rich domains, and we can use domain
information in addition to the cases explicitly enumerated in the program specification. For
example, while a program may manipulate a string of up to nine alphanumeric characters,
the program specification may reveal that these characters represent a postal code, which
immediately suggests test cases based on postal codes of various localities. Suppose the
program logic distinguishes only two cases, depending on whether they are found in a table
of U.S. zip codes. A structural testing criterion would require testing of valid and invalid U.S.
zip codes, but only consideration of the specification and richer knowledge of the domain
would suggest test cases that reveal missing logic for distinguishing between U.S.-bound
mail with invalid U.S. zip codes and mail bound for other countries.
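Domain knowledge of this kind might be captured as an explicit partition. The classes, the zip-code table, and the sample codes below are hypothetical, of our own devising, purely for illustration:

```java
import java.util.Set;

public class PostalCodePartition {
    enum CodeClass { VALID_US_ZIP, INVALID_US_ZIP, FOREIGN_FORMAT }

    // Partition the postal-code domain into classes the specification
    // suggests should be treated differently.
    static CodeClass classify(String code, Set<String> usZipTable) {
        if (usZipTable.contains(code)) return CodeClass.VALID_US_ZIP;
        // Five digits but not in the table: a plausible but invalid US zip.
        if (code.matches("\\d{5}")) return CodeClass.INVALID_US_ZIP;
        // Anything else (e.g., a UK postcode) suggests mail bound for another country.
        return CodeClass.FOREIGN_FORMAT;
    }

    public static void main(String[] args) {
        Set<String> table = Set.of("97403", "10001");
        System.out.println(classify("97403", table));    // prints "VALID_US_ZIP"
        System.out.println(classify("99999", table));    // prints "INVALID_US_ZIP"
        System.out.println(classify("EC1A 1BB", table)); // prints "FOREIGN_FORMAT"
    }
}
```

A program that only distinguishes "in the table" from "not in the table" collapses the last two classes, and only the richer partition would prompt a test case that exposes the missing logic.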
Functional testing can be applied at any level of granularity where some form of
specification is available, from overall system testing to individual units, although the level of
granularity and the type of software influence the choice of the specification styles and
notations, and consequently the functional testing techniques that can be used.
In contrast, structural and fault-based testing techniques are invariably tied to program
structures at some particular level of granularity and do not scale much beyond that level.
The most common structural testing techniques are tied to fine-grain program structures
(statements, classes, etc.) and are applicable only at the level of modules or small
collections of modules (small subsystems, components, or libraries).
As an extreme example, suppose we are allowed to select only three test cases for a
program that breaks a text buffer into lines of 60 characters each. Suppose the first test
case is a buffer containing 40 characters, and the second is a buffer containing 30
characters. As a final test case, we can choose a buffer containing 16 characters or a
buffer containing 100 characters. Although we cannot prove that the 100-character buffer is
the better test case (and it might not be; the fact that 16 is a power of 2 might have some
unforeseen significance), we are naturally suspicious of a set of tests that is strongly biased
toward lengths less than 60.
[Listing of the class roots not reproduced in this excerpt.]
Figure 10.1: The Java class roots, which finds roots of a quadratic
equation. The case analysis in the implementation is incomplete: It
does not properly handle the case in which b² - 4ac = 0 and a ≠ 0.
We cannot anticipate all such faults, but experience teaches that
boundary values identifiable in a specification are disproportionately
valuable. Uniform random generation of even large numbers of test
cases is ineffective at finding the fault in this program, but selection
of a few "special values" based on the specification quickly uncovers
it.
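For illustration, here is a sketch of such an incomplete case analysis (our own reconstruction, not the book's actual roots listing) together with the boundary value that exposes it:

```java
public class Roots {
    // Quadratic root finder with a deliberately incomplete case analysis:
    // the double-root case (discriminant exactly zero) is mishandled.
    static double[] roots(double a, double b, double c) {
        double disc = b * b - 4 * a * c;
        if (disc > 0) {
            double q = Math.sqrt(disc);
            return new double[]{(-b + q) / (2 * a), (-b - q) / (2 * a)};
        }
        if (disc < 0) return new double[0];   // no real roots
        // Missing case: disc == 0 falls through and is reported as "no roots".
        return new double[0];
    }

    public static void main(String[] args) {
        // The boundary value b*b == 4*a*c (e.g., a=1, b=2, c=1) exposes the fault:
        // the double root -1 exists, but no roots are reported.
        System.out.println(Roots.roots(1, 2, 1).length);   // prints "0"
        System.out.println(Roots.roots(1, 0, -1).length);  // prints "2"
    }
}
```

Randomly generated coefficients almost never satisfy b² = 4ac exactly, which is why the "special value" on the boundary finds the fault where random testing does not.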
Of course, it is unlikely that anyone would test only with random values. Regardless of the
overall testing strategy, most test designers will also try some "special" values. The test
designer's intuition comports with the observation that random sampling is an ineffective
way to find singularities in a large input space. The observation about singularities can be
generalized to any characteristic of input data that defines an infinitesimally small portion of
the complete input data space. If again we have just three real-valued inputs a, b, and c,
there is an infinite number of choices for which b = c, but random sampling is unlikely to
generate any of them because they are an infinitesimal part of the complete input data
space.
The observation about special values and random samples is by no means limited to
numbers. Consider again, for example, breaking a text buffer into lines. Since line breaks
are permitted at blanks, we would consider blanks a "special" value for this problem. While
random sampling from the character set is likely to produce a buffer containing a sequence
of at least 60 nonblank characters, it is much less likely to produce a sequence of 60
blanks.
Partition testing typically increases the cost of each test case, since
in addition to generation of a set of classes, creation of test cases
from each class may be more expensive than generating random
test data. In consequence, partition testing usually produces fewer
test cases than random testing for the same expenditure of time
and money. Partitioning can therefore be advantageous only if the
average value (fault detection effectiveness) is greater.
If we were able to group together test cases with such perfect knowledge that the outcome
of test cases in each class were uniform (either all successes or all failures), then partition
testing would be at its theoretical best. In general we cannot do that, nor can we even
quantify the uniformity of classes of test cases. Partitioning by any means, including
specification-based partition testing, is always based on experience and judgment that
leads one to believe that certain classes of test case are "more alike" than others, in the
sense that failure-prone test cases are likely to be concentrated in some classes. When we
appealed earlier to the test designer's intuition that one should try boundary cases and
special values, we were actually appealing to a combination of experience (many failures
occur at boundary and special cases) and knowledge that identifiable cases in the
specification often correspond to classes of input that require different treatment by an
implementation.
Given a fixed budget, the optimum may not lie in only partition testing or only random
testing, but in some mix that makes use of available knowledge. For example, consider
again the simple numeric problem with three inputs, a, b, and c. We might consider a few
special cases of each input, individually and in combination, and we might consider also a
few potentially significant relationships (e.g., a = b). If no faults are revealed by these few
test cases, there is little point in producing further arbitrary partitions - one might then turn
to random generation of a large number of test cases.
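Such a mixed strategy can be sketched as follows. This is an illustrative sketch only: the particular special values, the relationship a = b, and the generator interface are assumptions for the example, not prescriptions from the text.

```python
import itertools
import random

SPECIALS = [-1, 0, 1]  # hypothetical "special" values for each input


def mixed_suite(n_random, lo=-10**6, hi=10**6, seed=0):
    """Combine a few systematic partition-based cases with a bulk of
    random cases for the three-input numeric problem."""
    rng = random.Random(seed)
    # Phase 1: all combinations of special values for a, b, and c.
    suite = [dict(zip("abc", t))
             for t in itertools.product(SPECIALS, repeat=3)]
    # Phase 1b: a few cases enforcing a potentially significant
    # relationship, here a = b.
    for _ in range(3):
        v = rng.randint(lo, hi)
        suite.append({"a": v, "b": v, "c": rng.randint(lo, hi)})
    # Phase 2: spend the remaining budget on random generation.
    suite += [{k: rng.randint(lo, hi) for k in "abc"}
              for _ in range(n_random)]
    return suite
```

The systematic phase is small and fixed (27 special-value combinations plus a handful of relationship cases), so almost the entire budget goes to the random phase.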
[4]Note that the relative value of different test cases would be quite different if the goal
were to measure dependability rather than to find and repair faults.
Identification of functional features that can be tested separately is different from module
decomposition. In both cases we apply the divide and conquer principle, but in the
former case, we partition specifications according to the functional behavior as
perceived by the users of the software under test,[5] while in the latter, we identify logical
units that can be implemented separately. For example, a Web site may require a sort
function, as a service routine, that does not correspond to an external functionality. The
sort function may be a functional feature at module testing, when the program under test
is the sort function itself, but is not a functional feature at system test, while deriving test
cases from the specifications of the whole Web site. On the other hand, the registration
of a new user profile can be identified as one of the functional features at system-level
testing, even if such functionality is spread across several modules. Thus, identifying
functional features does not correspond to identifying single modules at the design level,
but rather to suitably slicing the specifications to attack their complexity incrementally.
Independently testable features are described by identifying all the inputs that form their
execution environments. Inputs may be given in different forms depending on the notation
used to express the specifications. In some cases they may be easily identifiable. For
example, they can be the input alphabet of a finite state machine specifying the behavior
of the system. In other cases, they may be hidden in the specification. This is often the
case for informal specifications, where some inputs may be given explicitly as
parameters of the functional unit, but other inputs may be left implicit in the description.
For example, a description of how a new user registers at a Web site may explicitly
indicate the data that constitutes the user profile to be inserted as parameters of the
functional unit, but may leave implicit the collection of elements (e.g., database) in which
the new profile must be inserted.
Trying to identify inputs may help in distinguishing different functions. For example, trying
to identify the inputs of a graphical tool may lead to a clearer distinction between the
graphical interface per se and the associated callbacks to the application. With respect
to the Web-based user registration function, the data to be inserted in the database are
part of the execution environment of the functional unit that performs the insertion of the
user profile, while the combination of fields that can be used to construct such data is
part of the execution environment of the functional unit that takes care of the
management of the specific graphical interface.
Identify Representative Classes of Values or Derive a Model The execution
environment of the feature under test determines the form of the final test cases, which
are given as combinations of values for the inputs to the unit. The next step of a testing
process consists of identifying which values of each input should be selected to form test
cases. Representative values can be identified directly from informal specifications
expressed in natural language. Alternatively, representative values may be selected
indirectly through a model, which can either be produced only for the sake of testing or
be available as part of the specification. In both cases, the aim of this step is to identify
the values for each input in isolation, either explicitly through enumeration or implicitly
through a suitable model, but not to select suitable combinations of such values (i.e., test
case specifications). In this way, we separate the problem of identifying the
representative values for each input from the problem of combining them to obtain
meaningful test cases, thus splitting a complex step into two simpler steps.
Most methods that can be applied to informal specifications rely on explicit enumeration
of representative values by the test designer.
Consider the input of a procedure that searches for occurrences of a complex pattern in
a Web database. Its input may be characterized by the length of the pattern and the
presence of special characters in the pattern, among other aspects. Interesting values
for the length of the pattern may be zero, one, or many. Interesting values for the
presence of special characters may be zero, one, or many. However, the combination of
value "zero" for the length of the pattern and value "many" for the number of special
characters in the pattern is clearly impossible.
The test case specifications represented by the Cartesian product of all possible inputs
must be restricted by ruling out illegal combinations and selecting a practical subset of
the legal combinations. Illegal combinations are usually eliminated by constraining the set
of combinations. For example, in the case of the complex pattern presented above, we
can constrain the choice of one or more special characters to a positive length of the
pattern, thus ruling out the illegal cases of patterns of length zero containing special
characters.
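The constraint on the pattern example can be expressed mechanically. The sketch below is illustrative (the rank encoding of the value classes is an assumption for the example): it enumerates the Cartesian product of the two characteristics and rules out combinations in which the pattern is too short to contain the chosen number of special characters.

```python
from itertools import product

# Value classes for two characteristics of the search pattern.
LENGTH = ["zero", "one", "many"]
SPECIAL_CHARS = ["zero", "one", "many"]
RANK = {"zero": 0, "one": 1, "many": 2}


def legal_combinations():
    """All (length, special-chars) pairs, minus impossible ones: a
    pattern cannot contain more special characters than its length."""
    return [(n, s) for n, s in product(LENGTH, SPECIAL_CHARS)
            if RANK[s] <= RANK[n]]
```

Of the nine raw combinations, three are ruled out, including the pattern of length zero containing many special characters.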
Selection of a practical subset of legal combinations can be done by adding information
that reflects the hazard of the different combinations as perceived by the test designer
or by following combinatorial considerations. In the former case, for example, we can
identify exceptional values and limit the combinations that contain such values. In the
pattern example, we may consider only one test for patterns of length zero, thus
eliminating many combinations that would otherwise be derived for patterns of length
zero. Combinatorial considerations reduce the set of test cases by limiting the number of
combinations of values of different inputs to a subset of the inputs. For example, we can
generate only tests that exhaustively cover all combinations of values for inputs
considered pair by pair.
Depending on the technique used to reduce the space represented by the Cartesian
product, we may be able to estimate the number of test cases generated and
modify the selected subset of test cases according to budget considerations. Subsets of
combinations of values (i.e., potential special cases) can often be derived from models
of behavior by applying suitable test selection criteria that identify subsets of interesting
behaviors among all behaviors represented by a model, for example by constraining the
iterations on simple elements of the model itself. In many cases, test selection criteria
can be applied automatically.
Generate Test Cases and Instantiate Tests The test generation process is completed
by turning test case specifications into test cases and instantiating them. Test case
specifications can be turned into test cases by selecting one or more test cases for
each test case specification. Test cases are implemented by creating the scaffolding
required for their execution.
[5]Here the word "user" designates the individual using the specified service. It can be the
user of the system, when dealing with a system specification, but it can be another
module of the system, when dealing with a unit specification.
In weighing the costs and benefits of a testing approach, it is important to evaluate all
relevant costs. For example, generating a large
number of random test cases may necessitate design and construction of sophisticated
test oracles, or the cost of training to use a new tool may exceed the advantages of
adopting a new approach.
Scaffolding costs Each test case specification must be converted to a concrete test
case, executed many times over the course of development, and checked each time for
correctness. If generic scaffolding code required to generate, execute, and judge the
outcome of a large number of test cases can be written just once, then a combinatorial
approach that generates a large number of test case specifications is likely to be
affordable. If each test case must be realized in the form of scaffolding code written by
hand - or worse, if test execution requires human involvement - then it is necessary to
invest more care in selecting small suites of test case specifications.
Many engineering activities require careful analysis of trade-offs. Functional testing is no
exception: Successfully balancing the many aspects is a difficult and often
underestimated problem that requires skilled designers. Functional testing is not an
exercise of choosing the optimal approach, but a complex set of activities for finding a
suitable combination of models and techniques that yield a set of test cases to satisfy
cost and quality constraints. This balancing extends beyond test design to software
design for test. Appropriate design not only improves the software development
process, but can greatly facilitate the job of test designers and lead to substantial
savings.
As illustrated in this chapter, test cases can be derived in two broad ways, either by
identifying representative values or by deriving a model of the unit under test. A
variety of formal models could be used in testing. The research challenge lies in
identifying a trade-off between costs of creating formal models and savings in
automatically generating test cases.
Another challenge is the development of a general framework for deriving test cases
from a range of formal specifications. Current research addresses techniques for
generating test cases from individual formal methods; generalizing these techniques
would allow more combinations of formal methods and testing.
Another important research area is fed by interest in different specification and design
paradigms (e.g., software architectures, software design patterns, and service-oriented
applications). Often these approaches employ new graphical or textual notations.
Research is active in investigating different approaches to automatically or
semiautomatically deriving test cases from these artifacts and studying the effectiveness
of existing test case generation techniques.
Increasing size and complexity of software systems is a challenge to testing. Existing
functional testing techniques do not take advantage of test cases available for parts of
the artifact under test. Compositional approaches that derive test cases for a given
system by taking advantage of test cases available for its subsystems are an important
open research problem.
Further Reading
Functional testing techniques, sometimes called black-box testing or specification-based
testing, are presented and discussed by several authors. Ntafos [DN81] makes the case
for random rather than systematic testing; Frankl, Hamlet, Littlewood, and Strigini
[FHLS98] is a good starting point to the more recent literature considering the relative
merits of systematic and statistical approaches.
Related topics
Readers interested in practical techniques for deriving functional test specifications from
informal specifications and models may continue with the next two chapters, which
describe several functional testing techniques. Readers interested in the
complementarities between functional and structural testing may continue with Chapters
12 and 13, which describe structural and data flow testing.
Exercises
10.1 In the Extreme Programming (XP) methodology (see the sidebar on page 381), a
written description of a desired feature may be a single sentence, and the first
step to designing the implementation of that feature is designing and implementing
a set of test cases. Does this aspect of the XP methodology contradict our
assertion that test cases are a formalization of specifications?
10.2
1. Compute the probability of selecting a test case that reveals the fault in
line 19 of program Root of Figure 10.1 by randomly sampling the input
domain, assuming that type double has range -2^31 ... 2^31 - 1.
2. Compute the probability of randomly selecting a test case that reveals a
fault if lines 13 and 19 were both missing the condition a = 0.
Required Background

Chapter 10: Understanding the limits of random testing and the need for a systematic
approach motivates the study of combinatorial as well as model-based testing
techniques. The general functional testing process illustrated in Section 10.3
helps position combinatorial techniques within the functional testing process.
11.1 Overview
In this chapter, we introduce three main techniques that are successfully used in
industrial environments and represent modern approaches to systematically derive test
cases from natural language specifications: the category-partition approach to identifying
attributes, relevant values, and possible combinations; combinatorial sampling to test a
large number of potential interactions of attributes with a relatively small number of
inputs; and provision of catalogs to systematize the manual aspects of combinatorial
testing.
The category-partition approach separates identification of the values that characterize
the input space from the combination of different values into complete test cases. It
provides a means of estimating the number of test cases early, sizing a subset of cases
to meet cost constraints, and monitoring testing progress.
Pairwise and n-way combination testing provide systematic ways to cover interactions
among particular attributes of the program input space with a relatively small number of
test cases. Like the category-partition method, these techniques separate identification
of characteristic values from generation of combinations, but they provide greater
control over the number of combinations generated.
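The idea behind pairwise coverage can be sketched with a naive greedy construction. This is an illustration only, not one of the optimized algorithms used by real pairwise tools, and the parameter names in the usage are invented for the example.

```python
from itertools import combinations, product


def pairwise_suite(params):
    """Greedily pick tests from the full Cartesian product until every
    pair of values of every two parameters is covered.
    params: dict mapping parameter name -> list of values."""
    names = list(params)
    uncovered = {((i, a), (j, b))
                 for i, j in combinations(names, 2)
                 for a in params[i] for b in params[j]}
    candidates = [dict(zip(names, vs)) for vs in product(*params.values())]
    suite = []
    while uncovered:
        # Choose the candidate covering the most still-uncovered pairs.
        best = max(candidates,
                   key=lambda t: sum(((i, t[i]), (j, t[j])) in uncovered
                                     for i, j in combinations(names, 2)))
        suite.append(best)
        uncovered -= {((i, best[i]), (j, best[j]))
                      for i, j in combinations(names, 2)}
    return suite
```

For three parameters with three values each, the full product has 27 combinations, while a pairwise-adequate suite needs as few as 9 tests; the greedy sketch lands in between.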
The manual step of identifying attributes and representative sets of values can be made
more systematic using catalogs that aggregate and synthesize the experience of test
designers in a particular organization or application domain. Some repetitive steps can
be automated, and the catalogs facilitate training for the inherently manual parts.
These techniques address different aspects and problems in designing a suite of test
cases from a functional specification. While one or another may be most suitable for a
specification with given characteristics, it is also possible to combine ideas from each.
An experienced test designer has seen enough failures on "degenerate" inputs to test
empty collections wherever a collection is allowed.
The number of options that can (or must) be configured for a particular model of
computer may vary from model to model. However, the category-partition method
makes no direct provision for structured data, such as sets of (slot, selection) pairs. A
typical approach is to "flatten" collections and describe characteristics of the whole
collection as parameter characteristics. Typically the size of the collection (the length of
a string, for example, or in this case the number of required or optional slots) is one
characteristic, and descriptions of possible combinations of elements (occurrence of
special characters in a string, for example, or in this case the selection of required and
optional components) are separate parameter characteristics.
Suppose the only significant variation among (slot, selection) pairs was between pairs
that are compatible and pairs that are incompatible. If we treated each pair as a
separate characteristic, and assumed n slots, the category-partition method would
generate all 2^n combinations of compatible and incompatible slots. Thus we might have a
test case in which the first selected option is compatible, the second is compatible, and
the third incompatible, and a different test case in which the first is compatible but the
second and third are incompatible, and so on. Each of these combinations could be
combined in several ways with other parameter characteristics. The number of
combinations quickly explodes. Moreover, since the number of slots is not actually fixed,
we cannot even place an upper bound on the number of combinations that must be
considered. We will therefore choose the flattening approach and select possible
patterns for the collection as a whole.
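The flattening can be sketched as follows: instead of one characteristic per pair, the collection is summarized by its size and by a single content pattern. The pattern names here are illustrative assumptions, not the book's vocabulary.

```python
def flatten(selections):
    """Summarize a collection of (slot, selection) compatibility flags
    as two parameter characteristics: size and content pattern.
    selections: list of booleans, True = compatible with its slot."""
    size = len(selections)
    if size == 0:
        pattern = "empty"
    elif all(selections):
        pattern = "all compatible"
    elif not any(selections):
        pattern = "all incompatible"
    else:
        pattern = "mixed"
    return size, pattern
```

The number of characteristics no longer grows with the number of slots: any n-slot configuration maps to its size plus one of four patterns.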
Identifying and Bounding Variation
It may seem that drawing a boundary between a fixed program and a variable set of
parameters would be the simplest of tasks for the test designer. It is not always so.
Consider a program that produces HTML output. Perhaps the HTML is based on a
template, which might be encoded in constants in C or Java code, or might be
provided through an external data file, or perhaps both: it could be encoded in a C or
Java source code file that is generated at compile time from a data file. If the HTML
template is identified in one case as a parameter to be varied in testing, it seems it
should be so identified in all three of these variations, or even if the HTML template is
embedded directly in print statements of the program, or in an XSLT transformation
script.
The underlying principle for identifying parameters to be varied in testing is
anticipation of variation in use. Anticipating variation is likewise a key part of
architectural and detailed design of software. In a well-designed software system,
module boundaries reflect "design secrets," permitting one part of a system to be
modified (and retested) with minimum impact on other parts. The most frequent
changes are facilitated by making them input or configurable options. The best
software designers identify and document not only what is likely to change, but how
often and by whom. For example, a configuration or template file that may be
modified by a user will be clearly distinguished from one that is considered a fixed
part of the system.
Ideally the scope of anticipated change is both clearly documented and consonant
with the program design. For example, we expect to see client-customizable aspects
of HTML output clearly isolated and documented in a configuration file, not embedded
in an XSLT script file and certainly not scattered about in print statements in the code.
Thus, the choice to encode something as "data" rather than "program" should at least
be a good hint that it may be a parameter for testing, although further consideration
of the scope of variation may be necessary. Conversely, defining the parameters for
variation in test design can be part of the architectural design process of setting the
scope of variation anticipated for a given product or release.
Should the representative values of the flattened collection of pairs be one compatible
selection, one incompatible selection, all compatible selections, all incompatible
selections, or should we also include mixes of two or more compatible and two or more
incompatible selections? Certainly the latter is more thorough, but whether there is
sufficient value to justify the cost of this thoroughness is a matter of judgment by the test
designer.
We have oversimplified by considering only whether a selection is compatible with a slot.
It might also happen that the selection does not appear in the database. Moreover, the
selection might be incompatible with the model, or with a selected component of another
slot, in addition to the possibility that it is incompatible with the slot for which it has been
selected. If we treat each such possibility as a separate parameter characteristic, we
will generate many combinations, and we will need semantic constraints to rule out
combinations like "there are three options, at least two of which are compatible with the
model and two of which are not, and none of which appears in the database." On the
other hand, if we simply enumerate the combinations that do make sense and are worth
testing, then it becomes more difficult to be sure that no important combinations have
been omitted. Like all design decisions, the way in which collections and complex data
are broken into parameter characteristics requires judgment based on a combination of
analysis and experience.
B. Identify Representative Values This step consists of identifying a list of
representative values (more precisely, a list of classes of values) for each of the
parameter characteristics identified during step A. Representative values should be
identified for each category independently, ignoring possible interactions among values
of other categories, which are considered later when combinations are selected.
Parameter: Model

Model number:
    malformed                       [error]
    not in database                 [error]
    valid

Number of required slots for selected model (#SMRS):
    0                               [single]
    1                               [property RSNE] [single]
    many                            [property RSNE] [property RSMANY]

Number of optional slots for selected model (#SMOS):
    0                               [single]
    1                               [property OSNE] [single]
    many                            [property OSNE] [property OSMANY]

Parameter: Components

Correspondence of selection with model slots:
    omitted slots                   [error]
    extra slots                     [error]
    mismatched slots                [error]
    complete correspondence

Number of required components with nonempty selection:
    0                               [if RSNE] [error]
    < number of required slots      [if RSNE] [error]
    = number of required slots      [if RSMANY]

Required component selection:
    some default                    [single]
    all valid
    >= 1 incompatible with slot
    >= 1 incompatible with another selection
    >= 1 incompatible with model
    >= 1 not in database            [error]

Number of optional components with nonempty selection:
    0
    < number of optional slots      [if OSNE]
    = number of optional slots      [if OSMANY]

Optional component selection:
    some default                    [single]
    all valid
    >= 1 incompatible with slot
    >= 1 incompatible with another selection
    >= 1 incompatible with model
    >= 1 not in database            [error]

Parameter: Product database

Number of models in database (#DBM):
    0                               [error]
    1                               [single]
    many

Number of components in database (#DBC):
    0                               [error]
    1                               [single]
    many
Identifying relevant values is an important but tedious task. Test designers may improve
manual selection of relevant values by using the catalog approach described in Section
11.4, which captures the informal approaches used in this section with a systematic
application of catalog entries.
C. Generate Test Case Specifications A test case specification for a feature is given
as a combination of value classes, one for each identified parameter characteristic.
Unfortunately, the simple combination of all possible value classes for each parameter
characteristic results in an unmanageable number of test cases (many of which are
impossible) even for simple specifications. For example, in Table 11.1 we find 7
categories with 3 value classes, 2 categories with 6 value classes, and one with 4
value classes, potentially resulting in 3^7 × 6^2 × 4 = 314,928 test cases, which would be
acceptable only if the cost of executing and checking each individual test case were very
small. However, not all combinations of value classes correspond to reasonable test
case specifications. For example, it is not possible to create a test case from a test
case specification requiring a valid model (a model appearing in the database) where
the database contains zero models.
The category-partition method allows one to omit some combinations by indicating value
classes that need not be combined with all other values. The label [error] indicates a
value class that needs to be tried only once, in combination with non-error values of other
parameters. When [error] constraints are considered in the category-partition
specification of Table 11.1, the number of combinations to be considered is reduced to
1 × 3 × 3 × 1 × 1 × 3 × 5 × 5 × 2 × 2 + 11 = 2711. Note that we have treated "component
not in database" as an error case, but have treated "incompatible with slot" as a normal
case of an invalid configuration; once again, some judgment is required.
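The arithmetic can be checked directly from the value-class counts quoted above. This is a sketch; the per-category counts are transcribed from Table 11.1.

```python
from math import prod

# Unconstrained: 7 categories with 3 value classes, 2 with 6, 1 with 4.
unconstrained = prod([3] * 7 + [6] * 2 + [4])

# With [error] constraints: only non-error classes are combined freely;
# each of the 11 error classes contributes a single extra test.
non_error_classes = [1, 3, 3, 1, 1, 3, 5, 5, 2, 2]
error_classes = 11
with_error_constraints = prod(non_error_classes) + error_classes
```

Here `unconstrained` evaluates to 314,928 and `with_error_constraints` to 2,711, matching the figures in the text.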
Although the reduction from 314,928 to 2,711 is impressive, the number of derived test
cases may still exceed the budget for testing such a simple feature. Moreover, some
values are not erroneous per se, but may only be useful or even valid in particular
combinations. For example, the number of optional components with non-empty
selection is relevant to choosing useful test cases only when the number of optional slots
is greater than 1. A number of non-empty choices of required component greater than
zero does not make sense if the number of required slots is zero.
Erroneous combinations of valid values can be ruled out with the property and if-property
constraints. The property constraint groups values of a single parameter characteristic
to identify subsets of values with common properties. The property constraint is
indicated with label property PropertyName, where PropertyName identifies the property
for later reference. For example, property RSNE (required slots non-empty) in Table
11.1 groups values that correspond to non-empty sets of required slots for the
parameter characteristic Number of Required Slots for Selected Model (#SMRS), i.e.,
values 1 and many. Similarly, property OSNE (optional slots non-empty) groups nonempty values for the parameter characteristic Number of Optional Slots for Selected
Model (#SMOS).
The if-property constraint bounds the choices of values for a parameter characteristic
that can be combined with a particular value selected for a different parameter
characteristic. The if-property constraint is indicated with label if PropertyName, where
PropertyName identifies a property defined with the property constraint. For example,
the constraint if RSNE attached to value 0 of parameter characteristic Number of
required components with non-empty selection limits the combination of this value with
values 1 and many of the parameter characteristics Number of Required Slots for
Selected Model (#SMRS). In this way, we rule out illegal combinations like Number of
required components with non-empty selection = 0 with Number of Required Slots for
Selected Model (#SMRS) = 0.
The property and if-property constraints introduced in Table 11.1 further reduce the
number of combinations to be considered to 1 × 3 × 1 × 1 × (3 + 2 + 1) × 5 × 5 × 2 × 2
+ 11 = 1811.
The number of combinations can be further reduced by iteratively adding property and
if-property constraints and by introducing the new single constraint, which is indicated
with label single and acts like the error constraint, i.e., it limits the number of
occurrences of a given value in the selected combinations to 1.
Test designers can introduce new property, if-property, and single constraints to reduce
the total number of combinations when needed to meet budget and schedule limits.
Placement of these constraints reflects the test designer's judgment regarding
combinations that are least likely to require thorough coverage.
The single constraints introduced in Table 11.1 reduce the number of combinations to
be considered to 1 × 1 × 1 × 1 × 1 × 3 × 4 × 4 × 1 × 1 + 19 = 67, which may be a
reasonable balance between cost and quality for the considered functionality. The
number of combinations can also be reduced by applying the pairwise and n-way
combination testing techniques, as explained in the next section.
The set of combinations of value classes for the parameter characteristics can be turned
into test case specifications by simply instantiating the identified combinations. Table
11.2 shows an excerpt of test case specifications. The error tag in the last column
indicates test case specifications corresponding to the error constraint. Corresponding
test cases should produce an error indication. A dash indicates no constraints on the
choice of values for the parameter or environment element.
Table 11.2: An excerpt of test case specifications derived from the value classes given
in Table 11.1

Model#    | # Required | # Optional | Corr. w/    | # Required   | # Optional   | Required   | Optional
          | slots      | slots      | model slots | components   | components   | components | components
          |            |            |             | w/ selection | w/ selection | selection  | selection
----------+------------+------------+-------------+--------------+--------------+------------+-----------
malformed | many       | many       | same        | EQR          | -            | all valid  | all valid
not in DB | many       | many       | same        | EQR          | -            | all valid  | all valid
valid     | -          | many       | same        | -            | -            | all valid  | all valid
valid     | many       | many       | same        | EQR          | EQO          | in-other   | in-mod
valid     | many       | many       | same        | EQR          | EQO          | in-mod     | all valid
valid     | many       | many       | same        | EQR          | EQO          | in-mod     | in-slot
valid     | many       | many       | same        | EQR          | EQO          | in-mod     | in-other
valid     | many       | many       | same        | EQR          | EQO          | in-mod     | in-mod

Legend:
EQR      = # required slots
EQO      = # optional slots
in-mod   = >= 1 incompatible with model
in-other = >= 1 incompatible with another slot
in-slot  = >= 1 incompatible with slot
Choosing meaningful names for parameter characteristics and value classes allows
(semi)automatic generation of test case specifications.
[1]At this point, readers may ignore the items in square brackets, which indicate
constraints identified in step C of the category-partition method.