
Software Evolution

Andy Zaidman, Martin Pinzger, Arie van Deursen

Software Engineering Research Group


Delft University of Technology
The Netherlands
{[Link], [Link], [Link]}@[Link]

Abstract

Software evolution is the term used in software engineering to refer to the process of developing an initial version of the software and then repeatedly updating it to satisfy the user's needs. Software evolution is an inevitable activity, as useful and successful software stimulates users to request new and improved features. However, evolving a software system is typically difficult and costly. In this chapter, we provide a historical overview of the field and survey four important research areas: program comprehension, reverse engineering, reengineering, and software repository mining. We report on key approaches and results, and indicate a number of challenges open to ongoing and future research in software evolution.

Keywords: software evolution, software reengineering, software reverse engineering, software repository mining, program comprehension, static analysis, dynamic analysis

1 Introduction

In a recent advertisement aimed at recruiting new software engineers, ING, one of the largest European banks, summarized some of its key information technology figures1. ING serves over 85 million customers and conducts 16 million financial transactions each day, making use of 75,000 computers running over 3000 different applications and storing 10 petabytes of data. Furthermore, the underlying software systems are not static, but subject to continuous evolution: ING indicated it was involved in 1449 change projects.

The situation at ING illustrates what is common in many software-intensive organizations: software systems are business critical and are required to run continuously, often on a 7 × 24 hour basis. The systems are complex in nature, and should operate under hard-to-meet performance and scalability criteria. Under these constraints, engineers are required to adjust the systems to new business opportunities, emerging technologies, changes in the law, and so on. These changes are to be made in a cost-effective manner, without loss of quality of the existing functionality.

Making changes to existing software systems is one of the key software engineering challenges, and is covered by the field of software evolution. We provide a historical overview of this field, and survey four important research areas in software evolution:

∙ Program comprehension, addressing the difficulties people have in understanding the structure of complex software systems;

∙ Reverse engineering, comprising methods and techniques to distill abstract information from existing systems;

∙ Reengineering, offering tools and techniques to restructure existing software systems; and

∙ Software repository mining, involving the study of historical data collected in bug tracking systems, revision control systems, mailing lists, etc. in order to increase our understanding of the underlying systems.

These topics are addressed in Sections 3-6, after which we conclude with pointers to research venues and avenues for future research.

1 Automatisering Gids, nr. 42, 16 oktober 2009.
2 Historical Perspective

In the 1960s awareness grew that manufacturing software should be less ad hoc, and instead should be based on theoretical foundations and practical disciplines. This awareness culminated in the organization of the first software engineering conference in 1967 by the NATO Science Committee [28].

In 1970, inspired by established engineering disciplines, Royce proposed the waterfall life-cycle process for software development [38]. Of particular importance in this model was the definition of the maintenance phase for software systems, which was considered the final phase of the software life-cycle and which happened after its deployment. The IEEE 1219 standard defines software maintenance as: "the modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a modified environment."

It took a while before software engineers became aware of the inherent limitations of this software process model, namely the fact that the separation into phases was too strict and inflexible, and that it is often unrealistic to assume that the requirements are known before starting the software design phase. In the late seventies, a first attempt was made towards a more evolutionary process model, the so-called "change mini-cycle" proposed by Yau et al. [41].

Also in the seventies, Manny Lehman started to formulate his laws of software evolution, see Table 1. The postulated laws were based on earlier work carried out by Lehman to understand the change process being applied to IBM's OS 360 operating system. His original findings were confirmed in later studies involving other software systems [25]. This was probably the first time that the term software evolution was explicitly used to stress the difference with the post-deployment activity of software maintenance.

1 Continuing Change: Systems must be continually adapted, else they become progressively less satisfactory.
2 Increasing Complexity: As a system evolves, its complexity increases unless work is done to maintain or reduce it.
3 Self Regulation: System evolution processes are self-regulating, with distribution of product and process measures close to normal.
4 Conservation of Organizational Stability: Unless feedback mechanisms are appropriately adjusted, the average effective global activity rate in an evolving system tends to remain constant over the product lifetime.
5 Conservation of Familiarity: As a system evolves, all associated with it (developers, sales personnel, users, for example) must maintain mastery of its content and behavior to achieve satisfactory evolution. Excessive growth diminishes that mastery. Hence the average incremental growth remains invariant as the system evolves.
6 Continuing Growth: The functional capability of systems must be continually increased to maintain user satisfaction over the system lifetime.
7 Declining Quality: Unless rigorously adapted to take into account changes in the operational environment, the quality of a system will appear to be declining.
8 Feedback System: Evolution processes are multi-level, multi-loop, multi-agent feedback systems.

Table 1. Laws of Software Evolution

Nevertheless, it took until the nineties before the term software evolution gained widespread acceptance. Also around this time evolutionary processes such as Boehm's spiral model gained acceptance. In that same category, Bennett and Rajlich's staged model explicitly takes into account the inevitable problem of software aging [36]. After initial development of a first running version, the "evolution stage" allows for any kind of modification to the software, as long as the architectural integrity remains preserved. If this is no longer the case, there is a loss of evolvability and the "servicing stage" starts. During this stage, only small patches can be applied to keep the software up and running.

Software evolution is also a crucial ingredient of so-called agile software development, of which extreme programming (XP) [2] is probably the most famous proponent. In brief, agile software development is a lightweight iterative and incremental (evolutionary) approach to software development. It takes into account that software is created in a highly collaborative manner and explicitly accommodates the changing needs of its stakeholders, even late in the development cycle.

Nowadays, software evolution has become a very active and well-respected field of research in software engineering, and the terms software evolution and software maintenance are often used as synonyms. The fact that the software evolution research area is so active can in part be explained by the fact that software systems have a prolonged lifetime: some of the early systems written in the 1960s and 1970s are still in use today. These so-called legacy systems are still crucial to the business environment, and simply replacing them might involve high risk and high cost. Still, these systems must also evolve in order to stay useful in today's operating context. That is why sub-fields such as program comprehension, reverse engineering, mining software repositories, testing, impact analysis, cost estimation, software configuration management and re-engineering are so important to understand the software and enable the continued evolution of these and more modern software systems.

3 Program Comprehension

Having a sufficient understanding of the software system is a necessary prerequisite for successfully accomplishing many software engineering activities, including software evolution and software re-engineering related tasks.

Definition. Program comprehension is the task of building mental models of an underlying software system at various levels of abstraction, ranging from models of the code itself to ones of the underlying application domain [33].

Being aware of the fact that almost all software evolution activities require understanding of the software system, the link between software evolution and program understanding becomes immediately clear. Perhaps less well known is the fact that building up this knowledge can take up to 60% of the time allocated for a particular task, making program comprehension a costly necessity [10]. Von Mayrhauser and Vans have studied how software engineers go about their comprehension process and have identified three distinct strategies: a top-down model, a bottom-up model, or a mix of the previous two, the so-called integrated model [39].

Top-down understanding typically applies when the code, problem domain and/or solution space are familiar to the software engineer. Because of these similarities, the software engineer typically forms a number of hypotheses about the structure of the system. Subsequently, hypotheses

are iteratively refined, passing several levels until they can be matched to specific code in the program.

Pennington found that the bottom-up program comprehension model is often used when the code and/or problem domain are not familiar to the software engineer [37]. This way of understanding starts at the code level: while identifying elementary blocks of source code in the program, microstructures are chunked together to form macrostructures, and macrostructures are linked to each other via cross-referencing. Another bottom-up approach that can be employed is the so-called situation model, which concentrates on a dataflow/functional abstraction instead of relying on control-flow, as described earlier.

Finally, the integrated model for program comprehension involves top-down and bottom-up comprehension, and also a knowledge base. The knowledge base, which typically is the human mind, stores (1) any new information that is obtained directly from the application through either of the two program comprehension strategies or (2) information that is inferred. In practice, the integrated model is frequently used when trying to understand large-scale systems, largely because software engineers are typically familiar with certain parts of the source code, while they are less familiar with other parts.

4 Reverse Engineering

Usually, the system's maintainers are not the software engineers that originally designed the system. Thus, before making changes to the software they must first build up sufficient knowledge of the software system at hand. In Section 3 on program comprehension, we have seen that this process can take up to 60% of the allocated time. It is in this context that reverse engineering tools can play an important role, as they can facilitate the program comprehension process.

In their seminal paper, Chikofsky and Cross define reverse engineering as follows [9]: "reverse engineering is the process of analyzing a subject system to identify the system's components and their interrelationships and create representations of the system in another form or at a higher level of abstraction." Reverse engineering has traditionally been viewed as a two-step process: information extraction and abstraction. Information extraction analyses the subject system artifacts to gather raw data, whereas abstraction creates user-oriented documents and views. The primary purpose of reverse engineering a software system is to increase the overall comprehensibility of the system for both maintenance and new development. In particular, reverse engineering provides ways to [9]:

Cope with complexity. It allows one to better deal with the sheer volume and complexity of systems, by (automatically) abstracting to a higher level.

Generate alternate views. Tools facilitate the (re)generation of graphical representations. In particular, these tools often allow one to generate alternate views, thereby offering the chance to study the system from a different perspective.

Recover lost information. In continuously evolving systems, modifications are frequently not reflected in documentation. In this context, reverse engineering allows one to recover designs.

Detect side effects. Comparing the initial designs with the current designs as obtained from reverse engineering tools allows one to spot deviations from the original design plans.

Synthesize higher abstractions. Reverse engineering tools frequently create alternate views at higher levels of abstraction.

Facilitate reuse. Reverse engineering can help detect candidates for reusable software components in software systems.

Reverse engineering techniques can be classified according to the artifacts that they analyze: static analysis extracts properties from software systems through the analysis of source code, documentation, architectural diagrams, or de-
sign information. Dynamic analysis, meanwhile, analyses data gathered from a running program and thus studies the actual behavior of the software. Hybrid approaches combine static and dynamic analysis. We will now discuss static and dynamic analysis in more depth.

4.1 Static Analysis

Static analysis, or the analysis of source code, documentation, architectural diagrams, or design information, has been successfully applied in a number of areas. A key challenge in static analysis-based reverse engineering is to abstract low-level information, e.g., source code, into more manageable and easier to understand higher-level abstractions, e.g., control-flow diagrams. Apart from abstracting, another challenge is to establish links between different artifacts, e.g., between existing documentation or requirements and the source code.

Low-level examples of static analysis are the generation of control-flow diagrams and the use of data-flow analysis to better understand software systems in the small. Beyond program dependence graphs and techniques like slicing [40], which enable program comprehension in the large, many other static analysis techniques have successfully been applied. Some examples of problem areas where reverse engineering with static analysis has been successfully applied are [7]: redocumenting programs and relational databases, identifying reusable assets, recovering architectures, recovering design patterns, building traceability links between code and documentation, identifying code clones, code smells and aspects, performing impact analysis, and many more.

4.2 Dynamic Analysis

Dynamic analysis, or the analysis of data gathered from a running program, has the potential to provide an accurate picture of a software system because it exposes the system's actual behavior. Among the benefits over static analysis are the availability of runtime information and, in the context of object-oriented software, the exposure of object identities and the actual resolution of late binding. Drawbacks are that dynamic analysis can only provide a partial picture of the system, i.e., the results obtained are valid only for the scenarios that were exercised during the analysis, and that dynamic analysis typically involves the collection and analysis of large amounts of data, often introducing scalability issues with tools when analyzing large software systems.

Cornelissen et al. have performed a survey of dynamic analysis techniques for program understanding purposes [12]. From this broad overview, we see that research in this area revolves around creating ultra-scalable visualizations, e.g., Extravis [11] (see Figure 1), and coming up with sensible abstraction mechanisms for trace data in order to overcome scalability issues [43]. Research in this area has resulted in tools that support feature localization, bug localization, etc.

Figure 1. Extravis, an example of visualization of dynamic analysis [11].

4.3 Tools

A number of reverse engineering tools have been produced by the research community. Two of these tools have become real frameworks enabling a wide range of static and/or dynamic analyses for reverse engineering. These tools are Moose2, a platform for software analysis originally developed at the University of Berne, Switzerland, and Rigi3, an interactive, visual tool to understand and re-document software, developed at the University of Victoria, BC, Canada.

2 [Link]
3 [Link]

5 Reengineering

From the laws of software evolution that we discussed in Section 2 we intuitively understand that systems must continuously be adapted to meet changing requirements from their users (law 1) and that we should take preventive actions to reduce the increasing complexity that results from successive adaptations of the software (law 2). Reengineering is the term coined for the process of reorganizing and modifying an existing software system with the aim of making the system easier to maintain. Chikofsky and Cross define reengineering as follows [9]: "The examination and alteration of a system to reconstitute it in a new form". Note that this definition implies that reverse engineering (the examination part) is often part of the reengineering cycle. In the following, we introduce legacy software systems and subsequently ways to measure and resolve typical shortcomings in them.
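One widely used measurement of such shortcomings is code complexity, discussed in Section 5.2 in the form of McCabe's cyclomatic complexity. As a rough sketch only (a hypothetical helper, not a tool from this chapter), the metric can be approximated for Python code by counting branch points with the standard ast module; the true metric is defined on the control-flow graph, so this is an approximation.

```python
import ast

# Node types that introduce additional linearly independent paths.
# This set is an approximation of McCabe's decision points.
_BRANCHES = (ast.If, ast.For, ast.While, ast.IfExp,
             ast.ExceptHandler, ast.And, ast.Or)

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of branch points."""
    tree = ast.parse(source)
    branches = sum(isinstance(node, _BRANCHES) for node in ast.walk(tree))
    return 1 + branches

example = """
def classify(x):
    if x < 0:
        return "negative"
    elif x == 0:
        return "zero"
    return "positive"
"""
print(cyclomatic_complexity(example))  # 3: two if-branches plus one
```

The value 3 matches the intuition from the text: there are three linearly independent paths through classify.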
5.1 Legacy Software Systems

There is a clear connection between reengineering and legacy software systems. Often, reengineering is the most cost-effective option for an organization to extend the life of its software systems. Legacy software is software that is still very much useful to an organization – quite often even indispensable – but the evolution of which has become a great burden [3]. Brodie and Stonebraker give an apt description of a legacy system [5]: "Any information system that significantly resists modification and evolution to meet new and constantly changing business requirements." Note that this definition implies that age is no criterion when considering whether a system is a legacy system [14]: even relatively new systems can be considered legacy systems if they are of high value to the organization and resist evolution.

Legacy software is omnipresent: think of the large software systems that were designed and first put to use in the 1960s or 1970s; these software systems are nowadays often the backbone of large multinational corporations. For banks, healthcare institutions, etc. these systems are vital to daily operations. As such, failure of these software systems is not an option, and that is why these trusted "oldtimers" are still cared for every day. Furthermore, they are still being evolved to keep up with current and future business requirements. This is where reengineering approaches can be useful: reengineering makes the future evolution of these legacy systems easier.

5.2 Software Metrics

Considering the quality of a software system requires a multidimensional viewpoint. Indeed, the ISO 9126 standard identifies six key quality attributes for computer software: (1) functionality, (2) reliability, (3) usability, (4) efficiency, (5) maintainability, and (6) portability. From a software evolution perspective, the fifth quality attribute, maintainability, is of great importance, as this attribute determines the ease with which the software can be adapted. In particular, a number of internal properties of the software can be measured and captured with metrics, and a number of these metrics have been shown to affect the maintainability of the software system [24]. For example, complexity is often correlated with maintenance effort; that is, the more complex the code, the more effort is required to maintain it. One of the most frequently used measures during maintenance is the McCabe cyclomatic complexity [27], which measures the number of linearly independent paths through the code. Other measures that influence the maintainability are the more traditional object-oriented design metrics (see Chidamber and Kemerer [8]) and simple metrics like lines of code, number of comment lines, number of modules, number of methods/procedures per class/module, etc.

5.3 Code Smells and Problem Detection

Code smells are symptoms in the source code or the behavior of a software system that possibly indicate a deeper problem [18]. Many code smells are associated with difficulties in maintaining the software system, e.g., the duplicated code smell. In general, code smells can be identified in several ways: through the use of code metrics, through particular relations between source code elements, through specific behavior of the software system, or through a series of changes that have been made to the code.

Table 2 gives an overview of some common code smells.

Duplicate code: identical or very similar pieces of source code
God class: an object that controls too many other objects in the system; an object that does everything
Long method: a method that tries to do too many things
Large class: a class that contains too many subparts and methods
Long parameter list: a method that tries to do too much, too far away from home, with too many subparts of the system
Divergent change: occurs when a class is frequently changed in different ways for different reasons; an early sign of a god class in the making
Shotgun surgery: each time you want to make a single, seemingly coherent change, you have to change lots of classes in little ways
Feature envy: a method seems more interested in another class than the one it is defined in
Inappropriate intimacy: a method that has too much intimate knowledge of another class or method's inner workings, inner data, etc.

Table 2. An overview of some typical code smells [18]

Fowler considers duplicated code the number one code smell, a smell that should absolutely be avoided. Duplicated pieces of source code are typically referred to as 'clones', which Basit and Jarzabek define as [1]: "code fragments of considerable length and significant similarity". Because code clones are considered an important code smell, the area has seen great interest from the research community. This interest can be explained by the fact that code clones are seen as a plausible cause of increased maintenance effort: changes have to be made multiple times because the code is redundant. Another important reason for the research interest in code clones is that clones can lead to bugs. In particular, when only a few instances of a cloning relation are changed, there might be side effects from those instances that have erroneously not been changed.

Research in the area has on the one hand concentrated on developing code clone detection and removal techniques [23]. Many of these removal techniques are based upon series of refactorings (see Section 5.4). On the other hand, researchers have also concentrated on code clone management, which entails that the Integrated Development Environment (IDE) the software engineer is using is aware of the clones and can warn the software engineer when he is actually making changes to an instance of a code clone [15].

Nevertheless, code cloning is not all bad. Kapser and Godfrey's study has shown that code cloning is sometimes a purposeful implementation strategy that makes sense under certain circumstances [22].

5.4 Refactoring

Refactoring is a disciplined technique for restructuring an existing body of code [18]. Crucial to this restructuring approach is that while the internal structure of the software is improved, the external behavior is not changed. Typically, refactoring a software system is composed of a series

of small behavior-preserving transformations. These small steps ensure that (1) there is less chance that something goes wrong, and (2) it is easier to verify, by making use of unit testing [31], that the behavior has indeed been preserved.

One should, however, be careful that the refactorings do not break the unit tests and thus invalidate the safety net that the unit tests provide. Research has shown that at least 20 of the refactorings listed by Fowler in [18] break unit tests [31]. This in turn implies that unit tests can also become an evolutionary burden, because they too need to be refactored.

6 Mining Software Repositories

With the advent of open source projects, configuration management systems, bug-tracking systems, and, last but not least, the Internet as a communication platform, new information sources became available for empirical software engineering research. The growing interest in this research led to special tracks at software engineering conferences and later on to workshops and conferences of its own. The most popular venue, Mining Software Repositories (MSR)4, addresses the following research topics:

∙ Meta models to integrate, represent, and exchange information stored in software repositories;

∙ Meta models to model social, organizational, and software development processes;

∙ Meta models to represent software quality aspects;

∙ Mining techniques to analyze the information;

∙ Application areas in which the results can be applied, e.g., search-based software engineering, software evolution analysis, software reliability assessment, cost estimation, bug and change impact analysis;

∙ Visualization techniques to represent data and mining results.

4 More information on this series of conferences can be found at: [Link]

A survey of corresponding approaches for mining software repositories has been presented by Kagdi et al. in [21]. We further refer interested readers to the special issue on MSR edited by Nagappan et al. [35].

One of the main objectives is to improve current software development practices by learning from the historical evidence stored in software archives. In the following, we list the main research directions of software repository mining and for each summarize key approaches and research results.

6.1 Data Modeling

Recent literature highlights source-control systems, defect-tracking systems, and mailing list archives as the main data sources for MSR. Source-control systems, such as CVS or Subversion, are used for managing and storing the various versions of source code artifacts and the modification reports that led to the new versions. Defect-tracking systems, such as Bugzilla or Jira, are used to report and manage defects and enhancement requests. Mailing lists keep track of discussions between software developers, testers, and also end users. Furthermore, the advent of Web 2.0 technologies (i.e., wikis, blogs, Twitter, etc.) provides various new alternatives for sharing information and discussions about the process, design, implementation, etc. of software systems.

One of the key problems faced by researchers in the domain of mining software repositories is to extract and link all the different information sources to obtain a more complete picture of the software project. As noted by Kagdi et al., software repositories vary in their usage, information content, and storage format. Typically, they are managed and operated in isolation and have no explicit direct relationship with each other [21]. Furthermore, not all software projects follow the same development process (or a development process at all). For example, while some projects agreed on using a versioning system, they do not use a bug-
tracking system. Even when a project team is using a versioning system, a bug-tracking system, and mailing lists, they operate them in isolation. There is no strict process rule that requires developers to link commits to bugs and to comments in mailing lists. Last but not least, these tools were explicitly made for developing software, not for performing software archeology as done with software repository mining.

Improvements, and a step towards a more measurable software development process, can be expected from recent trends in Integrated Development Environments (e.g., IBM's Jazz, Microsoft Team Foundation Server) and open source products, such as Hudson5, CruiseControl6, and Continuum7. They provide support for continuous integration, which better integrates the various tools and repositories used for developing software systems.

A number of approaches have been introduced in the past to re-establish the links between the information residing in the different software repositories. The main objective is to obtain a common information source which can then be input into mining algorithms to analyze various aspects of the software system. In the following we summarize approaches that have been frequently used and cited by the MSR community.

The release history database (RHDB) integrates data from versioning systems such as CVS or SVN with bug report data obtained from Bugzilla [17]. Figure 2 depicts the corresponding data model, in which the link is denoted by the association between the FileVersion and BugReport entities.

The file history of each source file is extracted from the log information queried from the versioning system, for example with the CVS log command. Bug report data is obtained from the Bugzilla repository by requesting the bug in XML format. The XML is then parsed and the extracted information is stored in the RHDB. The links between file revisions and bug reports are established in a post-processing step. Regular expressions, such as

bugi?d?:?=?\s*#?\s*(\d\d\d+)(.*)

are used to find bug numbers in the log messages stored with each file revision. Whenever a valid bug number is found, a link to that bug report is stored in the database. A similar approach is Hipikat, which builds a "group memory" for a project out of file versions, bug reports, newsgroup and mail archives, and online project documentation [13]. Figure 3 depicts the Hipikat data model. It comes with a text similarity matcher to infer links between the various data elements. Through these links the group memory can then be queried and navigated by developers to obtain recommendations on a task at hand. Another interesting feature is the incremental update of the group memory when new bug reports, file revisions, and mails are added.

Figure 3. The Hipikat data model integrating versioning and bug report data with mails and project documentation.

6.2 Applications of Software Repository Mining

The integrated data models, such as those provided by the RHDB and Hipikat, offer various application areas for analyzing different aspects of a software system related to software evolution. Kagdi et al. [21] list a number of MSR tasks that have been addressed by recent approaches: evolutionary couplings/patterns, change classification/representation, change comprehension, defect classification and analysis, source code differencing, origin analysis and refactoring, software reuse, development processes and communication, contribution analysis, and evolution metrics. Out of these application areas and tasks we focus on defect prediction and recommender systems, which we will detail in the following.
approaches has been presented by Bevan et al. with the ing.
Kenyon framework [4].
There have been several extensions to the RHDB ap-
Defect Prediction The main goal of defect prediction is
proach that take into account additional data sources. For
to calculate a model that when applied to the current sys-
example, Hipikat forms an implicit group memory that, in
tem tells the developers which modules will most likely be
addition to versioning and bug report data, stores data from
affected by a bug in the next release and/or how man bugs
5 [Link] that will be. The results of the prediction then can be used
6 [Link] to take preventive and corrective actions, such as refactor-
7 [Link] ing the highlighted software modules and increase testing
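The basic idea can be sketched in a few lines of Python: a toy nearest-neighbor predictor over invented module metrics, standing in for the real learners and validation procedures discussed in the text.

```python
# Toy training data from a past release: per module, the independent variables
# (lines_of_code, changes_in_past, age_in_releases) and the dependent variable
# (defects observed in the subsequent release). All values are invented.
history = [
    ((1200, 25, 1), 7),   # young, frequently changed module: many defects
    ((4500, 3, 9), 1),    # large but old and stable module: few defects
    ((800, 18, 2), 5),
    ((3000, 2, 8), 0),
]

def predict_defects(metrics):
    """Predict a module's defect count via its nearest neighbor in the history.

    Note: a real predictor would normalize the metrics first, since lines of
    code dominates this unscaled Euclidean distance.
    """
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    _, defects = min(history, key=lambda entry: distance(entry[0], metrics))
    return defects

# A young module with many recent changes is predicted to be defect-prone.
print(predict_defects((1150, 24, 1)))  # -> 7
```

Replacing the nearest-neighbor rule with one of the learners mentioned below (regression analysis, decision trees, naive Bayes, support vector machines) changes only the model, not this overall train-then-predict workflow.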

Figure 2. The Release History Database model for integrating versioning with bug report data. (The model relates Project, Module, Directory, FileHistory, FileVersion, BugReport, BugDescription, LongDescriptions, Author, and Alias entities; FileVersion and BugReport are linked via an association.)
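The post-processing step that links file revisions to bug reports can be illustrated with a small Python sketch. The log messages below are invented, the case-insensitive matching is an assumption, and the RHDB itself stores the links in a database rather than a dict.

```python
import re

# The pattern quoted in the text; re.IGNORECASE is an assumption here.
BUG_REF = re.compile(r"bugi?d?:?=?\s*#?\s*(\d\d\d+)(.*)", re.IGNORECASE)

def link_bugs(log_messages):
    """Map each file revision to the bug number found in its log message."""
    links = {}
    for revision, message in log_messages.items():
        match = BUG_REF.search(message)
        if match:
            links[revision] = int(match.group(1))  # link: revision -> bug report
    return links

# Hypothetical log messages as stored at each file revision.
messages = {
    "Parser.java 1.42": "Fixed bug #4711; added missing null check",
    "Lexer.java 1.17": "refactoring, no functional change",
}
print(link_bugs(messages))  # {'Parser.java 1.42': 4711}
```

Note that the pattern requires at least three digits, which filters out small numbers that are unlikely to be valid bug identifiers.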

Various machine learning techniques (binary and linear regression analysis, decision trees, naive Bayes, support vector machines, etc.) can be used to train and test prediction models. Basically, the resulting models establish relationships between a set of independent variables and a dependent variable. Typically, the number of defects per module is the dependent variable being predicted by the model. Product and process metrics, such as module size (e.g., lines of code) and complexity (e.g., McCabe), the number of changes and defects in the past, and the age of the module, are typically used as the independent variables.

A prediction model is first trained and tested with metric values obtained from past software releases. Ideally, the validation is then performed with metrics obtained from subsequent software releases. Often, however, it is performed with the same data, using standard validation techniques such as ten-fold cross-validation. Depending on the machine learning algorithm used, various performance measures are used to evaluate the predictive power of the obtained models, such as precision, recall, and area under the ROC curve (AUC). Models with high performance are assumed to perform well for predicting the defects of a future release.

Early approaches mostly favored product metrics, such as size and complexity metrics, for defect prediction. Soon process metrics were added and discussions arose as to which ones perform best. For example, Graves et al. explored the extent to which the change history can be used to predict defects [19]. They found that the number of changes and the age of modules outperform product measures such as lines of code. Their results were confirmed by a number of more recent studies, such as those presented by Nagappan et al. [34] and Moser et al. [32]. In contrast, Menzies et al. used static code attributes, such as program size and complexity metrics, to predict defects [30]. Their results showed predictors with a mean probability of detection of 71 percent and mean false alarm rates of 25 percent, which denote models of reasonable quality. These authors also made an important point: the choice of the learning method is far more important than which subset of the available data is used for learning. They further advise assessing defect prediction methods with multiple data sets and multiple learners. Another interesting discussion with valuable feedback on previous research in defect prediction is presented by Fenton and Neil [16]. They point out relevant issues, such as the unknown relationship between defects and failures, the use of multivariate approaches, and issues in statistical methodology and data quality. These issues should be dealt with when doing research in this direction.

Recommender Systems Another promising application of the integrated data models are recommender systems. Whenever a modification task is performed, the recommender system observes the behavior of the developers and, based on that context, provides a set of recommendations aiming at automating the change task and improving the quality of the software system.

For example, Hipikat is a tool that provides recommendations about project information a developer should consider during a modification task [13]. For that it mines a broad set of information sources, including source file revisions, bug reports, email archives, and online project documentation, and establishes the links between these information sources. When a new bug is reported, the developer can query for similar bug reports and obtain recommendations on linked source code modifications, email discussions, and the project documentation, which might aid in fixing the bug.
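A text similarity matcher behind such similar-bug-report queries can be sketched as a minimal bag-of-words cosine similarity in plain Python. The bug report texts are invented, and Hipikat itself uses a more elaborate matcher.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors (Counters)."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_similar(query, reports):
    """Recommend the past bug report whose text is most similar to the new one."""
    qvec = Counter(query.lower().split())
    return max(reports, key=lambda r: cosine(qvec, Counter(r.lower().split())))

past_reports = [
    "crash when saving file with unicode name",
    "toolbar icons rendered too small on high resolution displays",
]
new_report = "editor crash on save of file with unicode characters in name"
print(most_similar(new_report, past_reports))
```

Production matchers would additionally apply stemming, stop-word removal, and term weighting (e.g., tf-idf), but the ranking principle is the same.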

ROSE is an approach and tool that infers recommendations from source file revisions by using association rule mining [44]. Whenever a developer is performing a modification task, ROSE's recommendations guide him along related changes in the way: "Programmers who changed this function also changed these other functions." In addition to these suggestions, ROSE also uses the association rules to warn of incomplete changes. A similar approach, but on the level of source files and using frequent item set mining, has been presented by Ying et al. [42].

In addition to approaches that mine the project history, several approaches exist that mine source code repositories to recommend examples on how to use an API or to improve code completion. Code completion is a popular feature offered by modern integrated development environments that is extensively used by developers. Basically, it helps to prevent developers from writing non-compilable source code and speeds up programming by proposing program elements that are syntactically correct. Investigating current code completion implementations, Bruch et al. argued that the quality of the suggestions, and hence the productivity of software development, can be improved by employing intelligent code completion systems [6]. They evaluated three such code completion systems, of which the one using a modified version of the k-nearest-neighbor (kNN) machine learning algorithm performed best: 82% of the recommended method calls were actually needed by the programmer (recall) and 72% of the recommended method calls were relevant (precision).

Concerning the usage of an application programming interface (API), developers often encounter difficulties, such as which objects to instantiate, how to initialize them, and which methods to call. This problem has been addressed by a number of approaches, such as Strathcona [20]. Typically, they form a knowledge base by mining a set of framework usage examples. Given the current context of the developer, the tool queries the knowledge base and recommends code snippets of potential interest. However, it is still the task of the developer to integrate the appropriate code snippet into the source code.

7 Software Evolution in Research

Software evolution has received considerable attention over the past few years. If the reader is interested in consulting the body of knowledge in the area of software evolution, the conferences listed in Table 3 form a good starting point. Furthermore, journals like the IEEE Transactions on Software Engineering, ACM's Transactions on Software Engineering and Methodology, Wiley's Journal of Software Maintenance and Evolution, Springer's Empirical Software Engineering, Elsevier's Journal of Systems and Software, and Wiley's Software Practice and Experience have frequent contributions in the broader area of software evolution.

8 Future Challenges

Besides the main areas discussed (comprehension, reverse engineering, reengineering, and repository mining), a number of trends and application areas can be distinguished. The most important ones include the following.

Requirements Traceability The need for change is reflected in evolving requirements. To assess the impact of changing requirements, it is essential that source code and design decisions can be traced back to requirements. Unfortunately, maintaining accurate requirements traceability links has proven to be hard and costly in practice. Promising research directions aim at partially automating this process, for example through the use of information retrieval techniques such as latent semantic indexing [26]. By computing textual similarities between different work documents, candidate traceability links between, for example, code, test cases, and requirements can be computed. Turning the candidate links into confirmed links is still a manual process, but initial results are promising.

Service-Oriented Architectures Software portfolios of large enterprises can easily comprise hundreds of applications. The evolution of the application landscape of such a company over the years has typically resulted in a complex network of connected systems, in which system dependencies are often critical, but equally often poorly understood. In order to regain control over the system dependencies, many companies are investigating the use of service-oriented architectures. In this approach, systems are loosely coupled and communicate via an enterprise service bus. Service interfaces are made as stable as possible, but service implementations can be (dynamically) identified and modified.

While promising in many ways, the adoption of service orientation brings a number of challenges. These include migrating legacy components into services (for example via wrapping), the performance implications of the data conversion required to integrate components developed by different organizations, and systematic testing strategies for dealing with large collections of services not under direct control of the testing organization. This testing may require obtaining production data and connecting to services in production. Since replication of test data and production

8 Formerly IWPC, the International Workshop on Program Comprehension.

Conference acronym Conference name Year first organized
ICSM the International Conference on Software Maintenance 1983
ICPC the International Conference on Program Comprehension8 1993
WCRE the Working Conference on Reverse Engineering 1994
CSMR the European Conference on Software Maintenance and Reengineering 1997
IWPSE the International Workshop on the Principles of Software Evolution 1998
MSR the Working Conference on Mining Software Repositories 2004

Table 3. Conferences in the area of software evolution

services can be challenging, an interesting research area is to devise service infrastructures permitting (controlled) test execution in the production environment.

Collaborative Engineering Software development and evolution is a team activity. Much of the existing body of work in reverse engineering and program comprehension is focused on the individual developer. Methods and techniques have been proposed to visualize architectures, to identify features from execution traces, or to establish traceability links. The question arises how such tools can be used to collaboratively understand, for example, the implementation of a particular feature. Such tools should support the incremental growth of understanding, allowing different developers to record their intermediate knowledge and share it with their co-developers.

Globally Distributed Development More and more, software teams work at different locations spread across the globe. In addition to this physical distance, the team members come from different cultures and often work in different time zones. Many of the techniques developed to support evolution (tools for program comprehension, impact analysis, or configuration management) will also help teams counter the effects of these increased distances.

Software Testing Automated testing is an important evolution enabler. An automated test suite can be used for regression testing, to ensure that software modifications do not break existing functionality. Furthermore, continuous integration in combination with nightly execution of the test suite will help to identify problems caused by changes made to the software as early as possible.

In agile development methods, test suites are furthermore used to facilitate program comprehension. As an example, in the Ruby community software is described via "executable examples", either expressed in Ruby itself (when using rspec9) or in natural language which, via a simple pattern recognition mechanism, is translated into an executable test suite (when using Cucumber10).

This interplay between testing, test automation, and documentation comprises an interesting and highly promising route in further supporting software evolution.

9 [Link]
10 [Link]

References

[1] H. A. Basit and S. Jarzabek. Efficient token based clone detection with flexible tokenization. In Proceedings of the joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 513–516. ACM, 2007.
[2] K. Beck. Extreme Programming Explained: Embrace Change. Addison-Wesley, 1999.
[3] K. Bennett. Legacy systems: Coping with success. IEEE Software, 12(1):19–23, 1995.
[4] J. Bevan, E. J. Whitehead, Jr., S. Kim, and M. Godfrey. Facilitating software evolution research with Kenyon. In Proceedings of the joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 177–186. ACM, 2005.
[5] M. L. Brodie and M. Stonebraker. Migrating Legacy Systems: Gateways, Interfaces & The Incremental Approach. Morgan Kaufmann, 1995.
[6] M. Bruch, M. Monperrus, and M. Mezini. Learning from examples to improve code completion systems. In Proceedings of the joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 213–222. ACM, 2009.
[7] G. Canfora and M. Di Penta. New frontiers of reverse engineering. In L. C. Briand and A. L. Wolf, editors, International Conference on Software Engineering, ICSE 2007, Workshop on the Future of Software Engineering, FOSE 2007, pages 326–341, 2007.
[8] S. R. Chidamber and C. F. Kemerer. A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20(6):476–493, 1994.
[9] E. J. Chikofsky and J. H. Cross II. Reverse engineering and design recovery: A taxonomy. IEEE Software, 7(1):13–17, 1990.
[10] T. A. Corbi. Program understanding: Challenge for the 1990s. IBM Systems Journal, 28(2):294–306, 1989.
[11] B. Cornelissen, A. Zaidman, D. Holten, L. Moonen, A. van Deursen, and J. J. van Wijk. Execution trace analysis through massive sequence and circular bundle views. Journal of Systems and Software, 81(12):2252–2268, 2008.
[12] B. Cornelissen, A. Zaidman, A. van Deursen, L. Moonen, and R. Koschke. A systematic survey of program comprehension through dynamic analysis. IEEE Transactions on Software Engineering, 35(5):684–702, 2009.
[13] D. Cubranic, G. C. Murphy, J. Singer, and K. S. Booth. Hipikat: A project memory for software development. IEEE Transactions on Software Engineering, 31(6):446–465, 2005.
[14] S. Demeyer, S. Ducasse, and O. Nierstrasz. Object-Oriented Reengineering Patterns. Morgan Kaufmann, 2003.
[15] E. Duala-Ekoko and M. P. Robillard. CloneTracker: tool support for code clone management. In Proceedings of the 30th International Conference on Software Engineering (ICSE), pages 843–846. ACM, 2008.
[16] N. E. Fenton and M. Neil. A critique of software defect prediction models. IEEE Transactions on Software Engineering, 25(5):675–689, 1999.
[17] M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. In Proceedings of the International Conference on Software Maintenance (ICSM), pages 23–32. IEEE Computer Society, 2003.
[18] M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley, 1999.
[19] T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. Predicting fault incidence using software change history. IEEE Transactions on Software Engineering, 26(7):653–661, 2000.
[20] R. Holmes, R. J. Walker, and G. C. Murphy. Approximate structural context matching: An approach to recommend relevant examples. IEEE Transactions on Software Engineering, 32(12):952–970, 2006.
[21] H. Kagdi, M. L. Collard, and J. I. Maletic. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of Software Maintenance and Evolution: Research and Practice, 19(2):77–131, 2007.
[22] C. Kapser and M. W. Godfrey. "Cloning considered harmful" considered harmful. In Proceedings of the 13th Working Conference on Reverse Engineering (WCRE), pages 19–28. IEEE Computer Society, 2006.
[23] R. Koschke. Identifying and removing software clones. In Mens and Demeyer [29], pages 15–36.
[24] M. Lanza and R. Marinescu. Object-Oriented Metrics in Practice – Using Software Metrics to Characterize, Evaluate, and Improve the Design of Object-Oriented Systems. Springer, 2006.
[25] M. M. Lehman and L. A. Belady. Program Evolution: Processes of Software Change. APIC Studies in Data Processing. Academic Press, 1985.
[26] M. Lormans, A. van Deursen, and H.-G. Gross. An industrial case study in reconstructing requirements views. Empirical Software Engineering, 13:727–760, 2008.
[27] T. J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308–320, 1976.
[28] T. Mens. Introduction and roadmap: History and challenges of software evolution. In Mens and Demeyer [29], pages 1–11.
[29] T. Mens and S. Demeyer, editors. Software Evolution. Springer, 2008.
[30] T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1):2–13, 2007.
[31] L. Moonen, A. van Deursen, A. Zaidman, and M. Bruntink. On the interplay between software testing and evolution and its effect on program comprehension. In Mens and Demeyer [29], pages 173–202.
[32] R. Moser, W. Pedrycz, and G. Succi. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proceedings of the International Conference on Software Engineering (ICSE), pages 181–190. ACM, 2008.
[33] H. A. Müller, S. R. Tilley, and K. Wong. Understanding software systems using reverse engineering technology perspectives from the Rigi project. In Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research (CASCON), pages 217–226. IBM Press, 1993.
[34] N. Nagappan and T. Ball. Use of relative code churn measures to predict system defect density. In Proceedings of the International Conference on Software Engineering (ICSE), pages 284–292. ACM, 2005.
[35] N. Nagappan, A. Zeller, and T. Zimmermann. Guest editors' introduction: Mining software archives. IEEE Software, 26(1):24–25, 2009.
[36] D. L. Parnas. Software aging. In Proceedings of the International Conference on Software Engineering (ICSE), pages 279–287. IEEE Computer Society Press, 1994.
[37] N. Pennington. Stimulus structures and mental representations in expert comprehension of computer programs. Cognitive Psychology, 19(3), 1987.
[38] W. W. Royce. Managing the development of large software systems: concepts and techniques. In Proc. IEEE WESCON. IEEE Computer Society Press, August 1970. Reprinted in Proc. ICSE 1989, ACM Press, pp. 328–338.
[39] A. von Mayrhauser and A. M. Vans. Program comprehension during software maintenance and evolution. IEEE Computer, 28(8):44–55, August 1995.
[40] M. Weiser. Program slicing. In Proceedings of the 5th International Conference on Software Engineering (ICSE), pages 439–449. IEEE Press, 1981.
[41] S. S. Yau, J. S. Collofello, and T. MacGregor. Ripple effect analysis of software maintenance. In Proc. COMPSAC, pages 60–65. IEEE Computer Society Press, 1978.
[42] A. T. T. Ying, G. C. Murphy, R. Ng, and M. C. Chu-Carroll. Predicting source code changes by mining change history. IEEE Transactions on Software Engineering, 30(9):574–586, 2004.
[43] A. Zaidman and S. Demeyer. Automatic identification of key classes in a software system using webmining techniques. Journal of Software Maintenance and Evolution: Research and Practice, 20(6):387–417, 2008.
[44] T. Zimmermann, P. Weissgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. IEEE Transactions on Software Engineering, 31(6):429–445, 2005.
