Architecture-Based Performance Analysis Applied to a Telecommunication System

Abstract. The effects of architectural decisions can be evaluated at an early stage by constructing and analyzing
quantitative performance models, which capture the interactions between the main components
of the system as well as the performance attributes of the components themselves. The paper proposes a systematic approach for building performance models from a UML description of the high-level architecture of a system, and more exactly
from the architectural patterns used for the system. The performance model structure retains a
clear relationship with the system architecture, which simplifies the task of converting
performance analysis results into conclusions and recommendations related to the software
architecture. In the second part of the paper, the proposed approach is applied to a
telecommunication product for which a LQN model is built and analyzed. The analysis shows how the bottleneck moves between different hardware and software resources under different loads and configurations, and exposes some weaknesses in the original software architecture, which prevent the system from using the available processing power at full capacity.
1. Introduction
Performance characteristics (such as response time and throughput) are an integral part of the
quality attributes of a software system. There is a growing body of research that studies the role
of software architecture in determining different quality characteristics in general [12], [1], and
performance characteristics in particular [15], [16]. Architectural decisions are made very early in the software development process, therefore it would be helpful to be able to assess their effect on performance as early as possible. This paper contributes toward bridging the gap between software architecture and early performance analysis. It proposes a systematic approach for building performance models from the high-level software architecture of a system, which describes the main system components
and their interactions. The architectural descriptions on which the construction of a performance
model is based must capture certain issues relevant to performance, such as concurrency and
parallelism, contention for software resources (as, for example, for software servers or critical sections). Frequently used architectural solutions are identified in literature as architectural patterns (such as client/server, pipeline and filters, etc.), which define the participating components and their roles, and help our understanding of complex systems. The paper proposes a systematic approach for translating each architectural pattern used
in a system into a performance sub-model. The advantage of using patterns is that they are
already identified and catalogued, so we can build a library of transformation rules for
converting patterns to performance models. If, however, not all components and interactions of
a high-level architecture are covered by previously identified architectural patterns, we can still
describe the remaining interactions as UML mechanisms [2] and proceed by defining ad-hoc transformations for them.
The formalism used for building performance models is the Layered Queueing Network (LQN)
model [11, 17, 18], an extension of the well-known Queueing Network model. LQN was
developed especially for modelling concurrent and/or distributed software systems. Some LQN
components represent software processes, others hardware devices. One of the most interesting
performance characteristics of such systems is that a software process may play a dual role,
acting both as a client to some processes/devices, and as a server to others (see section 3 for a
more detailed description). Since a software server may have many clients, important queueing
delays may arise for it. The server may become a software bottleneck, thus limiting the potential
performance of the system. This can occur even if the devices used by the process are not fully
utilized. The analysis of an LQN model produces results such as response time, throughput,
queueing delays and utilization of different software and hardware components, and indicates
which components are the system bottleneck(s). By understanding the cause for performance
limitations, the developers will be able to concentrate on the system’s “trouble spots” in order to
eliminate or mitigate the bottlenecks. The analysis of LQN models for various alternatives will
help in choosing the “right” changes, so that the system will eventually meet its performance
requirements.
Software Performance Engineering (SPE) is a technique introduced in [14] that proposes to use
quantitative methods and performance models in order to assess the performance effects of
different design and implementation alternatives, from the earliest stages of software
development throughout the whole lifecycle. LQN modelling is very appropriate for such a use,
due to the fact that the model structure can be derived systematically from the high-level architecture
of the system, as proposed in this paper. Since the high-level architecture is decided early in the
development process and does not change frequently afterwards, the structure of the LQN model
is also quite stable. However, the LQN model parameters (such as execution times of the high-level components) depend on lower-level design and implementation decisions. In the early development stages, the parameter values are estimated based on previous experience with similar components, on known platform overheads (such as system call execution times) and on time
budgets allocated to different components. As the development progresses and more components
are implemented and measured, the model parameters become more accurate, and so do the
results. In [14] it is shown that early performance modelling has definite advantages, despite its
inaccurate results, especially when the model and its parameters are continuously refined throughout the lifecycle.
LQN was applied to a number of concrete industrial systems (such as database applications [5],
web servers [7], telecommunication systems, etc.) and was proven to be useful for providing
insights into performance limitations at software and hardware levels, for suggesting
performance improvements in different development stages, for system sizing and for capacity
planning. In this paper, LQN is applied to a real telecommunication system. Although the
structure of the LQN model was derived from the high-level architecture of the system, which
was chosen in the early development stage, we used model parameters obtained from prototype
measurements, which are more accurate than the estimated values available in pre-
implementation phases. The reason is that we became involved with the project when the system
was undergoing performance tuning, and so we used the best data available to analyze the high-
level architecture of the system (which was unchanged from the early design stages). We found
some weaknesses in the original architecture due to excessive serialization, and used the LQN model to assess architectural improvements.
The paper proceeds as follows: section 2 discusses architectural patterns and the UML notation
[2] used to represent them. Section 3 gives a brief description of the LQN model. Section 4
proposes transformations of the architectural patterns into LQN sub-models. Section 5 presents
the telecommunication system case study and its LQN model. Section 6 analyzes the LQN model
under various loads and configurations, shows how the bottleneck moves around the system and
proposes improvements to the system. Section 7 gives the conclusions of the work.
2. Architectural Patterns
According to [1], a software architecture represents a collection of computational components
that perform certain functions, together with a collection of connectors that describe the interactions between components. A component type is defined by a specification of its functions, and by a set of ports representing logical points of interaction between the component
and its environment. A connector type is defined as a set of roles explaining the expected
behaviour of the interacting parties, and a glue specification showing how the interactions are
coordinated.
A similar, even though less formal, view of a software architecture is described in the form of
architectural patterns [3, 13], which identify frequently used architectural solutions, such as client/server, pipeline and filters, etc. An architectural pattern describes two inter-related aspects: its structure (what are the components)
and behaviour (how they interact). In the case of high-level architectural patterns, the
components are usually concurrent entities that execute in different threads of control, compete for resources, and interact in a prescribed manner which may require some kind of synchronization. These are aspects that contribute to the performance characteristics of the system.

[Figure 1. Structural and behavioural views of the collaboration PIPELINE WITH MESSAGE]

[Figure 2. Structural and behavioural views of the collaboration PIPELINE WITH BUFFER]
The paper proposes to use high-level architectural patterns as a basis for translating software
architecture into performance models. A subset of such patterns, which are later used in the case
study, is described in the paper in the form of UML collaborations (not to be confused with
UML collaboration diagrams, a type of interaction diagrams). According to the authors of UML,
a collaboration is "a society of classes, interfaces, and other elements that work together to provide some cooperative
behaviour that is bigger than the sum of all of its parts” [2]. A collaboration has two aspects:
structural and behavioural. Fig. 1 and 2 illustrate these aspects for two alternatives of the pipeline
and filters pattern. Each figure contains on the left a UML class/object diagram describing the
pattern structure, and on the right a UML sequence diagram illustrating the pattern behaviour. A
brief explanation of the UML notation used in the paper is given below (see [2] for more details).
The notation for a class or object is a rectangle indicating the class/object name (the name is
underlined for objects); the rectangle may contain optionally a section for the class/object
operations, and another one for its attributes. The multiplicity of the class/object is represented in
the upper right corner. A rectangle with thick lines represents an active class/object, which has
its own thread of control, whereas a rectangle with thin lines represents a passive one. The constraint {sequential} attached to the operations of a passive object, as in Fig. 2, indicates that the callers must coordinate outside the passive object
(for example, by the means of a semaphore) so that only one calls the passive object’s operations
at any given time. The UML symbol for collaboration is an ellipse with a dashed line that may
have an “embedded” rectangle showing template classes. The collaboration symbol is connected
with the classes/objects with dashed lines, whose labels indicate the roles played by each
component. A line connecting two objects, named link, represents a relationship between the two
objects which interact by exchanging messages. Depending on the kind of interacting objects
(passive or active), UML messages may represent either operation calls, or actual messages sent
between different flows of control. Links between objects may be optionally annotated with
arrows showing the name and type of messages exchanged. For example, in Fig.1 an arrow with
a half arrowhead between the active objects filter1 and filter2 represents an asynchronous message,
whereas in Fig.2 the arrows with filled solid arrowheads labeled “write()” and “read()” represent
synchronous messages implemented as calls to the operations indicated by the label. When
relevant, the “object flow” carried by a message is represented by a little arrow with a circle (as
in Fig.2), while the message itself is an arrow without circle. A synchronous message implies a
reply, therefore it can carry objects in both directions. For example, in Fig.2, the object flow
carried by the message read() goes in the reverse direction than the message itself.
A sequence diagram, such as on the right side of Fig.1 and 2, shows the messages exchanged
between a set of objects in chronological order. The objects are arranged along the horizontal
axis, and the time grows along the vertical axis, from top to bottom. Each object has a lifeline
running parallel with the time axis. On the lifeline one can indicate the period of time during
which an object is performing an action as a tall thin rectangle called “focus of control”, or the
state of the object as a rectangle with rounded corners called "state mark". The messages exchanged between objects are drawn as horizontal directed lines. An object can also send a message to itself, which means that one of its operations invokes another operation of the same object.
Architectures using the pipeline and filters pattern divide the overall processing task into a
number of sequential steps which are implemented as filters, while the data between filters flows
through unidirectional pipes. We are interested here in active filters [3] that are running
concurrently. Each filter is implemented as a process or thread that loops through the following
steps: “pulls” the data (if any) from the preceding pipe, processes it, then “pushes” the results
down the pipeline. The way in which the push and pull operations are implemented may have
an impact on performance. In Fig. 1, the filters communicate directly through asynchronous messages: a
filter “pulls” an item by accepting the message sent by the previous filter, processes the item by
invoking its own operation proc_item(), passes the data on to the next filter by sending an
asynchronous message, after which goes into a waiting state for the next item. In Fig.2, the filters
communicate through a shared buffer (one pushes by writing to the buffer, and the other pulls by
reading it). Whereas the filters are active objects with a multiplicity of one or higher, the buffer
itself is a passive object that offers two operations, read() and write(), which must be used one at a time.
When defining the transformations from architectural patterns into LQN sub-models, we use
both the structural and the behavioural aspect of the respective collaborations. The structural part
is used directly, in the sense that each software component has counterpart(s) in the structure of
the LQN model (the mapping is not bijective). However, the behavioural part is used indirectly,
in the sense that it is matched by the behaviour of the LQN model, but is not represented
graphically.
3. LQN Model
LQN was developed as an extension of the well-known Queueing Network (QN) model, at first
independently in [17, 18] and [11], then as a joint effort [4]. The LQN toolset presented in [4]
includes both simulation and analytical solvers that merge the best previous approaches. The
main difference of LQN with respect to QN is that a server, to which customer requests are
arriving and queueing for service, may become a client to other servers from which it requires
nested services while serving its own clients. An LQN model is represented as an acyclic graph
whose nodes (named also tasks) are software entities and hardware devices, and whose arcs
denote service requests (see Fig.3). The software entities are drawn as parallelograms, and the
hardware devices as circles. The nodes with outgoing and no incoming arcs play the role of pure
clients. The intermediate nodes, with both incoming and outgoing arcs, act as servers to some nodes and as clients to others. The leaf nodes are pure servers, and usually represent hardware servers (such as processors, I/O devices, etc.). A software or hardware server node can be either a single-server or a multi-server (composed of more than one identical servers that work in parallel and share the same request queue).

[Figure 3. Simple LQN model]

A LQN task may offer more than one kind of service, each modelled by a
so-called entry, drawn as a parallelogram “slice”. An entry has its own execution time and
demands for other services (given as model parameters). Although not explicitly illustrated in
the LQN notation, each server has an implicit message queue, where the incoming requests are
waiting their turn to be served. Servers with more than one entry still have a single input queue,
where requests for different entries wait together. The default scheduling policy of the queue is
FIFO, but other policies are also supported. Fig. 3 shows a simple example of an LQN model for
a three-tiered client/server system: at the top there are two client classes, each with a known
number of stochastically identical clients. Each client sends requests for a specific service offered by
a task named Application, which represents the business layer of the system. Each Application
entry requires services from two different entries of the Database task, which offers in total three
kinds of services. Every software task is running on a processor node, drawn as a circle; in the
example, all clients of the same class share a processor, whereas Application and Database share
another processor. Database also uses two disk devices, as shown in Fig. 3. It is worth mentioning
that the word layered in the name of LQN does not imply a strict layering of the tasks (for
example, a task may call other tasks in the same layer, or skip over layers).
There are three types of LQN messages, synchronous, asynchronous and forwarding, whose
effect is illustrated in Fig.4. A synchronous message represents a request for service sent by a
client to a server, where the client remains blocked until it receives a reply from the provider of
service (see Fig.4.a). If the server is busy when a request arrives, the request is queued and waits
its turn.

[Figure 4. LQN message types: a) synchronous message; b) asynchronous message; c) forwarding message]

After accepting a message, the server starts to serve it by executing a sequence of phases
(one or more). At the end of phase 1, the server replies to the client, which is unblocked and
continues its work. The server continues with the following phases, if any, working in parallel
with the client, until the completion of the last phase. (In Fig.4.a, a case with three phases is
shown). After finishing the last phase, the server begins to serve a new request from the queue,
or becomes idle if the queue is empty. During any phase, the server may act as a client to other
servers, asking for and receiving so-called “included services”. In the case of an asynchronous
message, the client does not block after sending the message, and the server does not reply back,
only executes its phases, as shown in Fig.4b. The third type of LQN message, named forwarding
message (represented as a dotted request arc) is associated with a synchronous request that is
served by a chain of servers, as illustrated in Fig. 4.c. The client sends a synchronous request to
Server1, which begins to process the request, then forwards it to Server2 at the end of phase1.
Server1 proceeds normally with the remaining phases in parallel with Server2, then goes on to
another cycle. The client, however, remains blocked until the forwarded request is served by
Server2, which replies to the client at the end of its phase 1. A forwarding chain can contain any
number of servers, in which case the client waits until it receives a reply from the last server in
the chain.

The parameters of an LQN model are as follows:
· for each software task entry: average execution time per phase;
· for each software task entry seen as a client to a device (i.e., for each request arc from a task entry to a device): average service time at the device, and average number of visits per phase;
· for each software task entry seen as a client to another task entry (i.e., for each request arc from a task entry to another task entry): average number of visits per phase.
Typical results of an LQN model are response times, throughput, utilization of servers on behalf
of different types of requests, and queueing delays. The LQN results may be used to identify the
software and/or hardware bottlenecks that limit the system performance under different
workloads and configurations. Understanding the cause for performance limitations helps the developers decide what to change in the system.

4. Transformations of Architectural Patterns into LQN Sub-Models

The high-level architecture of a system contains a number of connection instances (each described by a pattern/collaboration), and a component may play different roles
in connections of various types. The transformation of the architecture into a performance model
is done in a systematic way, pattern by pattern. As expected, the performance of the system
depends on the performance attributes of its components and on their interactions (as described by the patterns). The performance attributes are not part of the architectural description itself, and must be supplied by the user as additional information. They describe the demands for hardware resources: CPU time demands for each software component on behalf of different types of system requests,
demands for other resources such as I/O devices, communication networks, etc. We will specify
more clearly what kind of performance attributes must be provided for each
pattern/collaboration. The transformations from the architectural patterns to LQN sub-models are described below.
LQN model for Pipeline and Filters. Fig. 5 and 6 show the transformation into LQN
submodels of the two Pipeline and Filters collaborations described in Fig.1 and 2, respectively.
The translation takes into account, on one side, the structural and behavioural information
provided by the UML collaboration, and on the other side the allocation of software components
to processors, which will lead to different LQN submodels for the same pattern (see Fig. 6).

[Figure 5. Transformation of the PIPELINE WITH MESSAGE collaboration into an LQN sub-model]

[Figure 6. Transformation of the PIPELINE WITH BUFFER collaboration into LQN sub-models: a) all filters running on the same processor node; b) the filters running on different processor nodes]

The transformation rules are as follows:
a) Each active filter from Fig.5 and 6 becomes an LQN software server with a single entry,
whose service time includes the processing time of the filter. Each filter task will receive an asynchronous message (described in Fig. 4.b) and will execute its phases in response to it. A typical distribution of the work into phases is to receive the message in phase 1, then to process it and send the results downstream in phase 2.
b) The allocation of LQN tasks to processors mimics the real system. The way the filters are
allocated on the same or on different processor nodes does not make any difference for the
pipeline with message (which is why the processors are not represented in Fig. 5), but it does matter for the pipeline with buffer, as explained below.
c) In the case of a pipeline with message (Fig.5) all aspects related to the pipeline connector
between the two filters are completely modelled by the LQN asynchronous message. The CPU
times for send and receive system calls are added to the execution times of the phases in which
the respective operations take place. If we want to model a network delay for the message, it can be represented by an additional delay element inserted between the two filter tasks.
d) In the case of a pipeline with buffer (Fig.6), an asynchronous LQN arc is necessary, but is not
sufficient to model all the aspects concerning the pipeline connector. Additional LQN elements
are required to take into account the serialization delay introduced by the constraint that buffer
operations must be mutually exclusive. A third task that plays the role of semaphore will enforce
this constraint, due to the fact that any task serializes the execution of its entries. The task has as
many entries as the number of critical sections executed by the filters accessing the buffer (two
in this case, “write” and “read”). Since the execution of each buffer operation takes place in the
thread of control of the filter initiating the operation, the allocation of filters to processors
matters. If both filters are running on the same processor node (which may have more than one
processor) as in Fig.6.a, then the read and write operations will be executed on the same processor
node. Thus, they can be modelled as entries of the semaphore task that is, obviously, co-allocated
with the filters. If, however, the filters are running on different processor nodes, as in Fig.6.b, the
mutual-exclusive operations read and write will be executed on different processor nodes, so they
cannot be modelled as entries of the same task. (In LQN, all entries of a task are executed on the
same processor node). The solution is shown in Fig.6.b: we keep the semaphore task for
enforcing the mutual exclusion, but its entries are only used to delegate the work to two new
tasks, each one responsible for a buffer operation. Each new task is allocated on the same processor node as the filter on whose behalf it executes the operation.
The LQN models for both pipeline and filters collaborations from Fig. 5 and 6 can be generated
with a forwarding message (as in Fig.4.c) instead of an asynchronous one (as in Fig.4.b), if the
source of requests for the first filter in a multi-filter architecture is closed instead of open. A
closed source is composed of a set of client tasks sending synchronous requests to the first filter,
and waiting for a reply from the last filter. Since a LQN task may send a forwarding message
exactly at the end of its phase 1, all the work done by a filter task must take place in the first
phase.
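The allocation-dependent rule for the pipeline with buffer can be summarized in code. The sketch below is our own illustration (the Task class and all names are invented for the example, not part of any LQN tool): it emits a semaphore task with read/write entries when the filters are co-located (Fig. 6.a), and the semaphore-plus-delegate-tasks structure when they are not (Fig. 6.b).

from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    processor: str
    entries: list = field(default_factory=list)

def pipeline_with_buffer(up, down, proc_up, proc_down):
    """Return the LQN tasks modelling two filters joined by a shared buffer."""
    tasks = [Task(up, proc_up, ["push"]), Task(down, proc_down, ["pull"])]
    if proc_up == proc_down:
        # Fig. 6.a: the buffer operations become entries of a single
        # semaphore task, co-allocated with the filters; the task
        # serializes its own entries, enforcing mutual exclusion.
        tasks.append(Task("semaphore", proc_up, ["write", "read"]))
    else:
        # Fig. 6.b: the semaphore task only enforces mutual exclusion and
        # delegates each operation to a new task running on the node of the
        # filter that executes that operation. (Which node hosts the
        # semaphore task itself is an assumption of this sketch.)
        tasks.append(Task("semaphore", proc_up, ["lock_write", "lock_read"]))
        tasks.append(Task("buffer_write", proc_up, ["write"]))
        tasks.append(Task("buffer_read", proc_down, ["read"]))
    return tasks

for t in pipeline_with_buffer("IOin", "RequestHandler", "procA", "procB"):
    print(t)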
LQN model for Client/Server. Fig. 7 illustrates a case where the client communicates directly with the server through synchronous
requests (described in Fig.4.a). A server may offer a wide range of services (represented in the
architectural view as the server’s class methods) each one with its own performance attributes.
From a performance modelling point of view it is important not only to identify these services, but also to find out who is invoking them and how frequently.

[Figure 7. Transformation of the CLIENT SERVER collaboration into an LQN sub-model]

The UML class diagram contains
a single association between a client and a server, no matter how many server methods the client
may invoke. Therefore, in addition to the line that represents the client-server association, we indicate the messages sent by the client to the server (a notation used mostly in collaboration diagrams), showing all the services a client will invoke at one time or another.
There are other ways in which client/server connections may be realized, which are not
described in the paper as they do not apply to our case study. A well-known example is the use of a broker mediating between clients and servers.
LQN was originally created to model client/server systems, so the transformation from the client-
server pattern to LQN is quite straightforward. An LQN server may offer a range of services
(object methods in the architectural view), each with its own CPU time and number of visits to
other servers (these are performance attributes that must be provided). Each service is modelled
as an entry of the server task, as shown in Figure 7, and will contribute differently to the
response time, utilization, and throughput of the server. A client may invoke more than one of
these services at different times. The performance attributes for the clients include their average
CPU time demands, and the average number of calls for each entry of the server. As in the
pipeline connection case, the CPU time required to execute the system call for send/receive/reply
are added to the service times of the corresponding entries. The allocation of tasks to processors
is not shown in Fig. 7, because the transformation does not depend on it. Each LQN task is allocated to a processor that mimics the allocation in the real system.
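Since each entry contributes its own demand, a quick way to see whether a multi-entry server is a bottleneck candidate is the utilization law U = X · S summed over the entries. The snippet below is purely illustrative (our example, not measured data from the case study):

# Utilization-law sketch for a multi-entry LQN server; each service offered
# by the server is an entry with its own throughput and CPU demand.
entries = {
    "service1": {"throughput": 40.0, "service_time": 0.010},  # req/s, s/req
    "service2": {"throughput": 25.0, "service_time": 0.020},
}

utilization = sum(e["throughput"] * e["service_time"] for e in entries.values())
print(f"server utilization: {utilization:.2f}")   # 0.40 + 0.50 = 0.90
if utilization >= 0.9:
    print("near saturation: a software bottleneck candidate")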
Critical section collaboration. Fig. 8 shows the critical section, which is not usually catalogued among the architectural patterns, but is very frequently used. It describes the case where two or more active
objects share the same passive object. The constraint {sequential} attached to the methods of the
shared object indicates that the callers must coordinate outside the shared object (for example, by the means of a semaphore). Such serialization introduces performance delays, and must be represented in the LQN model. For simplicity reasons, Fig. 8 illustrates a case where each user invokes only one method of the shared object, but this can be easily generalized.
The transformation of the critical section collaboration produces either the model given in Fig.
8.a or in Fig. 8.b, depending on the allocation of user processes to processor nodes (similar to the
pipeline with buffer case). The premise is that the shared object operations are mutually
exclusive, that an LQN task cannot change its processor node, and that all the entries of a task
are executed on the task’s processor node. In the case where all users are running on the same
processor node, the shared object operations can be modelled as entries of a task that plays the
role of semaphore (see Fig.8.a), which is running on the same processor node as the users. The
generalization for allowing a user to call a subset of operations (entries) is straightforward: the user task simply sends requests to every entry it invokes.

[Figure 8. Transformation of the CRITICAL SECTION collaboration into LQN sub-models: a) all users running on the same processor node; b) users running on different processor nodes]
If the users are running on different processor nodes as in Fig. 8.b, then the shared object
operations (i.e., critical sections) are executed by different threads of control corresponding to
different users that are running on different processors. Therefore each operation is modelled as
an entry of a new task responsible for that operation that is running on its user’s node. (If a user
is to call more shared operations, its new associated task will have an entry for every such
operation. This means that an operation called by more than one user will be represented by
more than one entry.) However, these new tasks must be prevented from running simultaneously,
so a semaphore task, with one entry for each user, is used to enforce the mutual exclusion. An entry of the semaphore task delegates the work to the entries modelling the required operations. The performance attributes to be provided must specify, for each user, the average execution times outside and inside the critical section separately.
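As a toy illustration of the delay such a pattern introduces (our own example, unrelated to the case study), the Python sketch below runs several user threads that each do 2 ms of private work plus 1 ms inside a shared critical section; the lock plays the role of the semaphore task:

import threading, time

LOCK = threading.Lock()

def user(results, i):
    t0 = time.perf_counter()
    time.sleep(0.002)          # work outside the critical section
    with LOCK:                 # the "semaphore task" of the LQN sub-model
        time.sleep(0.001)      # work inside the critical section
    results[i] = time.perf_counter() - t0

N = 8
results = [0.0] * N
threads = [threading.Thread(target=user, args=(results, i)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print("response times (s):", sorted(round(r, 4) for r in results))
# Every user performs the same 3 ms of work, yet the later a user acquires
# the lock, the longer its response time: serialization, not CPU demand,
# dominates.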
Co-allocation collaboration. Fig. 9 shows the so-called co-allocation collaboration, where two
active objects are contained in a third active object, and are constrained to execute only one at a
time. The container object may be implemented as a process. This is an example of architectural
connection from our case-study system, which is not necessarily an architectural pattern, but is
quite frequently used. The most obvious solution, modelling the two contained objects as entries of the same task, presents a disadvantage: it cannot represent the case where each of the two
contained objects has its own request queue. (An LQN task has a unique message queue, where
requests for all entries are waiting together). One reason for which we may need separate queues
is to avoid cyclic graphs, which could not be accepted by the LQN solver used for this paper.
The solution presented in Fig.9 represents each contained active object as a separate “dummy”
task that delegates all the work to an entry of the container task, which serializes all its entries.
The dummy tasks are allocated on a dummy processor (so as not to interfere with the scheduling of the real processor nodes).

[Figure 9. Transformation of the COALLOCATION collaboration into an LQN sub-model]
5. LQN Model of a Telecommunication System
We conducted performance modelling and analysis of an existing telecommunication system
which is responsible for developing, provisioning and maintaining various intelligent network
services, as well as for accepting and processing real-time requests for these services. According
to the Software Performance Engineering methodology [14], we first identified the critical
scenarios with the most stringent performance constraints (which correspond in this case to real-
time processing of service requests). Next we identified the software components involved in,
and the architectural patterns exercised by the execution of the chosen scenarios (see Fig.10).
[Figure 10. Software components and architectural patterns/collaborations involved in the real-time scenario]
The real time scenario we have modelled starts from the moment a request arrives to the system
and ends after the service was completely processed and a reply was sent back. As shown in Fig.
10, a request is passed through several filters of a pipeline: from Stack process to IO process to
RequestHandler and all the way back. The main processing is done by the RequestHandler (as it
can be seen from Fig.12 and Table 1), which accesses a real-time database to fetch an execution
"script" for the desired service, then executes the steps of the script accordingly. The script may
vary in size and types of operations involved, and hence the workload varies largely from one
type of service to another (by one or two orders of magnitude). Based on experience and
intuition, the designers decided from the beginning to allow for multiple replications of the
RequestHandler process in order to speed up the system. Two shared objects, ShMem1 and
ShMem2, are used by the multiple RequestHandler replications. The system was intended to run on a multiprocessor whose scheduling is such that any process can run on any free processor (i.e., the processors were not
dedicated to specific tasks). Therefore, the processor node was modelled as a multi-server. By
applying systematically the transformation rules described in the previous section to the architectural patterns/collaborations used in the system, as shown in Fig. 10, the LQN model of Fig. 11 was obtained.

[Figure 11. LQN model of the telecommunication system]
The next step was to determine the LQN model parameters (average service time for each entry,
and average number of visits for each request arc) and to validate the model. We made use of measurements obtained with Quantify [19] and with the Unix utility top. The measurements with Quantify
were obtained at very low arrival rates of around a couple of requests/second. Quantify is a
profiling tool which uses data from the compiler and run-time information to determine the user
and kernel execution times for test cases chosen by the user. Since we wanted to measure
average execution times for different software components on behalf of a system request (see
Table 1 in the Appendix), we have measured the execution of 2000 requests repeated in a loop,
then computed the average per request. Although we have not computed confidence intervals on
the measurements, repeated experiments were in close agreement. The top utility provided us
with utilization figures for very high loads of hundreds of requests/second, close to the actual
operating point.

[Figure 12. Distribution of the total demand for CPU time per request over different software components]

These measurements were done on a prototype in the lab for two different
hardware configurations, with one and four processors. Again, repeated measurements were in
close agreement. We have used the execution times measured with Quantify (given in Table 1 in
the Appendix) to determine the model parameters, and the utilization results from top to validate
our model. The utilization values obtained by solving the model were within 5% of the measured
values. Unfortunately, a more rigorous validation was hindered by the lack of response time
measurements.
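The validation step itself amounts to a simple comparison; the sketch below uses invented utilization numbers, since the raw measurements are not reproduced here:

# Sketch of the validation step: compare utilizations predicted by the LQN
# model against utilizations measured with top at high load. The figures
# are placeholders, not the case-study data.
measured_util = {"proc1": 0.82, "proc2": 0.64}    # from top
model_util    = {"proc1": 0.80, "proc2": 0.66}    # from the LQN solver

for node, measured in measured_util.items():
    err = abs(model_util[node] - measured) / measured
    status = "OK" if err <= 0.05 else "REVISIT MODEL"
    print(f"{node}: relative error {err:.1%} -> {status}")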
6. Performance Analysis of the System

All the LQN results used in this section were obtained by simulation. The reason is that one of the system features,
namely the scheduling policy by polling used for the RequestHandler multi-server, could not be
handled by the analytical solver. All the simulation results were obtained with a confidence interval.

[Figure 13. Maximum achievable throughput (requests/second) for configurations with 1, 4 and 6 processors and (n-1), n, (n+1) and (n+2) RequestHandler replications]
The first question of interest to developers was to find the "best" hardware and software
configuration that can achieve the desired throughput for a given mix of services. By configuration we mean here the number of processors in the system and the number of RequestHandler software replications. We tried to answer this
question by exploring a range of configurations for a given service mix, determining for each the
highest achievable throughput (as in Fig. 13). Then the configurations with a maximum
throughput lower than the required values are discarded. The cheapest configuration that can ensure satisfactory throughput and response time at an operating point below saturation will be chosen. Solving the LQN model is a more efficient way to explore different configurations under a wide range of workloads than measuring the real system in all these cases.
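The exploration loop itself is simple enough to sketch. In the code below, solve_lqn is a fabricated stand-in for an LQN solver run (its saturating throughput curve and all numbers are invented for illustration, not taken from the case study):

def solve_lqn(processors: int, rh_replicas: int) -> float:
    """Hypothetical stand-in: maximum achievable throughput (requests/s)."""
    software_limit = 220.0 * rh_replicas      # software parallelism limit
    hardware_limit = 260.0 * processors       # processing-capacity limit
    return min(software_limit, hardware_limit)

REQUIRED = 900.0                              # required throughput (req/s)
candidates = []
for processors in (1, 4, 6):
    # mirror Fig. 13: (n-1), n, (n+1), (n+2) RequestHandler replications
    for rh_replicas in range(processors - 1, processors + 3):
        if rh_replicas < 1:
            continue
        tput = solve_lqn(processors, rh_replicas)
        if tput >= REQUIRED:                  # discard configs below target
            candidates.append((processors, rh_replicas, tput))

# Choose the "cheapest" adequate configuration (fewest processors first).
best = min(candidates, key=lambda c: (c[0], c[1]))
print("chosen configuration:", best)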
Although we have modelled the system for two classes of services, we selected to report here
only results for a single class because they illustrate more clearly how the bottleneck moves
around from hardware to software for different configurations. The model was analyzed for three
hardware configurations: with one, four and six processors, respectively. We chose one and four
processors since the actual system had been run for such configurations, and six processors to see how the system would scale with additional processing power.
When running the system on a single processor (see Fig. 14), the replication of the
RequestHandler does not increase the maximum achievable throughput. This is due to the fact
that the processor is the system bottleneck. As known from [8], the replication of software
processes brings performance rewards only if there is unused processing capacity (which is not the case here).
A software server is said to be “utilized” both when it is doing effective work and when it is waiting to be
served by lower level servers (including the queueing for the included services). Fig. 17 and 18
show different contributions to the utilization of the tasks IOout and RequestHandler.

[Figure 14. Task utilizations for the 1-processor base case]

[Figure 15. Task utilizations for the 4-processor base case]

[Figure 16. Task utilizations for the 6-processor base case]

In the 1-processor base case (Fig. 14) the processor works at the highest achievable utilization, while each software task does little useful work. In the 4-processor and 6-processor base cases (Fig. 15 and 16) we notice that some tasks approach the maximum utilization level: tasks such as IOout and IOin (which are actually responsible for little useful work on behalf of a system request) reach critical levels of utilization due to serialization constraints in the software architecture. There are two reasons for the saturation of IOout (see Fig. 17): i) it spends most of its time waiting for another thread of control (which in LQN means waiting for task IOexec), and ii) IOexec in its turn is waiting for the semaphore and for the processor. Fig. 18 shows the contributions to the RequestHandler utilization in the 6-processors configuration, where the processor utilization reaches only modest levels: the RequestHandler is mostly busy while waiting for nested services, and only a small fraction of its utilization represents useful work done by others on its behalf.

[Figure 17. 4-processor base case: contributions to IOout utilization]

[Figure 18. Contributions to RequestHandler utilization]
We tried to eliminate the serialization constraints in two steps: first by making each filter a
process on its own (i.e., by removing StackExec and IOexec tasks in the LQN model), then by
splitting the pipeline buffer sitting between the IO process and the RequestHandler into two buffers. The
LQN model obtained after the 2-step modifications is shown in Fig. 19. The results of the “half-
way” modified system (after the first step only) are given in Fig. 20, and show no major
performance improvement (the processor is still used below capacity). By examining again the
utilization components of IOin and IOout (which are still the bottleneck) we found that they
wait most of the time to gain access to the Buffer (IOout is waiting about 90% of the time and IOin 80%). After applying both modification steps, though, the software bottleneck due to excessive serialization in the pipeline was removed, and the processor utilization went up again, as shown in Fig. 21 for the 4-processor and in Fig. 22 for the 6-processor configuration.

[Figure 19. LQN model of the system after removing the serialization constraints]
As expected, the maximum achievable throughput increased as well. The throughput increase
was rather small in the case of 4 processors (only 2.7%), and larger in the case of 6 processors
(10.3%), where there was more unused processing power. We also realized that in the case of
6 processors a new software bottleneck has emerged, namely the Database process (which is now
100% utilized). The new bottleneck, caused by a low-level server, has propagated upwards,
saturating all the software processes that are using it (all RequestHandler replications).
The final conclusion of the performance analysis is that different configurations will have
different bottlenecks, and by solving the bottleneck in one configuration, we shift the problem
somewhere else. What performance modelling has to offer is the ability to explore a range of
design alternatives, configurations and workload mixes, and to detect the causes of performance
limitations just by analyzing the model, before proceeding to change the real system.
[Figure 20. Task utilizations for the 4-processor half-way modified system]

[Figure 21. Task utilizations for the 4-processor fully modified system]

[Figure 22. Task utilizations for the 6-processor fully modified system]

7. Conclusions

The paper contributes toward bridging the gap between software architecture and performance analysis by showing how to build LQN performance models from the high-level software architecture of a system, by means of the set of transformations presented in the paper. The approach was illustrated by applying it to an existing telecommunication system. The analysis exposed weaknesses in the original architecture due to excessive serialization, which show up when more processing power is added
to the system. Surprisingly, software components that do relatively little work on behalf of a
system request can become the bottleneck in certain cases, whereas components that do most of
the work do not. After removing the serialization constraints, a new software bottleneck emerges,
which leads to the conclusion that the software architecture as it is does not scale up well. This
case study illustrates the usefulness of applying performance modelling and analysis to software
architectures.
References

[1] R. Allen, D. Garlan, "A Formal Basis for Architectural Connection", ACM Transactions on Software Engineering and Methodology, Vol. 6, No. 3, pp. 213-249, July 1997.

[2] G. Booch, J. Rumbaugh, I. Jacobson, The Unified Modeling Language User Guide, Addison-Wesley, 1999.

[3] F. Buschmann, R. Meunier, H. Rohnert, P. Sommerlad, M. Stal, Pattern-Oriented Software Architecture: A System of Patterns, Wiley, 1996.

[4] G. Franks, A. Hubbard, S. Majumdar, D. Petriu, J. Rolia, C.M. Woodside, "A toolset for performance engineering and software design of client-server systems", Performance Evaluation, Vol. 24, No. 1-2, 1995.

[7] J. Dilley, R. Friedrich, T. Jin, J. Rolia, "Measurement Tools and Modelling Techniques for Evaluating Web Server Performance", in Proc. of the 9th Int. Conference on Modelling Techniques and Tools for Performance Evaluation (R. Marie, B. Plateau, M. Calzarossa, G. Rubino, eds.), Springer, pp. 155-168, June 1997.

[9] D. Petriu, X. Wang, "Deriving Software Performance Models from Architectural Patterns", Proc. of the First International Workshop on Software and Performance (WOSP'98), Santa Fe, USA, pp. 107-119, Oct. 1998.

[11] J.A. Rolia, K.C. Sevcik, "The Method of Layers", IEEE Transactions on Software Engineering, Vol. 21, No. 8, pp. 689-700, Aug. 1995.

[13] M. Shaw, "Some Patterns for Software Architecture", in Pattern Languages of Program Design 2 (J. Vlissides, J. Coplien, N. Kerth, eds.), pp. 255-269, Addison-Wesley, 1996.

[14] C.U. Smith, Performance Engineering of Software Systems, Addison-Wesley, 1990.

[15] Proc. of the Int. Conference on Software Engineering and Knowledge Engineering (SEKE'98), pp. 146-151, 1998.

[16] L.G. Williams, C.U. Smith, "Performance Evaluation of Software Architectures", Proc. of the First International Workshop on Software and Performance (WOSP'98), Santa Fe, USA, Oct. 1998.

[17] C.M. Woodside, "Throughput Calculation for Basic Stochastic Rendezvous Networks", Performance Evaluation, Vol. 9, 1989.

[18] C.M. Woodside, J.E. Neilson, D.C. Petriu, S. Majumdar, "The Stochastic Rendezvous Network Model for Performance of Synchronous Client-Server-like Distributed Software", IEEE Transactions on Computers, Vol. 44, No. 1, pp. 20-34, Jan. 1995.

Appendix

[Table 1. Average CPU execution times per request for the software components, measured with Quantify]