[Figure: The Architecture Tradeoff Analysis Method, drawn as a cycle. Phase I: Scenario and Requirements Gathering (Collect Scenarios; Collect Requirements, Constraints, Environment). Describe Architectural Views; Realize Scenarios. Attribute-Specific Analyses. Phase IV: Tradeoffs (Identify Sensitivities; Identify Tradeoffs).]

...ence of other analyses focuses attention on the differences of the ATAM. We will analyze this system with respect to its availability, security, and performance attributes.

System Description

The RTS (remote temperature sensor) system exists to measure the temperatures of a set of furnaces, and to report those temperatures to an operator at some other location. In the original example the operator was located at a "host" computer. The RTS sends periodic temperature updates to the host computer, and the host computer sends control requests to the RTS to change the frequency at which periodic updates are sent. These requests and updates are done on a furnace-by-furnace basis; that is, each furnace can be reporting its temperature at a different frequency. The RTS is presumably part of a larger process control system; the control part of the system is not discussed in this example, however.

We are interested in analyzing the RTS for the qualities of performance, security, and availability. To illustrate these [...]

[Figure 2: The Architecture of a Furnace Server, comprising Furnace Tasks 1 through 16.]

Now that the server architecture has been described, we will present the overall system architectures. In each of the systems a set of 16 clients interacts with one or more servers, communicating via a network.

Architectural Option 1 (Client-Server)

Option 1 is the baseline: a simple and inexpensive client-server architecture, with a single server serving all 16 clients, as shown in Figure 3.

[Figure 3: Option 1's Architecture, with a single RTS Server connected to Furnace Clients 1 through 16.]

[...]

[Figure 5: Option 3's Architecture (with Cache), in which each of Furnace Clients 1 through 16 is fronted by an intelligent cache (IC).]

These then are our three initial architectural alternatives. To understand and compare them, we will analyze them using the ATAM. This method will aid us in understanding not only the relative strengths and weaknesses of each architecture, but will also provide a structured framework for eliciting requirements.

WCCL¹        ACPL      Jitter
41,120 ms    5,100 ms  20,400 ms

Table 1: Performance Analysis for Option 1

1. WCCL = worst-case control latency, ACPL = average-case periodic latency, and BCPL = best-case periodic latency.

A worst case control latency of 41.12 seconds sounds like a bad thing. However, is it? To answer this question we must understand the requirement better. How often will the worst case occur? Is it ever tolerable to have the worst case occur? For a safety-critical application, the answer might be "no". For an interactive Web-based application, the answer might be "yes", because the price of ensuring a smaller worst case
is prohibitive. Doing an analysis of a single quality attribute forces one to consider such requirements issues.

The worst case periodic latency is 37.12 seconds. However, the worst case scenario is unlikely: it assumes that all furnaces are queried at the maximum rate (T(i) = 10), that all periodic updates are issued simultaneously, and that the update being measured (the worst case update) is the last one in the queue. More importantly, in this application the cost of a missed update is not great; another one will arrive in the next T(i) seconds. Given these facts, we calculate the average case latency, to see if the system can meet its deadlines under more normal conditions, and accept the fact that an occasional periodic update might be missed.

Finally, we turn to PR3, the "Jitter" requirement. Jitter is the amount of time variation that can occur between consecutive periodic temperature updates. The requirement is that the jitter be not more than 2T(i), which is a minimum of 20 seconds for T(i) = 10. The interval between consecutive readings will be not more than 2T(i) if the difference between worst case and best case latency is not more than 2T(i), for this is an expression of jitter. So, the worst case jitter = WCPL - BCPL = 21,760 - 1,360 = 20,400 ms. This is greater than the minimum 2T(i) of 20 seconds, and so option 1 cannot meet PR3.
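The jitter check is simple enough to capture in a few lines. The sketch below (ours, not the authors' tooling) recomputes the option 1 result from the WCPL and BCPL values above:

```python
def worst_case_jitter_ms(wcpl_ms: float, bcpl_ms: float) -> float:
    # Jitter is the spread between worst- and best-case periodic latency.
    return wcpl_ms - bcpl_ms

def meets_pr3(wcpl_ms: float, bcpl_ms: float, t_sec: float) -> bool:
    # PR3: jitter between consecutive updates must not exceed 2*T(i).
    return worst_case_jitter_ms(wcpl_ms, bcpl_ms) <= 2 * t_sec * 1000

jitter = worst_case_jitter_ms(21_760, 1_360)
print(jitter)                        # 20400 ms
print(meets_pr3(21_760, 1_360, 10))  # False: 20,400 ms > 20,000 ms
```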
However, in evaluating architectural option 1's response to PR3, we must ask "What is the cost of a missed update?". Is it ever acceptable to violate this requirement? In some safety-critical applications the answer would be "no". In most applications, the answer would be "yes", provided that this occurrence was infrequent. The results of this evaluation force one to reconsider the importance of meeting PR3.

Performance Analysis of Option 2

The performance characteristics of architectural option 2 are summarized in Table 2.

WCCL         ACPL      Jitter
20,560 ms    2,550 ms  9,520 ms

Table 2: Performance Analysis for Option 2

One point should be noted here, and will be returned to later in this discussion: if one of the servers fails, option 2 has the performance and availability characteristics of option 1.
Performance Analysis of Option 3

The performance characteristics of architectural option 3 are summarized in Table 3.

WCCL         ACPL      Jitter
41,120 ms    5,200 ms  ≤20,400 ms

Table 3: Performance Analysis for Option 3

For this analysis, we have added a new factor: servicing the intelligent cache (adding a new update and recalculating the extrapolation model) takes 100 ms. In this case, the worst case jitter is exactly the same as for option 1, 20,400 ms. However, the intelligent cache exists to protect the client against some amount of lost data. As a consequence, it can bound the worst case jitter. When some pre-set time period elapses, the intelligent cache can pass a synthesized update to the client. When the actual update arrives, the cache updates its state accordingly. Thus, if we trust the intelligent cache, we can bound the worst case jitter to any desired value. The smaller the bounding value, the more likely a given update will be synthesized by the intelligent cache rather than coming directly from the server.
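The cache's behavior can be sketched as follows. This is our own illustrative reading of the mechanism: the class, its interface, and its linear extrapolation model are assumptions, since the paper describes the cache only at the level above.

```python
class IntelligentCache:
    """Client-side cache that synthesizes an update if no real one
    arrives within a pre-set bound, as described above. The linear
    extrapolation model here is an assumption for illustration."""

    def __init__(self, bound_s: float):
        self.bound_s = bound_s    # pre-set period; also the jitter bound
        self.last_t = None        # time of the last real update (seconds)
        self.last_temp = None     # last real temperature reading
        self.rate = 0.0           # estimated temperature slope (deg/s)

    def on_server_update(self, t: float, temp: float) -> None:
        # A real update arrived: refresh the extrapolation model and state.
        if self.last_t is not None and t > self.last_t:
            self.rate = (temp - self.last_temp) / (t - self.last_t)
        self.last_t, self.last_temp = t, temp

    def read(self, now: float) -> float:
        # If the bound has elapsed with no real update, hand the client a
        # synthesized (extrapolated) reading instead of stale data.
        if self.last_t is not None and now - self.last_t > self.bound_s:
            return self.last_temp + self.rate * (now - self.last_t)
        return self.last_temp
```

Shrinking bound_s tightens the jitter bound but raises the fraction of synthesized readings, which is exactly the trust tradeoff the analysis turns on.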
Critique of the Analysis

This simple performance analysis gives insight into the characteristics of each solution early in the design process, as befits an architectural level analysis. As more details are required, the analyses can be refined, using techniques such as RMA [5], SPE [8], simulation, or prototyping. More importantly, a high-level analysis guides our future investigations, highlighting potential performance "hot spots" and allowing us to determine areas of architectural sensitivity to performance, which lead us to the location of tradeoff points.

The ATAM thus promotes analysis at multiple resolutions as a means of minimizing risk at acceptable levels of cost. Areas of high risk are analyzed more deeply (perhaps simulated or prototyped) than the rest of the architecture. And each level of analysis helps determine where to analyze more deeply in the next iteration.

AVAILABILITY ANALYSES

We will initially consider only a single availability requirement for the RTS system:

AR1: The system must not be unavailable for more than 60 minutes per year.

The availability analysis considers a range of component failure rates, from 0 to 24 per year. We present only the results for the case of 24 failures per year. We also consider two classes of repairs, depending on the type of failure:

• major failures, such as a burned-out power supply, that require a visit by a hardware technician to repair, taking 1/2 a day; and
• minor failures, such as software bugs, that can be "repaired" by rebooting the system, taking 10 minutes.

To understand the availability of each of the architectural options, we built and solved a Markov model. In this analysis, we considered only server availability.

Availability Analysis of Option 1

Solving the Markov model for option 1 gives the results shown in Table 4: 279 hours of down time per year for the burned-out power supply and almost 4 hours down per year for the faulty operating system.

Repair Time   Failures/yr   Availability   Hrs down/yr
12 hours      24            0.96817        278.833
10 minutes    24            0.99954        3.9982

Table 4: Availability of Option 1
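Table 4's numbers are consistent with the steady state of a simple two-state (up/down) Markov chain, in which availability is MTTF / (MTTF + MTTR). The sketch below reproduces them; it is our reconstruction, since the paper does not publish its model:

```python
HOURS_PER_YEAR = 8760.0

def single_server(failures_per_year: float, repair_hours: float):
    """Steady-state availability of a two-state (up/down) Markov chain:
    A = MTTF / (MTTF + MTTR)."""
    mttf = HOURS_PER_YEAR / failures_per_year        # mean time to failure
    avail = mttf / (mttf + repair_hours)
    return avail, HOURS_PER_YEAR * (1.0 - avail)     # (availability, hrs down/yr)

print(single_server(24, 12.0))       # ~(0.96817, 278.83): Table 4, row 1
print(single_server(24, 10 / 60.0))  # ~(0.99954, 3.998):  Table 4, row 2
```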
Availability Analysis of Option 2

We would expect option 2 to have better availability than option 1, since each server acts as a backup for the other, and we expect the probability of both servers being unavailable to be small. Solving the Markov model for this architecture, we get the results shown in Table 5.

Repair Time   Failures/yr   Availability   Hrs down/yr
12 hours      24            0.99798        17.7327
10 minutes    24            ~1.0           0.0036496

Table 5: Availability of Option 2

Table 5 shows that option 2 now suffers almost 18 hours of down time per year in the burned-out power supply case. This indicates that architectural option 2 might still suffer outages if it encounters frequent hardware problems. On the other hand, option 2 shows near-perfect availability in the operating system reboot scenario. The availability is shown as a perfect 1.0 (the calculations were performed to 5 digits of accuracy); in the worst case of 24 annual failures, option 2 exhibits only 13 seconds of down time per year.
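These results match the steady state of a three-state birth-death chain (states = number of failed servers) with a single repair facility, which is our assumption about the unpublished model. A sketch:

```python
HOURS_PER_YEAR = 8760.0

def two_server(failures_per_year: float, repair_hours: float):
    """Birth-death chain for two servers sharing one repair facility.
    Failures occur at rate 2*lam with both servers up and lam with one
    up; repairs complete at rate mu. Solving the balance equations gives
    steady-state probabilities proportional to 1, 2*rho, 2*rho**2."""
    lam = failures_per_year / HOURS_PER_YEAR   # per-server failure rate
    mu = 1.0 / repair_hours                    # repair rate
    rho = lam / mu
    p_both_down = 2 * rho**2 / (1 + 2 * rho + 2 * rho**2)
    return 1.0 - p_both_down, HOURS_PER_YEAR * p_both_down

print(two_server(24, 12.0))       # ~(0.99798, 17.74):   Table 5, row 1
print(two_server(24, 10 / 60.0))  # ~(1.00000, 0.00365): Table 5, row 2
```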
Availability Analysis of Option 3

Considering architectural option 3, we expect that it will have better availability characteristics than option 1, but worse than option 2. This is because the intelligent cache, while providing some resistance to server failures, is not expected to be as trustworthy as an independent server. Solving the Markov model, we get the results shown in Table 6 for a cache that is assumed to be trustworthy for 5 minutes.

Repair Time   Failures/yr   Availability   Hrs down/yr
12 hours      24            0.96839        276.91
10 minutes    24            0.9997         2.66545

Table 6: Availability of Option 3

The results in Table 6 show that the 5 minute intelligent cache does little to improve option 3 over option 1 in the scenario with the burnt-out power supply: option 3 still suffers nearly 277 hours of down time per year. However, the results for the reboot scenario look more encouraging. The cache reduces down time to 2.7 hours per year. Thus, it appears that the intelligent cache, if its extrapolation were improved, might provide high availability at low cost (since this option uses a single server, compared with the replicated servers used in option 2). We return to this issue shortly.

CRITIQUE OF THE OPTIONS

Now that we have seen two different attribute analyses, one part of the method can be commented on: the level of granularity at which a system is analyzed. The ATAM advocates analysis at multiple levels of resolution as a means of minimizing risk at acceptable investments of time and effort. Areas that are deemed to be of high risk are analyzed and evaluated more deeply than the rest of the architecture. And each level of analysis helps to determine "hot spots" to focus on in the next iteration. We will illustrate this point next.

The three architectures can be partially characterized and understood by the measures that we have just derived. From this analysis, we can conclude the following:

• Option 1 has poor performance and availability. It is also the least expensive option (in terms of hardware costs; the detailed cost analyses can be found in [2]).
• Option 2 has excellent availability, but at the cost of extra hardware. It also has excellent performance (when both servers are functioning), and the characteristics of option 1 when a single server is down.
• Option 3 has slightly better availability than option 1, better performance than option 1 (in that the worst case jitter can be bounded), slightly greater cost than option 1, and lower cost than option 2.

The conclusions that our analyses lead us to also cause us to ask some further questions.

Further Investigation of Option 2

For example, we need to consider the nature of option 2 with a server failure. Given that option 2 is identical to option 1 when one server fails, and we have already concluded that option 1 has poor performance and availability, it is important to know how much time option 2 will be in that reduced state of service. When we calculate the availability of both servers, using our worst-case assumption of 24 failures per year, we expect to suffer over 22 days of reduced service.
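The 22-day figure falls out of the same birth-death sketch used above: it is the steady-state fraction of time with exactly one server down. Again, this relies on our reconstruction of the model:

```python
# Fraction of the year option 2 spends degraded (exactly one server down),
# under the same single-repair-facility assumptions as the earlier sketch.
lam = 24 / 8760.0                      # per-server failure rate (24/yr)
mu = 1 / 12.0                          # repair rate for 12-hour repairs
rho = lam / mu
p_one_down = 2 * rho / (1 + 2 * rho + 2 * rho**2)
print(365 * p_one_down)                # ~22.5 days/yr of reduced service
```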
Action Plan

Given this understanding of options 2 and 3, we see that none of them completely meets its requirements. While option 2 meets its availability target (for failures that involve rebooting the server), it leaves the system in a state where its performance targets cannot be met for more than 22 days per year. Perhaps a combination of options 2 and 3 (dual servers and an intelligent cache on the clients) would be a better alternative. This option would provide the superior availability and performance of option 2, but during the times when one server has failed, we mitigate the jitter problems of the single remaining server by using the intelligent cache.

We could not have made these design decisions without the knowledge gained from the analysis. Performing a multi-attribute analysis allows one to understand the strengths and weaknesses of a system, and of the parts of a system, within a framework that supports making design decisions.

SENSITIVITY ANALYSES

Given that the performance and availability of option 2 were so much better than option 1's, we would suspect that these attributes are sensitive to the number of servers. Sensitivity analysis confirms this: performance increases linearly as the number of servers increases (up to the point where there is 1 server per client), and availability increases by roughly an order of magnitude with each additional server [2].
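The order-of-magnitude claim for availability can be illustrated by generalizing the earlier birth-death sketch to n servers (still assuming one shared repair facility, which is our assumption, not the paper's stated model):

```python
from math import prod

HOURS_PER_YEAR = 8760.0

def all_down_hours(n: int, failures_per_year=24.0, repair_hours=12.0):
    """Hours per year with all n servers down. State k = number of failed
    servers; failures occur at rate (n - k) * lam, repairs at mu. The
    unnormalized steady-state weight of state k is prod((n - j) * rho)."""
    lam = failures_per_year / HOURS_PER_YEAR
    mu = 1.0 / repair_hours
    rho = lam / mu
    weights = [prod((n - j) * rho for j in range(k)) for k in range(n + 1)]
    return HOURS_PER_YEAR * weights[n] / sum(weights)

for n in (1, 2, 3, 4):
    print(n, round(all_down_hours(n), 4))
# 1 -> ~278.8, 2 -> ~17.7, 3 -> ~1.7, 4 -> ~0.2: roughly an order of
# magnitude less downtime with each additional server.
```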
which the intelligent cache’s data is trusted. This plot is
The three architectures can be partially characterized and
understood by the measures that we have just derived. From
this analysis, we can conclude the following:
• Option 1 has poor performance and availability. It is also
the least expensive option (in terms of hardware costs;
shown in Figure 6. an acceptable window of opportunity for an intruder, we
define initial values that are reasonable for the functions
provided in the RTS architectures. These are:
Attack Components Value
(hours/year)
Down time
successful
Prob of
Spoof IP address 0.9
Cache life (minutes) Kill Connection 0.75
Kill Server 0.25
Figure 6: Down time vs. Intelligent Cache Life
Table 7: Environmental Security Assumptions
As we can see, an improved intelligent cache does improve
availability. However, the rate of improvement in availability In addition, we will posit two attack scenarios: one where the
as a function of cache life is so small that no reasonable, intruder uses a “man in the middle” (MIM) attack, and one
achievable amount of cache improvement will result in the where the intruder uses a “spoof server” attack.
kind of availability demonstrated for option 2. In effect, the For the MIM attack, the attacker uses a TCP intercept tool to
intelligent cache is an architectural barrier with respect to modify the values of the temperatures during transmission.
availability, because it can not be made to achieve the levels Since there are no specific security countermeasures to this
of utility required of it. To put it another way, the availability attack, the only barrier is the 60 minute window of opportu-
of option 3 is not sensitive to cache life. To increase the nity and the 0.5 probability of success for the TCP intercept
availability substantially, other paths must be investigated. tool. Thus the rate of successful attack is 0.025 systems/
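This insensitivity is easy to see under a simple model (ours, and cruder than the paper's Markov model, so it will not match Table 6 exactly): if the cache masks each outage until its trust window t expires, and repair times are exponentially distributed, the expected downtime per year is roughly N * MTTR * exp(-t / MTTR). For 12-hour repairs, any achievable cache life barely moves this number:

```python
from math import exp

def downtime_hours(cache_life_min: float, failures_per_year=24,
                   repair_hours=12.0) -> float:
    """Expected downtime/yr if the cache masks the first cache_life_min
    minutes of each outage and repair times are exponential (a rough
    model of ours, not the paper's)."""
    mean_repair_min = repair_hours * 60.0
    residual_min = mean_repair_min * exp(-cache_life_min / mean_repair_min)
    return failures_per_year * residual_min / 60.0

for t in (0, 5, 15, 30, 60):
    print(t, round(downtime_hours(t), 1))
# 0 -> 288.0, 5 -> 286.0, 15 -> 282.1, 30 -> 276.3, 60 -> 265.0: the curve
# is nearly flat, so cache life is not a lever on hardware-failure downtime.
```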
SECURITY ANALYSES

Although we could have been conducting security analyses alongside the performance and availability analyses from the start, the ATAM does not require that all attributes be analyzed in parallel. The ATAM allows the designer to focus on those attributes that are considered to be primary, and then introduce others later on. This can lead to cost benefits in applying the method, since what may be costly analyses for some secondary attributes need not be applied to architectures that were unsuitable for the primary attributes. Though all analyses need not occur "up-front and simultaneously", the analyses for the secondary attributes can still occur well before implementation begins.

We will now analyze our three options in terms of their security. In particular, we will examine the connections between the furnace servers and clients, since these could be the subject of an attack. The object at risk is the temperature sent from the server to the client, since this information is used by the client to adjust the furnace settings. If the temperature is tampered with, it could be a significant safety concern. Thus we have the security requirement:

SR1: The temperature readings must not be corrupted before they arrive at the client.

Our initial security investigation of the architectural options must, once again, make some environmental assumptions. These assumptions are dependent on the operational environment of the delivered system and include factors such as operator training and patch management. These dependencies are out of scope for the analysis at this level of detail, but must be considered later in the design process.

So, to calculate the probability of a successful attack within an acceptable window of opportunity for an intruder, we define initial values that are reasonable for the functions provided in the RTS architectures. These are:

Attack Components                      Value
Prob. of successful Spoof IP address   0.9
Prob. of successful Kill Connection    0.75
Prob. of successful Kill Server        0.25

Table 7: Environmental Security Assumptions

In addition, we will posit two attack scenarios: one where the intruder uses a "man in the middle" (MIM) attack, and one where the intruder uses a "spoof server" attack.

For the MIM attack, the attacker uses a TCP intercept tool to modify the values of the temperatures during transmission. Since there are no specific security countermeasures to this attack, the only barrier is the 60 minute window of opportunity and the 0.5 probability of success for the TCP intercept tool. Thus the rate of successful attack is 0.025 systems/minute, or about 1.5 successful attacks expected in the window of opportunity.

For the spoof-server attack, there are three possible ways to succeed. The intruder could wait for a server to fail, then spoof that server's address and take over the client connections; this presumes that the intruder can determine when a server has failed and can take advantage of this before the clients time out. Another method would be to cause the server to fail (the "kill server" attack), then take over the connections. A third is to disrupt the connections between the client and server, then establish new connections as a spoofed server (the "kill connection" attack). For this analysis, it is presumed that the intruder is equally likely to attempt any of these methods in a given attack; the results are summarized in Table 8. Of course, these numbers appear precise, but they must be treated as estimates given the subjective nature of the environmental assumptions.

Attack Type       Expected Intrusions in 60 Mins
Kill Connection   2.04
Kill Server       0.66
Server Failure    0.0072

Table 8: Anticipated Spoof Attack Success Rates
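The shape of these expectation calculations is straightforward: expected successful intrusions in the window = attempt rate x window x per-attempt success probability. In the sketch below, the MIM line is grounded in the text (0.05 attempts/minute follows from the stated 0.025 successes/minute at probability 0.5); reusing that attempt rate for the spoof variants is a hypothetical placeholder of ours, since the paper does not state the rate behind Table 8, so those lines will not reproduce its values exactly:

```python
def expected_intrusions(attempts_per_min: float, p_success: float,
                        window_min: float = 60.0) -> float:
    # Expected number of successful intrusions in the window (a mean,
    # so values above 1 are meaningful).
    return attempts_per_min * window_min * p_success

print(expected_intrusions(0.05, 0.5))   # MIM: 0.025 successes/min * 60 = 1.5

# Spoof variants: per-attempt probabilities from Table 7; the 0.05
# attempts/minute rate is hypothetical, for illustration only.
for name, p in [("Kill Connection", 0.75), ("Kill Server", 0.25)]:
    print(name, expected_intrusions(0.05, p))   # 2.25 and 0.75,
                                                # vs. Table 8's 2.04 and 0.66
```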
It should be noted that if the system must deal with switching servers and reconnecting clients when a server goes in and out of service, it will be easier for an intruder to spoof a server and convince a client to switch to the bogus server. We will return to this point in the sensitivity analysis.

The results of this analysis show that in each case, it is expected that a penetration will take place within 60 minutes. For the MIM scenario, the expected number of successful attacks is 1.5, indicating that an intruder would have more than enough time to complete the attack before detection. For the spoof attack, the number of successful attacks ranges from 0.0072 to just over 2, again showing that a penetration using this technique is also likely.
Refined Architectural Options

To address the apparent inadequacy of the three options, we need to cycle around the steps of the ATAM, proposing new architectures. The modified versions of the options include the addition of encryption/decryption and the use of the intelligent cache as an intrusion detection mechanism, as shown in Figure 7.

[Figure 7: The Refined Architecture, in which the Furnace Server communicates with Furnace Clients 1 through 16 through encryption/decryption (E/D) units, each client retaining its intelligent cache (IC).]

Based on these assumptions, we can calculate the expected number of intrusions. Not surprisingly, the addition of encryption has reduced these substantially, by at least an order of magnitude, in each option:

Attack Type       Expected Intrusions in 60 Mins
Kill Connection   0.18225
Kill Server       0.03375
Server Failure    0.0006

Table 10: Spoof Attack Success Rates with Encryption

Our analysis of the intelligent cache changes only one environmental assumption: the "Attack Exposure Window" goes down to 5 minutes, since we assume that an operator can detect and respond to an intrusion in that time. Using this form of intrusion detection reduces the number of expected intrusions by 1-2 orders of magnitude, giving a result comparable to encryption, but at substantially lower performance and software/hardware costs:

Attack Type       Expected Intrusions in 60 Mins
Kill Connection   0.16875
Kill Server       0.05625
Server Failure    0.005

Table 11: Spoof Attack Success Rates with Intrusion Detection

At this point, new performance and availability analyses will need to be run to account for the additional functionality and hardware required by the intelligent cache or encryption modifications, thus instigating another trip around the spiral.

THE IMPLICATIONS OF THE ATAM

For every assumption that we make in a system's design, we trade cost for knowledge. For example, if a periodic update is supposed to arrive every 10 seconds, do we want it to arrive exactly every 10 seconds, on average every 10 seconds, or some time within each 10 second window? To give another example, consider the requirement detailing the worst case latency of control packets. As discussed earlier, is this worst case ever acceptable? If so, how frequently can we tolerate it? The process of analyzing architectural attributes forces us to try to answer these questions. Either we understand our requirements precisely or we pay for ignorance by over-engineering or under-engineering the system. If we over-engineer, we pay for our ignorance by making the system needlessly expensive. If we under-engineer, we face system failures, losing customers or perhaps even lives.
Can we believe the numbers that we generated in our analyses? No. However, we can believe the trends: we have seen differences among designs in terms of orders of magnitude. These differences, along with sensitivity analysis, tell us where to investigate further, where to get better environmental information, and where to prototype, which will get us numbers that we can believe. Every analysis step that we take precipitates new questions. While this seems like a daunting, never-ending prospect, it is manageable because these questions are posed and answered within an analytic attribute framework, and because in architectural analysis we are more interested in finding large effects than in precise estimates.

In addition to concretizing requirements, the ATAM has one other benefit: it helps to uncover implicit requirements. This occurs because attribute analyses are, as we have seen, interdependent: they depend, at least partially, on a common set of elements, such as the number of servers. However, in the past, they have been modeled as though they were independent. This is clearly not the case.

Each analyzed attribute has implications for other attributes. For example, although the availability analysis focused only on server availability, in a complete analysis we would look at potential failures of all components, including the furnaces, the clients, and the communication lines, and we would look at the various failure types. One such failure is dropping a message. If we assume that the communication channel is not reliable, then we might want to plan for re-sending messages. Doing so involves additional computation (to detect and re-send lost messages), storage (to store the messages until they have been successfully transmitted), and time (for a time-out interval and for message re-transmission). Thus one of the major implications of this availability concern is that the performance models of the options under consideration need to be modified.

To recap, we discover attribute interactions in two ways: by using sensitivity analysis to find tradeoff points, and by examining the assumptions that we make for analysis A while performing analysis B. The "no dropped packets" assumption is one example of such an interaction. This assumption, if false, may have implications for safety, security, and availability. A solution to dropping packets will have implications for performance.

In the ATAM, attribute experts independently create and analyze their models; then they exchange information (clarifying or creating new requirements). On the basis of this information they refine their models. The interaction of attribute-specific analyses, and the identification of tradeoffs, has a greater effect on system understanding and stakeholder communication than any of those analyses could have on their own.

The complexity inherent in most real-world software design implies that an architecture tradeoff analysis will rarely be a straightforward activity that allows you to proceed linearly to a perfect solution. Each step of the method answers some design questions and brings some issues into sharper focus. However, each step often raises new questions and reveals new interactions between attributes, which may require further analysis, sometimes at different levels of abstraction. Such obstacles are an intrinsic part of a detailed, methodical exploration of the design space and cannot be avoided. Managing the conflicts and interactions that are revealed by the ATAM places heavy demands on the analysis skills of the individual attribute experts. Success largely depends upon the ability of those experts to transcend barriers of differing terminology and methodology, to understand the implications of inter-attribute dependencies, and to jointly devise candidate architectural solutions for further analysis. As burdensome as this may appear to be, it is far better to intensively manage these attribute interactions early in the design process than to wait until some unfortunate consequences of the interactions are revealed in a deployed system.

CONCLUSIONS

The ATAM was motivated by a desire to make rational choices among competing architectures, based upon well-documented analyses of system attributes at the architectural level, concentrating on the identification of tradeoff points. The ATAM also serves as a vehicle for the early clarification of requirements. As a result of performing an architecture tradeoff analysis, we have an enhanced understanding of, and confidence in, a system's ability to meet its requirements. We also have a documented rationale for the architectural choices made, consisting of both the scenarios used to motivate the attribute-specific analyses and the results of those analyses.

Consider the RTS case study: we began with vague requirements and enumerated three architectural options. The analytical framework helped determine the useful characteristics of each of the architectural options and highlighted the costs and benefits of the architectural features. More importantly, the ATAM helped determine the locations of architectural tradeoff points, which helped us understand the limits of each option. This helped us develop informed action plans for modifying the architecture, leading to new evaluations and new iterations of the method.
REFERENCES

[1] M. Barbacci, M. Klein, C. Weinstock, "Principles for Evaluating the Quality Attributes of a Software Architecture", CMU/SEI-96-TR-36, 1996.

[2] M. Barbacci, J. Carriere, R. Kazman, M. Klein, H. Lipson, T. Longstaff, C. Weinstock, "Architecture Tradeoff Analysis: Managing Attribute Conflicts and Interactions", CMU/SEI-97-TR-29, 1997.

[3] B. Boehm, "A Spiral Model of Software Development and Enhancement", ACM Software Eng. Notes, 11(4), 22-42, 1986.

[4] R. Kazman, G. Abowd, L. Bass, P. Clements, "Scenario-Based Analysis of Software Architecture", IEEE Software, Nov. 1996, 47-55.

[5] M. Klein, T. Ralya, B. Pollak, R. Obenza, M. Gonzales Harbour, A Practitioner's Handbook for Real-Time Analysis, Kluwer Academic, 1993.

[6] H. Lipson, T. Longstaff (eds.), Proceedings of the 1997 Information Survivability Workshop, IEEE CS Press, 1997.

[7] J. McCall, "Quality Factors", in (J. Marciniak, ed.), Encyclopedia of Software Engineering, Vol. 2, Wiley: New York, 1994, 958-969.

[8] C. Smith, L. Williams, "Software Performance Engineering: A Case Study Including Performance Comparison with Design Alternatives", IEEE Transactions on Software Engineering, 19(7), 720-741, 1993.