White Paper: Operational Acceptance Testing
Business continuity assurance
December 2012
Dirk Dach, Dr. Kai-Uwe Gawlik, Marc Mevert
SQS Software Quality Systems
Contents
1. Management Summary
2. Introduction
6. Acceptance Process
8. Bibliographical References
1. Management Summary
Operational Acceptance is the last line of defence between a software development project and the productive use of the
software. If anything goes wrong after the handover of the finished project to the application owner and the IT operation
team, the customer's business will immediately be affected by any negative consequences. In order to minimise the risk of
going live, operational acceptance testing is the instrument of choice. The ISTQB defines Operational Acceptance Testing (OAT)
as follows:
Operational testing in the acceptance test phase, typically performed in a (simulated) operational environment by
operations and / or systems administration staff focusing on operational aspects, e.g. recoverability, resource-behaviour,
installability and technical compliance.
Definition 1: Operational acceptance testing (International Software Testing Qualifications Board, 2010)
This definition points out the growing importance of activities connected with OAT at a time of increasing cloud computing,
which poses additional risks for business continuity. The present white paper will give a complete overview of OAT with respect
to all relevant quality aspects. It will show that OAT is not only restricted to a final acceptance phase but can be implemented
systematically following best practices so as to minimise the risks for day one and beyond.
2. Introduction
When the project teams have completed the development of the software, it is released and handed over to the operation
team and the application owner. It immediately becomes part of the business processes as it is launched into the production
environments. Consequently, known and unknown defects of the software will directly impact business continuity
and potentially cause damage to a greater or lesser extent. In addition, responsibility typically is transferred from the
development units to two stakeholders:
Application Owner: The single point of contact for business units concerning the operation of dedicated applications.
These units are the internal part of the line organisation managing the maintenance, and constitute the interface to the
operation team.
Operation Team: An internal or external team deploying and operating software following well-defined processes, e.g.
tailored by ITIL (APM Group Limited, 2012). These units are equipped with system and application utilities for managing
the operation, and they will contact the application owners if bigger changes should be necessary to guarantee business
continuity.
OAT subsumes all test activities performed by application owners and operation teams to arrive at the acceptance decision
with respect to their capability to operate the software under agreed Service Level Agreements and Operation Level
Agreements (SLAs / OLAs). These agreements provide measurable criteria which, if they are fulfilled, implicitly ensure
business continuity. Operation capability covers three groups of tasks:
Transition: This designates the successful deployment of software into the production environment using existing system
management functionality. The production environment could be a single server or thousands of workstations in agencies.
Software deployment includes software, database updates, and reorganisation but also fallback mechanisms in case of
failures. The operation team has to be trained and equipped with the respective tools.
Standard Operation: This designates successful business execution and monitoring on both the system and the
application level. It includes user support (helpdesk for user authorisation and incident management) and operation in case
of non-critical faults, e.g. switching to a mirror system in case of failure analysis on the main system.
Crisis Operation: System instability causes failures and downtimes. Operation teams must be able to reset the system and
its components into defined states and continue stable standard operation. Downtimes have to be as short as possible, and
reset must be achievable without data or transaction loss and/or damage.
Operational activities are divided between the application owner and the operation team, depending on the individual
organisation, and due to compliance requirements they must be traceable. Division of work has to be taken into account for
OAT.
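The idea that SLAs / OLAs provide measurable criteria for the acceptance decision can be illustrated with a minimal Python sketch. The criterion names, limits, and measured values are assumptions chosen for illustration (one per operation mode); they do not come from any particular agreement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One measurable SLA / OLA criterion (names and limits are hypothetical)."""
    name: str
    limit: float
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.limit
        return measured <= self.limit


def violated(criteria, measurements):
    """Return the names of criteria that are violated or were not measured."""
    failed = []
    for criterion in criteria:
        value = measurements.get(criterion.name)
        if value is None or not criterion.is_met(value):
            failed.append(criterion.name)
    return failed


# Assumed example values, one criterion per operation mode.
criteria = [
    Criterion("deployment_duration_min", 120.0, higher_is_better=False),  # transition
    Criterion("availability_percent", 99.5, higher_is_better=True),       # standard operation
    Criterion("restart_duration_min", 30.0, higher_is_better=False),      # crisis operation
]
measurements = {
    "deployment_duration_min": 95.0,
    "availability_percent": 99.7,
    "restart_duration_min": 42.0,
}

failed = violated(criteria, measurements)
print(failed)  # ['restart_duration_min']
```

A criterion without a measurement is treated as violated, so that gaps in test coverage cannot silently pass the acceptance decision.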
For many years now, functional testing has been performed systematically, and its methods can directly be applied to OAT:
Risk-based approach: Test effort is spent on components involving the most probable and greatest possible damage. This
is done in order to achieve the highest quality on a given budget.
Early error detection: Test activities are executed as soon as possible in the software development life cycle because
the earlier any defects are found the less will be the cost of correction. OAT activities are allocated to all phases of the
software development and coupled with quality gates to follow the cost-saving principle (see Figure 1).
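The risk-based approach can be sketched in a few lines of Python. The sketch assumes that risk is scored as probability multiplied by damage and that test effort is allocated greedily in descending order of risk until the budget is used up; the component names, scores, and effort figures are hypothetical.

```python
def prioritise(components, budget_days):
    """Pick components by descending risk (probability * damage) within the budget."""
    ranked = sorted(
        components,
        key=lambda c: c["probability"] * c["damage"],
        reverse=True,
    )
    selected, remaining = [], budget_days
    for component in ranked:
        if component["effort_days"] <= remaining:
            selected.append(component["name"])
            remaining -= component["effort_days"]
    return selected


# Hypothetical components with assumed failure probability, damage score and test effort.
components = [
    {"name": "backup_job", "probability": 0.2, "damage": 9, "effort_days": 3},
    {"name": "installer", "probability": 0.5, "damage": 6, "effort_days": 4},
    {"name": "monitoring", "probability": 0.3, "damage": 4, "effort_days": 2},
    {"name": "help_texts", "probability": 0.6, "damage": 1, "effort_days": 2},
]

plan = prioritise(components, budget_days=8)
print(plan)  # ['installer', 'backup_job']
```

The greedy selection is only one possible strategy; a project might just as well weight damage more heavily or reserve a fixed share of the budget for low-risk components.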
The present paper gives an overview of how to use ISO 25000 (ISO/IEC JTC 1/SC 7 Software and Systems Engineering, 2010)
to scope out OAT systematically, how to adjust activities along the software life cycle, and how to apply test methods
successfully by having application owners and operation teams involve other stakeholders like architects, developers, or
infrastructure.
[Figure 1: Operational Acceptance Testing coupled with quality gates along the phases Analysis, Design, Implementation, Functional Testing, and Operation. Further labels: Business, Operation Teams, Application Owners, Transition, Standard Operation, Crisis Operation.]
From a tester's perspective, many companies neglect the definition of requirements for operating software systems. Quite
often, not just defects but also gaps in mandatory functionality are only identified shortly before release. These features
will not only be missing in production, they already cause delays when executing E2E tests. For this reason, the functional
and non-functional requirements of operation teams have to be systematically managed, implemented, and tested, just like
business functions.
In times of cloud computing and operation outsourcing, it is of great importance to support business continuity and ensure
trust by verifying operation under agreed SLAs / OLAs. Monitoring in particular is crucial to achieve transparency and obtain
early indicators in order to avoid incidents. OAT should follow a systematic approach so as to mitigate the risks of cloud
providers and outsourcing partners: companies which offer services using these types of sourcing may not notice
incidents themselves, whereas their clients will feel the impact directly.
[Table: OAT test types (among them Code Analysis, Installation Testing, Security Testing, Failover Testing, and Recovery Testing) mapped to the operation modes Transition, Standard, and Crisis and to the following ISO 25010 quality characteristics.]
Functional suitability: Functional completeness, Functional correctness, Functional appropriateness
Performance efficiency: Time behaviour, Resource utilisation, Capacity
Compatibility: Coexistence, Interoperability
Usability: Appropriateness recognisability, Learnability, Operability, User error protection, User interface aesthetics, Accessibility
Reliability: Maturity, Availability, Fault tolerance, Recoverability
Security: Confidentiality, Integrity, Non-repudiation, Accountability, Authenticity
Maintainability: Modularity, Reusability, Analysability, Modifiability, Testability
Portability: Adaptability, Installability, Replaceability
Test activities can be allocated according to their main focus on the transition, standard, or crisis mode. However,
load and performance (L&P) testing, for instance, addresses standard operation as well as the crisis mode, the latter including
the case of system load entering the stress region with the objective of crashing the system. Consequently, there is a certain
overlap as to which mode is addressed. Test preparation and execution are performed all along the software life cycle so that
defects are detected as early as possible. At the end of the design phase, architecture analysis applying ATAM (Software
Engineering Institute, 2012) or FMEA (Wikipedia contributors, 2012) based on use cases helps to verify that all recovery
requirements can be fulfilled. Moreover, these use cases can be reused as test cases for execution during the test phase.
Input: A checklist showing the relevant documents and (if possible) the documents under review should be available.
Depending on the system, the documents included in the checklist may vary, such as:
Blueprints (including hardware and software)
Implementation plan (Go-live Plan, i.e. the activities required to go live mapped onto a timeline)
Operational documentation (providing guidance on how to operate the system in the transition, standard/daily and crisis
modes)
Architectural overviews (technical models of the system to support impact analysis and customisation to specific target
environments)
Data structure explanations (data dictionaries)
Disaster recovery plans
Business continuity plans
Test Approach: The task of the OAT tester is to ensure that these documents are available, finalised, and of sufficient quality
to enable smooth operation. The documents must be handed over and accepted by the teams handling the production
support, helpdesk, disaster recovery, and business continuity. The handover must be documented. Operation teams must be
included as early as possible to obtain their valuable input about the documentation required for operating the system. This
way, transparency about the completeness of documents is achieved early on in the development process, and there will be
sufficient time left for countermeasures if documents are missing or lacking in quality.
Output: A protocol records the handover to the operation team, including any missing documents and open issues in the
documentation that still need to be resolved before going live.
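The checklist review and the resulting handover protocol could be tracked with a small script such as the following Python sketch. The document names are taken from the checklist above; the structure of the review results (a "final" flag and a list of issues per document) is an assumption made for illustration.

```python
REQUIRED_DOCUMENTS = [
    "Blueprints",
    "Implementation plan",
    "Operational documentation",
    "Architectural overviews",
    "Data structure explanations",
    "Disaster recovery plans",
    "Business continuity plans",
]


def handover_protocol(review_results):
    """Summarise missing documents, drafts and open issues for the handover."""
    missing = [d for d in REQUIRED_DOCUMENTS if d not in review_results]
    not_final = [d for d, r in review_results.items() if not r["final"]]
    open_issues = {d: r["issues"] for d, r in review_results.items() if r["issues"]}
    return {
        "missing": missing,
        "not_final": not_final,
        "open_issues": open_issues,
        "ready": not (missing or not_final or open_issues),
    }


# Hypothetical review results; a document without an entry has not been delivered.
review_results = {
    "Blueprints": {"final": True, "issues": []},
    "Implementation plan": {"final": False, "issues": ["timeline lacks fallback step"]},
    "Operational documentation": {"final": True, "issues": []},
    "Architectural overviews": {"final": True, "issues": []},
    "Data structure explanations": {"final": True, "issues": []},
    "Disaster recovery plans": {"final": True, "issues": ["contact list outdated"]},
}

protocol = handover_protocol(review_results)
print(protocol["missing"])  # ['Business continuity plans']
```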
OAT Test Environment: In order to review the documents, no special test environment is required. It is, however, essential for
document management to trace the right versions and record the review process.
Risk of Not Executing: If the operational documentation review is omitted or performed too late, testing will start without
the final documentation available which reduces the effectiveness of the tests. As a result, an increased number of issues may
be raised, causing delays in the timeline. For operation teams, not having the correct documentation can affect their ability to
maintain, support, and recover the systems.
OAT Test Environment: Depending on the programming language and tool, code analysis is performed directly inside an IDE
or on a separate platform using extracted code.
Risk of Not Executing: During incidents or in problem cases, there is a high risk of side effects. Moreover, problems can only
be fixed with delays due to the great effort involved in regression testing.
Test Approach:
Deriving principal scenarios from operational models
Refining scenarios to sequence planning with respect to each role
Preparing and executing the test with the operation teams
Deciding on acceptance or rejection
Output: Software or process changes are carried out in order to correct defects. Those that cannot be corrected but are not
critical are collected and communicated to the helpdesk, business units and users.
OAT Test Environment: A dress rehearsal requires a production-like E2E test environment. Alternatively, a future operation
system not yet in use can be employed for dress rehearsal purposes. Activities in the crisis mode, however, may tear down
complete environments so that they will need to be refreshed extensively. There should be dedicated environments set up for
crisis mode tests in order to avoid delays of parallel functional testing.
Risk of Not Executing: Initially, processing will be performed by insufficiently skilled staff. Therefore, it is highly probable
that incidents may go unnoticed and data may be irreversibly corrupted. There is a high risk of running into disaster after
passing the point of no return.
Parties Involved: Since the installer is part of the software, it is provided by the projects. Application owner, development
and operation team should collaborate closely when creating and testing the installer, while analysis may also be performed
by an independent tester or testing team. In the early stages, the focus is on the installer. Later on, the integration into
system management functions (e.g. automated deployment to 60,000 workstations) is addressed using tested installers.
Time: Irrespective of performing installation tests when the application is ready to be deployed, the required resources
(registry entries, libraries) should be discussed as early as possible with operation teams in order to recognise and avoid
potential conflicts (e.g. with other applications, security policies). Installation testing in different test environments can be a
systematic part of the software's path from development to production.
Input: What is required are target systems with different configurations (e.g. created from a base image), as well as the
specification of changes to the operating system (e.g. registry entries), and a list of additional packages that need to be
installed. Additional input is provided by a checklist of necessary test steps (see Test Approach), as well as installation
instructions listing prerequisites and run-time dependencies as a basis for the creation of test scenarios.
Test Approach:
The following aspects must be checked to ensure correct installation and de-installation:
The specified registry entries and number of files available need to be verified after the installation is finished. Registry
entries and installed files should be removed completely using the de-installation routine. This will ensure that the software
can be removed quickly (returning the system to its previous state) in case any problems arise after the installation. The
same applies to additional software that may be installed with the application (e.g. language packs or frameworks).
The application must use disk space as specified in the documentation in order to avoid problems with insufficient space
on the hard disk.
If applicable, the installation over old(er) version(s) must be tested as well: the installer must correctly detect and remove
or update old resources.
The occurrence of installation breaks has to be tested in each installer step, as well as breaks due to other system events
(e.g. network failure, insufficient resources). The application and the operating system must be in a defined and consistent
state after every possible installation break.
Handling shared resources is another issue. Shared resources may have to be installed or updated during installation, and
while these processes are performed, conflicts with other applications must be avoided. In terms of deinstallation, the
system should be cleaned from shared resources that are not used any more. This will result in higher performance and
increased security for the whole system.
Since the installation routine is an integral part of the application and a potentially complicated software process, it is
subject to the regulations of ISO 25010.
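The check that installed files match the specification and are removed completely by the de-installation routine can be sketched in Python by comparing file system snapshots. The functions fake_install and fake_uninstall are hypothetical stand-ins for a real installer; the uninstaller deliberately leaves a language pack behind so that the check reports a defect. Registry entries are outside the scope of this sketch.

```python
import tempfile
from pathlib import Path


def snapshot(root):
    """Return the relative paths of all files and directories below root."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*")}


def fake_install(root):
    # Hypothetical stand-in for the installer under test.
    (root / "app").mkdir()
    (root / "app" / "app.bin").write_text("binary")
    (root / "app" / "lang_de.pak").write_text("language pack")


def fake_uninstall(root):
    # Hypothetical stand-in that forgets the language pack (a defect).
    (root / "app" / "app.bin").unlink()


# Assumed specification of what the installer has to create.
EXPECTED = {"app", "app/app.bin", "app/lang_de.pak"}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = snapshot(root)
    fake_install(root)
    installed = snapshot(root) - before
    fake_uninstall(root)
    leftovers = snapshot(root) - before

print(sorted(EXPECTED - installed))  # [] means everything specified was installed
print(sorted(leftovers))             # ['app', 'app/lang_de.pak']
```

The same before/after comparison could be applied to an image-based target system, with the snapshot taken of the relevant directories only.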
Parties Involved: Normally, operation teams, infrastructure teams or vendors are responsible for central components and
are following the market cycles of producers. When changes have been made, a regression test has to be performed by
application testing in projects or line organisation (application owner).
Time: There are two possible approaches for introducing central components: The first approach would be to set up central
components as productive within the development system, i.e. central components would move parallel to the application
software along the test stages towards a release date according to a common release plan. Testing would start implicitly
with developer tests. The second approach would be to test changing a central component in a production-like maintenance
landscape. In this case, a dedicated regression test would be performed parallel to production. Central components would be
released for both operation and development.
Input: This requires an impact analysis for introducing central components, and a regression test for the applications.
Test Approach:
Deriving relevant applications from impact analysis
Selecting regression tests on the basis of risk assessment
Performing regression tests (including job processing)
parallel to development
in a dedicated maintenance environment
Deciding on acceptance or rejection
Output: This method yields regression test results and locates possible defects.
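The first two steps of the test approach, deriving the relevant applications from the impact analysis and selecting regression tests by risk, can be sketched as follows in Python. The dependency map, risk classes, and suite names are hypothetical; in practice they would come from a configuration management database or the project's risk assessment.

```python
# Hypothetical impact analysis: which application uses which central component.
USES = {
    "billing": {"database", "scheduler"},
    "portal": {"web_server", "database"},
    "archive": {"file_store"},
}
# Assumed risk class per application and available regression suites.
RISK = {"billing": "high", "portal": "medium", "archive": "low"}
SUITES = {
    "billing": ["invoice_run", "nightly_jobs"],
    "portal": ["login", "order_entry"],
    "archive": ["retrieve_document"],
}
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def affected_applications(changed_component):
    """Applications that depend on the changed central component."""
    return sorted(app for app, comps in USES.items() if changed_component in comps)


def select_regression_tests(changed_component, min_risk="medium"):
    """Regression suites of affected applications at or above the given risk class."""
    return {
        app: SUITES[app]
        for app in affected_applications(changed_component)
        if RISK_ORDER[RISK[app]] >= RISK_ORDER[min_risk]
    }


selection = select_regression_tests("database")
print(selection)  # {'billing': ['invoice_run', 'nightly_jobs'], 'portal': ['login', 'order_entry']}
```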
OAT Test Environment: Depending on the approach, central component testing is performed in existing project test
environments or in a dedicated maintenance test environment.
Risk of Not Executing: System downtimes, missing fallbacks, and unresolvable data defects are all probable consequences.
This is of crucial importance for the use of external vendors or cloud computing.
Objective: The aim of this test type is to enable operation teams to process new or changed flows and monitor progress by
applying monitoring and logging functionality. In this context, especially changed or new operation control is checked for its
operability. In case errors occur during tests, the recovery functionality is either trained or implemented additionally. The
helpdesk is made familiar with the system, which reduces support times after the release of the software.
Parties Involved: The E2E test management is responsible for tests. Operation teams are involved in the definition of test
scenarios by setting control points as well as by activation or deactivation of trace levels. In addition, processing is performed
in a production-like manner by staff who will later be responsible for performing the same activities in a production
environment.
Time: E2E test environment operation is synchronous to E2E tests. Therefore, test preparation can begin as soon as the
business concepts have been described and the operation functionality has been designed.
Input: The operational activities are determined on the basis of the business concepts, the operation design, and various
handbooks.
Test Approach:
Deriving a functional test calendar from E2E test cases
Creating the operation calendar by supplementing the test calendar with operational activities
Preparing and executing the test completely in a production-like manner, following the operation calendar
Evaluating the test
Correcting the operational procedures and job control
Deciding on acceptance or rejection
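Creating the operation calendar by supplementing the test calendar with operational activities amounts to merging two time-ordered schedules. A minimal Python sketch, with hypothetical dates and entries, could look like this:

```python
import heapq
from datetime import datetime

# Hypothetical functional test calendar derived from E2E test cases (sorted by time).
test_calendar = [
    (datetime(2012, 12, 3, 9, 0), "test", "E2E case: create order"),
    (datetime(2012, 12, 4, 9, 0), "test", "E2E case: monthly invoice"),
]
# Hypothetical operational activities (sorted by time).
operational_activities = [
    (datetime(2012, 12, 3, 22, 0), "operation", "Nightly batch run, check job log"),
    (datetime(2012, 12, 4, 7, 0), "operation", "Set trace level, verify control point"),
]

# heapq.merge expects both inputs to be sorted and yields one sorted sequence.
operation_calendar = list(heapq.merge(test_calendar, operational_activities))

for when, kind, activity in operation_calendar:
    print(when.isoformat(), kind, activity)
```

The merged calendar is then the script for executing the test in a production-like manner.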
Output: As a result of this test, operation handbooks are corrected, and the operation team is fully trained for operation.
Corrected operational procedures and job control are implemented. Moreover, activities can be derived which serve close
environment monitoring during the first few weeks after the release of the software.
OAT Test Environment: Production-like E2E test environment operation is only possible in highly integrated test
environments that employ production-like processing and system management functions. There is no need, however, for
production-like sizing of the respective environment.
Risk of Not Executing: Initially, processing will be performed by insufficiently skilled staff. Therefore, it is highly probable
that incidents may go unnoticed and data may be irreversibly corrupted.
Parties Involved: The non-functional test management is responsible for the load and performance tests, while operation
teams are involved in the definition of test requirements in order to assess a system from the performance point of view.
Time: Load and performance testing is a discipline which comes into play very early in the software development life cycle,
starting with infrastructure evaluation and applying simple load scenarios (ping and packages) through to the testing of
prototypes and simulating realistic loads in production-like environments. Operational activities grow along this path but
already start very early on during the design phase.
Input: The basis for the operational activities is provided by the existing infrastructure landscape and by the requirements
for performance monitoring derived from system management functions which are supported by a dedicated tool set-up. The
scenarios are derived from the operation handbooks.
Test Approach:
Collecting test requirements from an operational point of view
Integrating requirements into the load model
Integrating requirements into the monitoring model
Preparing the test environment and test data
Executing and analysing the test
Defining mitigation scenarios for performance risks
Deciding on acceptance or rejection
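The "executing and analysing" step can be illustrated with a small Python sketch that fires concurrent requests at a callable and evaluates the 95th percentile of the response times against a limit. The limit of 300 ms, the nearest-rank percentile method, and the sample data are assumptions for illustration; real load tests would use a dedicated tool and the load model agreed with the operation team.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure(call, requests, concurrency):
    """Execute call() `requests` times with `concurrency` workers; return durations in ms."""
    def timed(_):
        start = time.perf_counter()
        call()
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(requests)))


def percentile(samples, percent):
    """Nearest-rank percentile; percent is an integer between 1 and 100."""
    ordered = sorted(samples)
    rank = (percent * len(ordered) + 99) // 100  # integer ceiling of percent/100 * n
    return ordered[rank - 1]


# Assumed response times in milliseconds from one test run.
samples = [120, 130, 125, 140, 135, 150, 145, 160, 155, 170,
           165, 180, 175, 190, 185, 200, 195, 210, 480, 950]
P95_LIMIT_MS = 300  # hypothetical performance requirement

p95 = percentile(samples, 95)
print(p95, "accepted" if p95 <= P95_LIMIT_MS else "rejected")  # 480 rejected
```

A call such as `measure(lambda: time.sleep(0.01), requests=50, concurrency=5)` would produce a sample list for a stand-in workload.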
Output: As a result of the test, the operation handbooks are corrected, the operation team is trained to operate, and
activities to prepare the operational environment for the new software release are defined. Moreover, activities for intensive
environment monitoring during the first few weeks after the software is released can be derived.
OAT Test Environment: The test environments depend on the different test stages and are accompanied by dedicated load
and performance tests. There is a strong demand for performance tests in production-like environments applying production
processing and monitoring.
Risk of Not Executing: Performance issues may not be recognised after going live. In case of problems, no adequate
measures will have been prepared which can be applied fast. In a worst case scenario, an environment which is not prepared
to host an application that fulfils performance requirements would negatively influence business.
Output: A protocol records all the checked potential security problems (see Test Approach) and security issues that have
been found. It also includes suggestions for resolving these issues.
OAT Test Environment: OAT security testing has to be performed in the E2E test environment and live system.
Risk of Not Executing: There is a high risk that security vulnerabilities created during the last steps in the application
development life cycle (test data, debug functionality, misconfiguration) will affect the security of the system.
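A check for the leftovers named above (test data, debug functionality, misconfiguration) could be automated along the lines of the following Python sketch. The configuration keys, the list of weak passwords, and the naming convention for test accounts are assumptions made for illustration and do not refer to any specific product.

```python
def release_findings(config, accounts):
    """Return a list of findings that should be resolved before going live."""
    findings = []
    if config.get("debug_enabled", False):
        findings.append("debug functionality is enabled")
    if str(config.get("log_level", "")).upper() in {"DEBUG", "TRACE"}:
        findings.append("verbose log level: " + str(config["log_level"]))
    if config.get("admin_password") in {"", "admin", "changeme"}:
        findings.append("default or empty admin password")
    for account in accounts:
        # Assumed convention: test accounts start with "test".
        if account.lower().startswith("test"):
            findings.append("test account present: " + account)
    return findings


# Hypothetical configuration and account list of a system under test.
config = {"debug_enabled": True, "log_level": "info", "admin_password": "changeme"}
accounts = ["alice", "test_user01"]

findings = release_findings(config, accounts)
for finding in findings:
    print(finding)
```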
Comparing original artefacts with restored ones, and also analysing the backup logs
If applicable, performing a roll-forward and checking again
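Comparing original artefacts with restored ones can be automated by comparing checksums. The following Python sketch uses SHA-256 hashes and two temporary directories as stand-ins for the original and the restored data; the file names and contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def checksums(root):
    """Map each file's relative path to the SHA-256 digest of its content."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def compare(original, restored):
    """Report files that are missing, unexpected or changed after the restore."""
    a, b = checksums(original), checksums(restored)
    return {
        "missing": sorted(a.keys() - b.keys()),
        "unexpected": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


with tempfile.TemporaryDirectory() as orig_dir, tempfile.TemporaryDirectory() as rest_dir:
    original, restored = Path(orig_dir), Path(rest_dir)
    # Hypothetical original artefacts.
    (original / "a.txt").write_text("accounts")
    (original / "b.txt").write_text("orders")
    (original / "c.txt").write_text("invoices")
    # Simulated restore: one file altered, one file missing.
    (restored / "a.txt").write_text("accounts")
    (restored / "b.txt").write_text("orders (truncated)")
    result = compare(original, restored)

print(result)  # {'missing': ['c.txt'], 'unexpected': [], 'changed': ['b.txt']}
```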
Output: As a result of backup and restore testing, missing components, defects in existing components, and corrections to be
made in handbooks and manuals are identified. Also defects with respect to requirements are identified.
OAT Test Environment: If backup and restore functionality is available, testing can in principle be executed parallel to early
functional testing. However, since the tests will involve planned downtimes or phases of exclusive usage of environments,
additional test environments will be set up temporarily in order to avoid time delays in the functional testing area. Moreover,
this activity will require the following:
Representative test data
Established backup infrastructure
Established restore infrastructure
Risk of Not Executing: The possibility that backups (regular and ad hoc) may not work carries the risk of losing data in a
restore situation and can impede the ability to perform a disaster recovery. Restore times may increase due to not having a
prepared functionality but trying to solve problems in an unpredictable manner in a task force mode.
Parties Involved: A system designer is needed to plan and implement the possible failure events for the test execution
based on ATAM use cases or FMEA. The failure and result analysis can require additional groups of people from database
administration and infrastructure management, e.g. a technical staff member checking the uninterruptible power supply (UPS).
Processes and documentation have to be in place in order to handle the event.
Time: Failover testing can be launched when the test environment is set up and working in standard mode. To prevent
interruption of normal test activities, the test schedule has to consider a period of exclusive usage.
Input: The test scenarios are derived from architecture overviews, as well as experts' experience and methods. Handbooks
and manuals for failure handling also need to be available.
Test Approach: The test case specification has to describe the measures taken to trigger the failure event. It is not
necessary to execute events exactly as they happen in the real world since it can be sufficient to simulate them with technical
equipment. For instance:
Failure: Lost Network Connection
Activity: Remove Network cable
Failure: File system or hard disk failure
Activity: Remove hot swappable hard disk within a RAID system
Output: The failover execution protocols show the time slots needed to bring the system back up and running so that they
can be compared to the SLAs.
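Measuring the time needed to bring the system back up after a triggered failure can be sketched in Python as follows. The class SimulatedCluster is a hypothetical stand-in for a real system whose standby takes over after a few availability checks, and the SLA limit of one second is an assumed value.

```python
import time


def measure_failover(trigger_failure, is_available, timeout_s, poll_s=0.01):
    """Trigger the failure and return the seconds until the service is available again.

    Returns None if the service does not come back within timeout_s.
    """
    trigger_failure()
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if is_available():
            return time.monotonic() - start
        time.sleep(poll_s)
    return None


class SimulatedCluster:
    """Hypothetical system: the standby takes over after a number of polls."""

    def __init__(self, polls_until_takeover):
        self.failed = False
        self.polls_left = polls_until_takeover

    def remove_network_cable(self):
        self.failed = True

    def is_available(self):
        if not self.failed:
            return True
        self.polls_left -= 1
        return self.polls_left <= 0


SLA_MAX_FAILOVER_S = 1.0  # assumed SLA value

cluster = SimulatedCluster(polls_until_takeover=3)
duration = measure_failover(cluster.remove_network_cable, cluster.is_available, timeout_s=5.0)
within_sla = duration is not None and duration <= SLA_MAX_FAILOVER_S
print("failover within SLA:", within_sla)
```

In a real test, `trigger_failure` would correspond to an activity such as removing the network cable, and `is_available` to a health check of the service.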
OAT Test Environment: Even though it is just a test, it can happen that the OAT test environment cannot be used in standard
mode after the failure event. A failover test may have an impact on further test activities and schedules. Consequently, this
kind of test is executed in dedicated temporary environments or functional test environments and carries the risk of time
delay.
Risk of Not Executing: Failover may not work or take longer than expected, thus leading to service outages.
Objective: For OAT, this discipline focuses on the documentation, the described processes, and the resources involved. In the
phase needed to bring back the failed application, a downtime and reduced performance as well as a minimum of data and
business loss is to be expected. The degree of recoverability is defined by the achievable minimum of interrupted business
functions. Recovery testing is to ensure predictable and manageable recovery.
Parties Involved: In the event of a recovery request, different groups of administrators placed in operation teams are
involved: this includes the database, infrastructure, file system, and application software. These groups of people are
stakeholders who are interested in the quality of the recovery process but they are also key players during test execution.
Time: Since the recovery test is much more of a static test approach to the recovery plans and procedures, it can be applied
early on in the development process. Its findings can be integrated in the further system design and process development.
Dynamic test activities are executed parallel to functional testing.
Input: The recovery test can be performed with respect to processes, policies, and procedures. All documentation describing
the processes must be in place, and also recovery functionality has to be available.
Test Approach: Performing a test in a real OAT system with full interruption may be costly. Therefore, simulation testing can
be an alternative. ATAM or FMEA are static ways to determine the feasibility of the recovery process.
The following questions have to be considered during the tests:
What are the measures to restart the system? How long is the estimated downtime?
What are the measures to rebuild, re-install or restore?
What are the minimum performance and data loss volumes you can accept after recovery?
Output: The downtime, i.e. the time from the moment when the service is no longer available to the moment when
the service is back up and running, is estimated. As a result of recovery testing, missing components, defects in existing
components, and corrections to be made in handbooks and manuals are identified. Also defects with respect to requirements
are identified.
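The downtime and the data loss observed in a recovery test can be compared with the accepted limits as in the following Python sketch. The timestamps and limits are hypothetical, and the data loss window is approximated as the time between the last usable backup and the start of the outage, which is an assumption of this sketch.

```python
from datetime import datetime, timedelta


def recovery_result(outage_start, service_restored, last_good_backup,
                    max_downtime, max_data_loss):
    """Compare measured downtime and data loss window with the accepted limits."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "downtime_ok": downtime <= max_downtime,
        "data_loss_ok": data_loss_window <= max_data_loss,
    }


# Hypothetical timestamps from a recovery test run and assumed accepted limits.
result = recovery_result(
    outage_start=datetime(2012, 12, 5, 10, 0),
    service_restored=datetime(2012, 12, 5, 11, 30),
    last_good_backup=datetime(2012, 12, 5, 9, 45),
    max_downtime=timedelta(hours=2),
    max_data_loss=timedelta(minutes=10),
)
print(result["downtime"], result["downtime_ok"])          # 1:30:00 True
print(result["data_loss_window"], result["data_loss_ok"])  # 0:15:00 False
```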
OAT Test Environment: If recovery functionality is available, testing can in principle be executed parallel to early functional
testing. However, since the tests will involve planned downtimes or phases of exclusive usage of environments, additional test
environments will be set up temporarily in order to avoid time delays in the functional testing area.
Risk of Not Executing: The recovery process may not work or may take longer than expected, leading to service outages, and
could fall into task force mode with unpredictable downtimes.
6. Acceptance Process
The objective of OAT is to achieve the commitment of a handover of the software to the application owner and the operation
team. Acceptance is based on successful tests and known but manageable defects throughout the entire life cycle of the
[Figure: Quality gates OAT-QG1 to OAT-QG4 along the phases Analysis, Design, Implementation, Component Test, Integration Test, and Operation, with Installation activities; legend: high-intensity activities, low-intensity activities.]
8. Bibliographical References
APM Group Limited, HM Government and TSO. ITIL - Information Technology Infrastructure Library. High Wycombe,
Buckinghamshire. [Online] 2012. http://www.itil-officialsite.com.
International Software Testing Qualifications Board. ISTQB Glossary. [Online] March 2010.
http://www.istqb.org/downloads/glossary.html.
ISO/IEC JTC 1/SC 7 Software and Systems Engineering. ISO/IEC 25000: Software Engineering - Software Product Quality
Requirements and Evaluation (SQuaRE) - Guide to SQuaRE. [Online] 17/12/2010.
http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=35683.
Software Engineering Institute. Architecture Tradeoff Analysis Method. Carnegie Mellon University, Pittsburgh, PA. [Cited: 2012.]
http://www.sei.cmu.edu/architecture/tools/evaluate/atam.cfm.
Valentino-DeVries, J. More Predictions on the Huge Growth of Cloud Computing. The Wall Street Journal Blogs. [Online]
21/04/2011. http://blogs.wsj.com/digits/2011/04/21/more-predictions-on-the-huge-growth-of-cloud-computing/.
Wikipedia contributors. FMEA - Failure Mode and Effects Analysis. [Online] 2012. http://de.wikipedia.org/wiki/FMEA.
Authors:
Dirk Dach, Senior Consultant
Marc Mevert, Senior Consultant
sqs.com