Data Warehouse Testing: An Exploratory Study
Software Engineering
Thesis no: MSE-2011-65
September 2011
School of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona
Sweden
This thesis is submitted to the School of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Software Engineering. The thesis is equivalent to 30 weeks of full time studies.
Contact Information:
Author(s):
Muhammad Shahan Ali Khan
E-mail: [email protected]
Ahmad ElMadi
E-mail: [email protected]
Industry advisor:
Annika Wadelius
Försäkringskassan IT
Address: Södra Järnvägsgatan 41, 851 93 Sundsvall, Sweden
Phone: +46 101 12 82 72
University advisor:
Dr. Cigdem Gencel
ABSTRACT
Context. The use of data warehouses, a specialized class of information systems, by organizations all over the globe has recently seen a dramatic increase. A Data Warehouse (DW) serves organizations for various important purposes, such as reporting and strategic decision making. Maintaining the quality of such systems is a difficult task, as DWs are much more complex than ordinary operational software applications. Therefore, conventional methods of software testing cannot be applied directly to DW systems.
Objectives. The objectives of this thesis study were to investigate the current state of the art in DW testing, to explore various DW testing tools and techniques and the challenges in DW testing, and to identify improvement opportunities for the DW testing process.
Methods. This study consists of an exploratory and a confirmatory part. In the exploratory part, a Systematic Literature Review (SLR) followed by the Snowball Sampling Technique (SST), a case study at a Swedish government organization and interviews were conducted. For the SLR, a number of article sources were used, including Compendex, Inspec, IEEE Xplore, ACM Digital Library, SpringerLink, ScienceDirect and Scopus. References in selected studies and citation databases were used for performing backward and forward SST, respectively. 44 primary studies were identified as a result of the SLR and SST. For the case study, interviews with 6 practitioners were conducted. The case study was followed by 9 additional interviews with practitioners from different organizations in Sweden and from other countries. The exploratory phase was followed by the confirmatory phase, where the challenges identified during the exploratory phase were validated by conducting 3 more interviews with industry practitioners.
Results. In this study we identified various challenges that are faced by industry practitioners, as well as various tools and testing techniques that are used for testing DW systems. In total, 47 challenges and a number of testing tools and techniques were found. The challenges were classified, and improvement suggestions were made to address these challenges in order to reduce their impact. Only 8 of the challenges were found to be common to the industry and the literature studies.
Conclusions. Most of the identified challenges were related to test data creation and to the need for tools for various purposes of DW testing. The rising trend of DW systems requires a standardized testing approach and tools that can help to save time by automating the testing process. While tools for operational software testing are available commercially as well as from the open source community, there is a lack of such tools for DW testing. A number of challenges are also related to management activities, such as lack of communication and difficulties in estimating the DW testing budget. We also identified the need for a comprehensive framework for testing data warehouse systems and for tools that can help to automate the testing tasks. Moreover, it was found that the impact of management factors on the quality of DW systems should be measured.
Special thanks to the industry interviewees Doug Needham, Edwin van Vliet, Fahim Kundi,
Justin Hay, Mattias Lindahl, Mikael Herrmann, Raheel Javed, Ronald Telson, Wayne
Yaddow and Willie Hamann, for providing us with their invaluable knowledge.
Surely, we owe our deepest gratitude to our families for their continuous and unconditional
support.
We would also like to thank Huan Pang, Edyta Tomalik, Miao Fang and Ajmal Iqbal for
helping us by suggesting improvements for the report and providing ideas for the analysis.
Table of Contents
ABSTRACT
1 INTRODUCTION
1.1 SOFTWARE TESTING
1.2 CHALLENGES
1.3 CHALLENGES IN DW TESTING
1.4 PROBLEM DOMAIN AND PURPOSE OF THE STUDY
1.4.1 Problem domain
1.4.2 Purpose of the study
1.5 AIMS AND OBJECTIVES
1.6 RESEARCH QUESTIONS
3 BACKGROUND
3.1 STRUCTURE OF A DW
3.2 DW DEVELOPMENT LIFECYCLE
5.1 RECOMMENDATIONS
5.2 MAPPING OF CHALLENGES CLASSES WITH RECOMMENDATIONS
6 CONCLUSION
6.1 CONTRIBUTION OF THE STUDY
REFERENCES
APPENDIX
APPENDIX A: RESEARCH DATABASE SPECIFIC QUERIES
APPENDIX B: SLR DATA EXTRACTION FORM
APPENDIX C: INTERVIEW QUESTIONS FOR CASE STUDY
APPENDIX D: INTERVIEW QUESTIONS FOR INDUSTRIAL INTERVIEWS
APPENDIX E: SLR TEMPLATE
Table of Tables
Table 1: Interviewees details from case study
Table 2: Interviewees details from industrial interviews
Table 3: Words for search query creation
Table 4: Studies inclusion and exclusion criteria
Table 5: Result count at each level
Table 6: Selected studies after performing SLR (without SST)
Table 7: Studies selected after applying SST
Table 8: Studies manually selected and the reasons for their selection
Table 9: Testing methods for DW testing process
Table 10: Testing tools for DW
Table 11: Challenges in DW testing from literature
Table 12: Summary of testing techniques and tools used by Utvärdering
Table 13: Challenges faced at FK IT
Table 14: Challenges in DW testing as reported by interviewees
Table 15: Testing techniques and strategies as suggested by interviewees
Table 16: Testing tools or supporting tools as suggested by interviewees
Table 17: Controlled databases for various types of testing
Table 18: Mapping of challenges and recommendations
Table of Figures
Figure 1: Research methodology
Figure 2: Qualitative Data Analysis stages
Figure 3: DW development methodology
Figure 4: Kitchenham's SLR approach
Figure 5: Biolchini's template
Figure 6: Selected SLR Process
Figure 7: Number of studies found per year
Figure 8: Number of studies per category distribution
Figure 9: Year wise distribution of studies in each category
Figure 10: DW development and testing process at Utvärdering
Figure 11: Identified challenges with respect to research methods
Figure 12: Classification of DW testing challenges
Appendix Figure 13: DW development lifecycle
1 INTRODUCTION
Being a market leader today requires a competitive advantage over rival organizations. Organizations are expanding fast into more market domains and sectors. They are trying to digitally handle all core business processes and relationships with customers, suppliers and employees. A great need exists for systems that could help such organizations with their strategic decision-making process in this competitive and critical scenario. By investing in data warehouses, organizations can better predict market trends and offer services best suited to the needs of their customers [46,57].
Over the years a number of definitions of Data Warehouse (DW) have emerged. Inmon [36]
defines a DW as a subject-oriented and non-volatile database having records over years that
support the management's strategic decisions. Kimball and Caserta [43] define a DW as a
system that cleans, conforms and delivers the data in a dimensional data store. This data can
be accessed via queries and analyzed to support the management’s decision making process
[57].
However, these days a DW is also used for purposes other than the decision making process. DWs are commonly used in customer relationship management systems, for reporting, for operational purposes, etc. [57]. Thus, the DW has recently been defined as "a system that retrieves and consolidates data periodically from the source systems into a dimensional or normalized data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system." [57].
One of the main goals of a DW is to fulfill the users' requirements to support the strategic decision-making process and provide meaningful information [12,28]. However, developing a DW that achieves this and various other goals is not an easy task. DW defects cost approximately USD 600 billion every year in the United States [67]. It has been stated that the failure rate for DW projects is around 50% [50]. It is, therefore, evident that, as for all other software projects [58], quality assurance activities are a must for DW projects. Unfortunately, DW projects are quite different from other software projects and applications and are difficult to test. For example, DW testing requires a huge amount of test data in comparison with the testing of non-DW systems or generic software. DW systems are aimed at supporting virtually unlimited views of data. This leads to unlimited testing scenarios and increases the difficulty of delivering a DW with a low number of underlying defects [28]. The differences between DW testing and non-DW systems testing make the testing of DW systems a challenging task.
Before proceeding to describe the challenges of DW testing, we first state the meanings of
the terms “software testing” and “challenge” as applied in this study.
deliverables can include requirements specifications documents, design documents, the soft-
ware under test, etc.
1.2 Challenges
The definitions of Challenge and Problem at Cambridge Dictionaries Online are:
Problem: "A situation, person or thing that needs attention and needs to be dealt with or solved" [86]
Challenge: "(The situation of being faced with) something needing great mental or physical effort in order to be done successfully and which therefore tests a person's ability" [87]
Derived from these two close definitions and from our understanding of the domain, we state the definition of challenges in testing as follows:
"Challenges in testing are the difficulties and obstacles that testers may face during their work, which lead to the need for more effort and attention to be dealt with, and which therefore test an individual's or group's skills and abilities."
These difficulties can come in any form, for example the need for more effort than expected to perform a certain task, a decrease in performance, or any other unwelcome circumstance that requires higher skills and extra effort to overcome the difficulties and complete the task successfully.
process will contribute to designing a comprehensive and practical testing approach in future research. Secondly, the challenges should be categorized. Finally, testing techniques or improvement suggestions that address the challenges should be proposed.
However, to the best of our knowledge, no study has been conducted in the past that aims to systematically consolidate the research in DW testing and the challenges experienced in DW testing. Therefore, there was a need to conduct such a study.
2 RESEARCH METHODOLOGY
this phase. Two interviews were based on email questionnaires, while the third one was Skype® based.
The questions were similar to those of the first phase, but in the second phase we first stated our findings during the discussion and then proceeded with the questions. This was done to let the interviewees confirm or disconfirm the findings, as well as to allow them to explain whether they had encountered something similar or some other type of testing technique, tool, challenge etc. At the end of the confirmatory phase, we created a classification of the challenges.
In order to answer RQ2, the classes of challenges were addressed by suggestions found in the literature or proposed by the interviewees. This classification is discussed in Section 4.4.2.
Table 1: Interviewees details from case study
Interviewee Name Designation
Doug Needham DW consultant for data management at Sunrise Senior Living, USA
Edwin A. van Vliet Manager test data team at ABN Amro, Netherlands
Justin Hay Principal DW consultant (IBM) and Owner of ZAMA Enterprise Solutions, Canada
Kalpesh Shah Independent DW consultant
Ronald Telson Responsible for methods in disciplines of Test and Quality Assurance at Bolagsverket, Sweden
Wayne Yaddow Senior Data Warehouse, ETL tester at AIG Advisor Group, USA
Willie Hamann Group chairman & founder at Data Base International Super Internet & World Wide cloud computing venture, Australia
grounded theory, the context of the information can be easily captured by using the QDA method.
3. Application of noticing in interviews with other organizations
During the interviews, notes were taken and audio recordings were made where authorized. The interviews were transcribed and reviewed to highlight important information and identify codes.
3. Resources: Entities that are required by the process activity excluding any products
or artifacts that are produced during the lifecycle of the project.
Finally, the suggestions were provided to address the classes of challenges in order to lower
their impact and improve the DW testing process.
3 BACKGROUND
3.1 Structure of a DW
It is beyond the scope of this document to describe the structure of a DW in detail. However,
to understand how testing can be performed in DW projects and the challenges faced during
testing, we briefly describe the DW structure.
DW systems consist of different components; however, some core components are shared by most DW systems. The first component is the data sources. A DW receives input from different data sources, for instance from Point-Of-Sale (POS) systems, Automated Teller Machines (ATMs) in banks, checkout terminals etc. The second component is the data staging area [44,45,46]. The data is extracted from the data sources and placed in the staging area. Here the data is treated with different transformations and cleansed of any anomalies [45]. After this transformation, the data is placed in the third component, known as the storage area, which is usually a Relational Database Management System (RDBMS) [57]. The data in the storage area can be in normalized form, dimensional form or both [57]. The dimensional schema can be represented in different ways, e.g. a star schema. A dimensional schema can contain a number of tables that quantify a certain business operation, e.g. sales, income, profits etc. Such a table is referred to as a Fact. Each Fact has adjoining tables, called dimensions, that categorize the business operations associated with the Fact [57].
The process of extracting data from the data sources, transforming it and finally loading it into the storage area is referred to as Extract, Transform and Load (ETL). The data saved in the storage area can be viewed through reporting units, which make up the fourth component of a DW. Different On-line Analytical Processing (OLAP) tools assist in generating reports based on the data saved in the storage area [8,26,28,44,45,46,88]. Due to this sequential design, with one component delivering data to the next, each of the components must perform according to the requirements. Otherwise, a problem in one component can swiftly flow into the subsequent components and can finally lead to the display of wrongly analyzed data in the reporting unit [57].
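To make the ETL flow between these components concrete, the following minimal sketch (in Python, using the standard sqlite3 module as a stand-in for real source and target systems; all table and column names are hypothetical and not taken from any system studied in this thesis) extracts rows from a source, cleanses and transforms them in a staging step, and loads them into a fact table.

import sqlite3

def extract(source):
    # Extract raw sales records from the (hypothetical) source system.
    return source.execute("SELECT sale_id, store, amount, sale_date FROM pos_sales").fetchall()

def transform(rows):
    # Staging step: cleanse anomalies and apply simple transformation rules.
    cleaned = []
    for sale_id, store, amount, sale_date in rows:
        if amount is None or amount < 0:          # reject invalid records
            continue
        cleaned.append((sale_id, store.strip().upper(), round(amount, 2), sale_date))
    return cleaned

def load(dw, rows):
    # Load the transformed records into the warehouse fact table.
    dw.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", rows)
    dw.commit()

if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE pos_sales (sale_id INTEGER, store TEXT, amount REAL, sale_date TEXT)")
    source.executemany("INSERT INTO pos_sales VALUES (?, ?, ?, ?)",
                       [(1, " lund ", 99.9, "2011-01-05"), (2, "malmo", -5.0, "2011-01-05")])
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE fact_sales (sale_id INTEGER, store TEXT, amount REAL, sale_date TEXT)")
    load(dw, transform(extract(source)))
    print(dw.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # only the valid row is loaded

Real ETL implementations add surrogate key lookups against the dimension tables and far richer cleansing rules, but the sequential dependency illustrated here is exactly why a defect in one step propagates to all later components.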
Figure 3: DW development methodology
4 CURRENT STATE OF DW TESTING
In this chapter we discuss the currently practiced DW testing techniques and the challenges encountered during testing, on the basis of the Systematic Literature Review (SLR), the Snowball Sampling Technique (SST), the industry case study and the interviews conducted in various organizations. At the end of this chapter, the classification of challenges is presented and the various validity threats related to this study are discussed.
Figure 4: Kitchenham's SLR approach [47]
1. Planning: In this phase, the research objectives are defined, the research questions are presented, the selection criteria are decided, the research databases for data retrieval are identified and the review execution procedure is designed [47].
2. Conducting the review: The protocol and methods designed in the planning stage are executed in this phase. The selection criteria are applied during this stage. This stage also includes data synthesis [47].
3. Reporting the review: In this phase, the results of the review are presented to the interested parties, such as academic journals and conferences [47].
4. Biolchini et al.'s approach is iterative. Kitchenham’s approach is sequential.
In [10], Biolchini et al. provide a template for conducting an SLR. This template is based on different methods:
• Systematic review protocols developed in the medical area.
• The guidelines proposed by Kitchenham [48].
• The protocol example provided by Mendes and Kitchenham [52].
This template can be found in [10].
• Question
The SLR was conducted on the basis of RQ1. Note that the italic font words are the
base keywords which were used in the search string construction in the conducting
phase.
RQ1. What is the current state of the art in DW testing?
RQ1.1. Which techniques and tools are used for DW testing?
RQ1.2. What are the challenges of DW testing faced in industry and the
challenges that are identified by the literature?
• Keywords and synonyms:
Search keywords extracted from the research questions above (the italic words) are listed in Table 3.
Table 3: Words for search query creation
• Intervention
The information to be observed and retrieved is stated here.
The methods for DW testing, the difficulties faced in practicing them and the tools
which can be used for DW testing were observed.
• Population
Peer-reviewed studies, doctoral theses and books were used for extracting data. Grey literature [89] found on the basis of backward SST was also used. More information can be obtained from the inclusion and exclusion criteria defined later in this section.
4.1.2.1.2.3 Sources identification
The selected sources were exposed to initial review execution.
• Sources list
The following e-databases, search engines and conferences were selected as information sources, based on the guidelines suggested by Kitchenham [47] and Brereton et al. [13]:
o IEEE Xplore
o ACM Digital Library
o Citeseer Library
o Engineering Village (Inspec – Compendex)
o SpringerLink
o ScienceDirect
o Scopus
o CECIIS (Central European Conference on Information and Intelligent Systems, www.ceciis.foi.hr/cms/)
We included one conference, CECIIS, as we found relevant articles which were not
covered by any of the electronic databases stated above.
Apart from the stated databases, we used the following databases only for forward
SST. These databases have the ability to find studies that cite a certain study.
o ACM Digital Library
o Citeseer Library
o SpringerLink
o ScienceDirect
o ISI Web of Knowledge
o Google Scholar
• Sources search methods
By using the identified keywords and Boolean operators (AND / OR), a search string was created. This search string was run on metadata search. Where metadata search was not available, the Abstract, Title and Keywords fields of the electronic research databases were searched.
• Search string
The following general search string was used for execution in all selected electronic
databases.
(
  "data warehouse" OR "data warehouses" OR "data mart" OR "data marts" OR "Business Intelligence" OR "ETL" OR
  (
    ("extract" OR "extraction") AND
    ("transform" OR "transformation" OR "transforming") AND
    ("loading" OR "load")
  )
  OR "large database" OR "large databases" OR "OLAP" OR
  (
    "online" AND "analytical" AND "processing"
  )
)
AND
(
  "testing" OR "test" OR "quality assurance" OR "quality control" OR "validation" OR "verification"
)
AND
(
  (
    "challenge" OR "challenges" OR "problem" OR "problems" OR "difficulties" OR "difficulty" OR "issues" OR "hardships" OR "hardship"
  ) OR (
    "tool" OR "tools" OR "approach" OR "approaches" OR "technique" OR "techniques" OR "strategy" OR "strategies" OR "process" OR "processes" OR "framework" OR "frameworks" OR "automatic" OR "automation" OR "automate" OR "automated" OR "automating" OR method*
  )
)
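For reproducibility, a search string of this shape can be assembled mechanically from the keyword groups of Table 3. The following short sketch is our own helper, shown with a simplified subset of the keyword groups above, and only illustrates the construction; it is not part of the review protocol itself.

def or_group(terms):
    # Join alternative keywords with OR, quoting each phrase.
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

domain = or_group(["data warehouse", "data warehouses", "data mart", "data marts",
                   "Business Intelligence", "ETL", "OLAP"])
testing = or_group(["testing", "test", "quality assurance", "quality control",
                    "validation", "verification"])
focus = or_group(["challenge", "challenges", "problem", "problems",
                  "tool", "tools", "technique", "techniques"])

search_string = f"{domain} AND {testing} AND {focus}"
print(search_string)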
Table 4: Studies inclusion and exclusion criteria
Studies Inclusion Criteria
Filtering Level Inclusion criteria
Abstract / Title filtering • The language of the articles should be English.
• The articles should be available in full text.
• The abstract or title should match the study domain.
Introduction / Conclusion filtering • The introduction or conclusion should match the study domain.
Full-text filtering • The complete article should match the study domain.
Study quality criteria to be applied at all levels • The articles should be peer-reviewed published studies or doctoral theses.
• The peer-reviewed studies should report a literature review, systematic review, case study, experiment or experience report, survey or comparative study.
• Grey literature and books should be selected only after applying backward SST, that is, as a result of finding relevant references in the selected articles. This was done to reduce the impact of publication bias. We assume that researchers who have used grey literature or books have done so without showing any bias towards any DW vendor or organization.
• Books and grey literature suggested by interviewed practitioners, industry experts etc. should be selected as well.
• The studies should be related to large databases, data warehouses, data marts, business intelligence, information systems, knowledge systems and decision support systems.
o The studies should be related to factors affecting the implementation
and success factors of the stated systems
o The studies should be related to process improvement for development of stated systems
o The studies should be related to issues related to data quality and information retrieved
from the stated systems
o The studies should be related to quality attributes of the stated systems
o The studies should be related to measuring the quality attributes of the stated systems
Studies Exclusion Criteria
All studies that do not match the inclusion criteria will be excluded.
The criteria were applied at three different levels: abstract / title, introduction / conclusion and full-text review. The study quality criteria were applied at all levels. Figure 6 describes the SLR execution process.
Figure 6: Selected SLR Process
4.1.2.2 Conducting phase
Scopus 1847 43 16 13
Citeseerx 290 24 11 4
CECIIS 2 2 2 2
(2) We executed the same query on title, abstract and keywords, separately in the database. See Appendix A.
Table 6: Selected studies after performing SLR (without SST)
Study name Authors
A comprehensive approach to data warehouse testing [28] M. Golfarelli and S. Rizzi
A family of experiments to validate measures for UML activity diagrams of ETL L. Muñoz, J.-N. Mazón, and
processes in data warehouses [54]
J. Trujillo
A set of Quality Indicators and their corresponding Metrics for conceptual models G. Berenguer, R. Romero,
of data warehouses [7]
J. Trujillo, M. Serrano, M. Piattini
Application of clustering and association methods in data cleaning [19] L. Ciszak
Architecture and quality in data warehouses: An extended repository approach [38] M. Jarke
BEDAWA-A Tool for Generating Sample Data for Data Warehouses[35] T. Huynh, B. Nguyen, J. Schiefer, and
A. Tjoa
ETLDiff: a semi-automatic framework for regression test of ETL software [73] C. Thomsen and T.B. Pedersen
How good is that data in the warehouse? [2] J.M. Artz
Logic programming for data warehouse conceptual schema validation [22] C. Dell’Aquila, F. Di Tria,
E. Lefons, and F. Tangorra
Measures for ETL processes models in data warehouses [55] L. Muñoz, J.N. Mazón, and
J. Trujillo,
Metrics for data warehouse conceptual models understandability [63] M. Serrano, J. Trujillo, C. Calero, and
M. Piattini
Predicting multiple metrics for queries: Better decisions enabled by machine A. Ganapathi, H. Kuno, U. Dayal,
learning [27]
J.L. Wiener, A. Fox, M. Jordan, and D.
Patterson
Simple and realistic data generation [34] K. Houkjær, K. Torp, and
R. Wind,
Synthetic data generation capabilities for testing data mining tools [39] D. Jeske, P. Lin, C. Rendon,
R. Xiao, and B. Samadi
Testing a Data warehouse - An Industrial Challenge [68] H.M. Sneed
Testing the Performance of an SSAS Cube Using VSTS [4] X. Bai
The Proposal of Data Warehouse Testing Activities [71] P. Tanuška, O. Moravcík,
P. Važan, and F. Miksa
The Proposal of the Essential Strategies of DataWarehouse Testing [70] P. Tanugka, O. Moravcik,
P. Vazan, and F. Miksa
Towards data warehouse quality metrics [15] C. Calero, M. Piattini, C. Pascual, and
A.S. Manuel
After performing the SST we found the studies shown in Table 7. These studies included grey literature and peer-reviewed studies. It is important to note that the high number of SST findings is due to the inclusion of non-peer-reviewed studies. SST was used, firstly, to reduce the chance of missing relevant studies when using Biolchini's template; secondly, to gather non-peer-reviewed studies as well; and thirdly, because we found many articles that do not fall under the topic of DW testing but that researchers have used in their studies to explain different concepts of DW testing. We have included those articles which other authors found important for DW testing.
Table 7: Studies selected after applying SST
Study name Authors
3LSPG: Forensic Tool Evaluation by Three Layer Stochastic Process-Based Y. Yannikos, F. Franke, C. Winter, and
Generation of Data [77] M. Schneider
A framework for testing database applications [16] D. Chays, S. Dan, P.G. Frankl,
F.I. Vokolos, and E.J. Weber
A parallel general-purpose synthetic data generator [33] J.E. Hoag and C.W. Thompson
A safe regression test selection technique for database-driven applications [75] D. Willmor and S.M. Embury
Data Warehouse / ETL Testing: Best Practices [53] A. Mookerjea and P. Malisetty
Data Warehouse Testing [12] K. Brahmkshatriya
Designing a Customized Test Data Generator for Effective Testing of a Large J. Singh and K. Singh
Database [66]
Detecting Redundancy in Biological Databases? An Efficient Approach [17] S. Chellamuthu and
D.M. Punithavalli
Empirical Validation of Metrics for Conceptual Models of Data Warehouses [62] M. Serrano, C. Calero, J. Trujillo, S.
Luján-Mora, and M. Piattini
Generating Synthetic Data to Match Data Mining Patterns [23] J. Eno and C.W. Thompson
Generation of synthetic data sets for evaluating the accuracy of knowledge discov- D.R. Jeske, B. Samadi, P.J. Lin,
ery systems [40]
L. Ye, S. Cox, R. Xiao,
T. Younglove, M. Ly, D. Holt, and R.
Rich
How to Thoroughly Test a Data Warehouse [20] R. Cooper and S. Arbuckle,
Regression testing of database applications [30] R.A. Haraty, N. Mansour, and
B. Daou
Statistically Analyzing the Impact of Automated ETL Testing on the Data Quality J. Singh and K. Singh
of a Data Warehouse [67]
Strategies for Testing Data Warehouse Applications [72] J. Theobald
SVTe: a tool to validate database schemas giving explanations [24] C. Farré, G. Rull, E. Teniente, and T.
Urpi,
Testing the data warehouse [29] D. Haertzen
Testing the implementation of business rules using intensional database tests [76] D. Willmor and S.M. Embury
Building the data warehouse [37] W.H. Inmon
The data warehouse ETL toolkit [44] R. Kimball and J. Caserta
The data warehouse lifecycle toolkit [46] R. Kimball, M. Ross,
W. Thorthwaite, B. Becker,
J. Mundy
The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing R. Kimball, M. Ross
and Business Intelligence [45]
Apart from performing the SST, we manually selected a few books and grey literature sources. We present the reasons for selecting these studies in Table 8.
Table 8: Studies manually selected and the reasons for their selection
Study name Authors or source Type of study Reason for selection
Blog by Justin Hay [31] J. Hay Grey literature This link was suggested by industry expert
Data Warehouse Testing [51] M.P. Mathen Grey literature This link was suggested by industry expert
Building a data warehouse: V. Rainardi Book We have extensive experience of using
with examples in SQL Server SQL server. Therefore, in order to remove
[57] any misunderstandings related to DW we
selected this book.
• Software metrics or measures
This category holds the studies that focus on using or proposing software measures
and metrics that can help the reviewer or tester.
• Others
This category holds blogs, books, or studies which focus on more than one of the
above categories.
We will now present the extracted information.
from the requirements gathering phase [28]. A few metrics for measuring the quality of ETL processes on the basis of UML activity diagrams have been suggested by [54]. The authors of [24,76] propose automated schema testing tools for verifying the implementation of business rules and the correctness and completeness of the database schema.
3. Implementation phase
In the implementation phase, the ETL procedures are implemented in the system and the front-end reports are developed [44,57]. OLAP cubes are also created in this stage. The authors of [4] provide a method for testing the OLAP cube. The performance of the queries written for fetching records from the database can be predicted and tested by using the metrics provided by [27]. [55] and [27] have presented software metrics for measuring ETL operation performance and for predicting the performance of report queries, respectively.
Unit testing of ETL scripts and reports is performed at this stage [28,44,57]. The impact of automated testing of ETL scripts has been shown to be highly positive by the authors of [67].
One study also describes testing the naming conventions used across the development cycle [31]. For example, if an attribute "deliverydate" is used in one table to refer to the date of delivery of a product to a customer, then it should be used in the same way across the whole DW.
4. Testing and deployment phase
In the testing phase, dynamic testing is performed. The complete DW system undergoes several types of tests in different kinds of usage environments.
The testing procedure consists of a series of different testing methods targeting specific aspects of a DW.
Table 9 describes the different testing methods as suggested by various researchers.
Table 9: Testing methods for DW testing process
Testing method Purpose and description
Data completeness Data completeness ensures that all expected data is loaded into the relevant target tables from the
[12,19,20,28,29,31,51,53,66,67,72] source tables. Some of the strategies to reach this goal are (a sketch of such checks in code is given after this table):
• Comparing the record counts between source tables and target tables, as well as rejected records tables or error tables.
• Comparing unique values of key fields between source data and loaded data in the DW.
• Checking the ranges and value distributions in the source and target data sets using data profiling tools.
• Performing field-to-field matching of source tables and target tables.
• Validating that no value is truncated. This is done by populating the contents of each field to the maximum length/value in the source and verifying it in the target.
• Performing boundary value analysis to verify each field. Boundary value analysis uses the maximum and minimum values, as well as disallowed values.
• Performing all of the above repeatedly so that data leaks can be identified. Data leaks mean there should be no missing rows or values over the weeks of data.
Test for relevancy [17] The gathered data should be relevant to the object and operation that is expected to be
performed on it. For example, if a report requires the aggregation of sales for a certain
store over a certain number of weeks, then the gathered data should include all entities
required by this requirement.
Data transformation It ensures that all data from source tables is transformed correctly according to business
[17,20,45,70,71,72] rules and specifications.
Data quality Data quality validation ensures that the ETL process reports, corrects and ignores invalid
[20,29,43,45,53,66,67,70,71,72] data, substitutes default data and rejects data that is not required. Data quality includes
the testing of data transformation rules, data completeness and data relevancy, as
described above.
Ability to perform in high volume All data warehouses are designed to deliver high performance, based upon the previously
data [28,57,66] stated goals, in a high volume data environment. That is, the DW should behave
normally in the presence of terabytes of data.
Recoverability [28,57] The ETL system should be able to recover itself in breakdown conditions. Secondly, no
data loss should take place.
Testing the business require- All business requirements should be met by the system. For example, if the aggregation
ments [76] of sales for a particular product over three months of time is required, the system should
be able to handle this requirement.
Verify Slowly Changing Dimen- Different dimensions undergo changes when their values change over time. For exam-
sions (SCD)[44,46,57] ple a person’s address can change over the period of years. The system should handle
SCD of type 1, 2, 3. For more details refer to [43,44,57]
Verification of performance of The ETL batch loading system should be able to extract and load the records in a timely
ETL batch loading [57,70,71] manner, or as required by the non-functional system requirements.
Test for daily, weekly or monthly The ETL batch scripts designed for daily, weekly or monthly runs should execute
batch runs [57,70,71] at the expected dates and periods.
Security testing Users should be able to access the data, reports, functionality, for which they are author-
ized to. If role-based security is used, then users belonging to certain roles should be
[28,57,66] able to view only the data that is available for their role.
OLAP Cube testing [4] OLAP cubes are data structures that support fast data analysis. The cube is a three di-
mensional query structure. It is necessary to verify that the cube query works as
required.
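As a concrete illustration of the data completeness checks listed in Table 9 (record count comparison and field-to-field key matching between source and target), the following minimal sketch uses Python with the standard sqlite3 module; the table and key names passed to the helpers are placeholders, and both tables are assumed to be reachable from one connection (for example in a test copy of the databases).

import sqlite3

def count_match(conn, source_table, target_table):
    # Compare record counts between a source table and its target table.
    src = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    return src == tgt, src, tgt

def missing_keys(conn, source_table, target_table, key):
    # Field-to-field check: key values present in the source but missing in the target.
    rows = conn.execute(
        f"SELECT s.{key} FROM {source_table} s "
        f"LEFT JOIN {target_table} t ON s.{key} = t.{key} "
        f"WHERE t.{key} IS NULL"
    ).fetchall()
    return [r[0] for r in rows]

Run repeatedly after each load, such checks also help expose the data leaks mentioned in Table 9, that is, rows or values that go missing between loads.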
The testing methods presented in Table 9 are conducted in each of the following different types of testing [72]. It should be noted here that [11] defines these types of testing as the stages of testing.
• Unit testing
Unit testing is normally performed by a developer. The purpose of unit testing
is to test the smallest piece of testable code to verify that its behavior is according to the design specifications. The code is isolated from the rest of the modules and units [11]. In the DW scenario, unit testing is performed to check the ETL code. This method also aims to verify data completeness, data transformation and data quality [72].
• Integration testing
Integration testing is an end-to-end test that validates if all components and
modules, in combined form, behave in an expected manner. The purpose of in-
tegration testing is to verify that the components behave according to the design when working with each other [11]. In a DW, integration testing deals with the verification of ETL processes with upstream applications (for example, OLAP applications) and downstream applications [72].
• System testing
System testing is similar to integration testing; the difference is that system testing is performed in order to verify the non-functional requirements, such as performance, in a production-like environment [57].
In the case of DW, all processes, including the background processes, report
processes, data population processes etc. are made to run, and then end-to-end
testing is performed. The purpose of system testing is to validate if the system
will run according to requirements in live environment [57].
• Acceptance testing
Acceptance testing is performed to verify if the business requirements, as stat-
ed by the users or clients, are properly implemented. Non-functional
requirements, especially user-friendliness of the system or the subject under
test, are also verified in this testing [11].
Acceptance testing is performed in a production-like environment to depict the
live scenario, that is, while all processes are running. The user friendliness of
the developed applications is also assessed at this level. With respect to DW
testing, implementation of business requirements, user-friendliness of the re-
ports etc. are verified [57,72].
• Regression testing
Regression testing revalidates if the system works according to user require-
ments after code changes in different underlying modules. These changes
could be due to change requests or due to fixing of certain code defects [11].
With respect to DW testing, if code is modified, such as ETL script code, report code etc., regression testing is performed [57].
5. Support and maintenance
In the support and maintenance phase, the whole system, especially the ETL operation, is monitored [57]. If the users or the monitoring team report some defects, the defects are addressed, fixed and patches are applied to the system [28]. This leads to regression testing, which revalidates that the system works according to the user requirements after the changes. The studies [30,75] have given approaches that enable the tester to select only the most relevant test cases for regression testing. An automated regression testing tool, ETLDiff, has been proposed by the authors of [73].
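The idea behind automated regression testing support such as ETLDiff [73] can be sketched as follows. This is not the actual tool, only a minimal illustration (in Python, standard library only) of storing the result of a reference ETL run as a baseline and comparing a later run against it.

import csv
import sqlite3

def save_baseline(conn, query, path):
    # Run the reference query once and store its result set as a baseline file.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(conn.execute(query).fetchall())

def regression_check(conn, query, path):
    # Re-run the query after a change and report rows that differ from the baseline.
    with open(path, newline="") as f:
        baseline = {tuple(row) for row in csv.reader(f)}
    current = {tuple(str(v) for v in row) for row in conn.execute(query).fetchall()}
    return current - baseline, baseline - current  # unexpected rows, missing rows

In practice the compared queries would cover the warehouse tables affected by the change, and test case selection approaches such as those in [30,75] would limit how many such comparisons need to be run.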
Table 10: Testing tools for DW
Tool Name Tool purpose Tool class
(Commercial, Opensource or proposed
by researchers)
DSG Prototype [66] Test data generation Proposed by researchers
3LSPG [77] Test data generation Proposed by researchers
Table 11: Challenges in DW testing from literature
Inability to use real data for test- Due to privacy laws in certain countries, it is not possible to use real data for
ing purposes [23,39] testing purposes.
Lack of information on what to In security-oriented data mining applications (e.g. terrorist threat detection), there
look for (3) [39] is normally a lack of sufficient specific knowledge on what to look for.
Stakeholders and designers gap The requirements documentation poses interpretation problems for the testers.
[68] Testers can perform testing based on an incorrect understanding of the requirements.
(3) According to one team member, the challenge is specific to security data mining applications. According to the other team member, the challenge is related to the inability of testers to visualize what kind of information the DW users would try to look for, thus creating a challenge in testing the unknown.
Lack of steps for normalization [2] There are no strict rules or steps which can help the designer in verifying if the
degree of normalization is good for the design or not.
Implicit data relationships [2] The testers should be able to understand the implicit data relationships between
different entities of the schema. These relationships might be hidden and can only
be understood if the testers finds out about them from the respective developers,
designers or architects etc.
Lack of understanding of the The testers should possess the knowledge to understand the schema, the normal
schema [57] forms and the dimensional forms. This will help them in writing good queries that
are fast, accurate and easy to understand.
Lack of understanding the data The normalization design depends on how good the designer understands the data
relationship [2] relationships, thus, the tester should have the same understanding in order to test
the normalization design
Unavailable testing data [39,77] In many cases, R&D labs developing testing tools are not able to easily obtain
sensitive information due to security, privacy or cost issues.
Duration prediction for queries It is difficult to predict how the queries of the report will perform in the produc-
[27] tion environment
Huge effort for regression testing Due to high number of test cases, regression testing for DW is very difficult to
[73] perform in limited time. High effort is required to perform and understand the
impact analysis for regression testing.
Lack of customizability of test The tools that are able to create the test data, are very specialized, and cannot be
data generation tools [34] reused in different environment.
Limited schema support from test The specialized data generation tools have limited support for complex database
data generation tools [34] schemas, e.g., composite keys and cycles in foreign-key relationships
Lack of huge test data amount DW testing requires a huge volume of sample data to test.
[28,35,51]
Unlimited combination of test The user has the ability to check any kind of views of the stored data. This leads
cases [28,51,68] to the creation of unlimited combination of test cases.
Lack of richness in the real data Real data could be insufficient to cover all the test scenarios
[77]
Test case dependency [16] In order to execute a certain test case, it is possible that it will call for execution
of some other test cases. Thus, even a single test case can lead to the execution of
high number of test cases which requires time and effort.
Lack of meaningful synthetic data Data can be generated but it might miss some database constraints, integrity
creation [16] constraints and business constraints (see the sketch after this table).
Difficult selection of test cases for From the unlimited testing scenarios, it is difficult to select the most important
regression testing [75] and limited test cases for regression testing.
Lack of formal and quantitative There is a lack of formal software metrics for DW design and testing.
metrics [7,62]
Performance issues with queries Poorly designed queries for reports or OLAP may perform very badly in a high
design [66] volume database environment.
Identification of issues with que- It is difficult to identify the cause of a low performing query: is it because of a
ries [57,66] low quality report query or the DW schema design?
Difficulty in assessing data source While strict quality assurance activities can be enforced within the DW develop-
quality [51,57] ing organization, it is not possible to ensure quality assurance activities in the
data source organizations. This may lead to poor quality of data coming from
data sources, which makes data cleaning and transformation efforts very difficult.
Lack of business knowledge [51] Testers may not be aware of the organization-wide business knowledge.
Lack of traceability [51] Transaction-level traceability will be difficult to attain in a Data warehouse. This
makes testing tasks difficult as only source to target matching can be performed
and not backward.
Lack of time for designing, plan- Due to the high number of test cases, unlimited testing scenarios, time is never
ning and creating test cases [76] enough to properly design, plan and execute the test cases.
Lack of fact to dimension ratio in The test data created by the data generation tools might not populate the fact
created test data [66] tables and dimension tables with realistic volume of data.
Lack of maturity of data genera- The test data generation tools are not mature right now. That is, either they don’t
tion tools [33,66] provide the requested functionality or they have high number of defects.
Lack of schema merging ability in In order to simulate the data sources, it is possible that the testers may require
data generation tools [66] different kinds of data sources, for example document files, SQL Server databases,
Oracle databases etc. The tools lack the ability to merge different kinds of
schemas and databases.
Inability of ETL tools to handle The ETL tools may stop the process if a column type mismatch error takes place.
type mismatches [66] This calls for manual intervention by the testers, which can lead to high effort
and time requirements. There is no automatic way of handling such issues.
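Several of the challenges above concern test data: real data cannot be used for privacy reasons, generated data tends to miss integrity and business constraints, and fact-to-dimension ratios are unrealistic. The following sketch only hints at how controlled synthetic data could be generated so that foreign keys, not-null rules and value ranges are respected; the toy star schema, the table names and the value ranges are our own assumptions, not a technique proposed by the reviewed studies.

import random
import sqlite3

def generate_test_data(dw, n_customers=50, n_facts=1000):
    # Populate a toy star schema with synthetic but constraint-respecting data.
    dw.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    dw.execute("CREATE TABLE fact_sales (customer_key INTEGER NOT NULL REFERENCES dim_customer, "
               "amount REAL NOT NULL CHECK (amount >= 0))")
    dw.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                   [(k, f"customer_{k}") for k in range(1, n_customers + 1)])
    # Facts reference only existing dimension keys, keeping a controlled fact-to-dimension ratio.
    dw.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                   [(random.randint(1, n_customers), round(random.uniform(0, 500), 2))
                    for _ in range(n_facts)])
    dw.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    generate_test_data(conn)
    print(conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # 1000 synthetic fact rows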
Figure 7: Number of studies found per year (peer reviewed studies, grey literature and books)
Figure 8: Number of studies per category distribution (testing techniques: 16, tools: 11, measures and metrics: 10, others: 7)
Figure 9 shows the distribution of studies for each category over the years.
Figure 9: Year wise distribution of studies in each category
4.2 Case Study
A case study is a strategy of inquiry that enables researchers to study a phenomenon in detail
[21]. For this study we followed the guidelines for conducting a case study provided by Runeson et al. [60], who give five steps for conducting a case study. These steps are as follows:
1. Case study design
2. Preparation for data collection
3. Collecting evidence
4. Analysis of collected data
5. Reporting
4.2.2 Data analysis
The data analysis was performed using the QDA method as suggested by [61]. The details of
the process are described in Section 2.2.
4.2.3.1.1 Specification
This phase has one sub-phase called Order. In this phase, the requirements from the stakeholders (e.g. clients, customers, users etc.) are gathered. The requirements are tested based upon subjective evaluation, that is, based upon the requirement analyst's experience. Reviews are held for verifying the requirements. No metrics or measures are used to verify the requirements. UML use cases along with descriptive requirements are used for requirements documentation. Meetings are held to discuss the requirements with the stakeholders. The meetings include participants from Utvärdering and from the DW development unit. The DW development process is outsourced to a third party organization, referred to as the deliverers.
The deliverers confirm the requirements by presenting detailed requirements specifications according to their understanding, suggested solutions as well as expected cost quotes. Utvärdering performs the evaluation of these deliverables. The evaluations are conducted in joint application development sessions. If the deliverables are accepted, the process moves to the implementation phase; otherwise the deliverers present alternative solutions.
4.2.3.1.2 Implementation
The implementation phase comprises two sub-phases: Actualize and Accept.
4.2.3.1.2.1 Actualize
The deliverables of this phase are the implemented requirements that were set in the specification phase. Most of the time the deliverables are ETL batch scripts. Sometimes, the deliverables include schema designs as well.
Unit testing is performed to check that the scripts run without errors. Functional testing is performed to check that the deliverables satisfy the business requirements. Both of these types of testing are conducted by the deliverers. The design is tested based upon the evaluator's experience. No metrics are used. The deliverers sometimes face problems related to test data. The test data is either too small or does not represent the real data scenarios. Utvärdering cannot provide the data to the deliverers as this is against the data privacy regulations of the organization. Sometimes, masked data is sent to the deliverers; however, Utvärdering is looking for other solutions for creating artificial data.
4.2.3.1.2.2 Accept
The accept phase, or the testing phase, is completely performed by Utvärdering. The test cases are designed by testers at Utvärdering. The deliverables, deployed by the deliverers, are tested using module testing, integration testing, system testing, acceptance testing, stress testing, regression testing, report testing, and statistical testing (4).
For the different testing methods, different sizes of databases are used. Normally these sizes are in gigabytes. For acceptance testing, a clone of the production database is used. Data masking or data subsetting techniques are used to generate the clone. But in such a case there is a possibility of missing important test scenarios, as the data transferred to the test database could be of any scenario; for example, it could miss not-null cases. Thus, there is a need for synthetic data generation tools for creating controlled testing databases.
Although rigorous testing is performed using detailed test cases, there is a lack of proper tools to help improve the quality of the system. When defects are detected, the testers depend on email communication to report them, rather than on mature defect management tools.
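The kind of data masking used when a production clone is prepared for testing can be illustrated with a minimal, hedged sketch; the salted-hash scheme, the helper names and the use of sqlite3 are our own assumptions for illustration, not the organization's actual masking routine.

import hashlib
import sqlite3

SALT = "replace-with-a-secret-salt"  # assumption: deterministic, salted pseudonymisation

def mask(value):
    # Replace a sensitive value with a repeatable pseudonym so joins between tables still work.
    return hashlib.sha256((SALT + str(value)).encode()).hexdigest()[:12]

def mask_column(conn, table, column):
    # Overwrite one sensitive column of a cloned test database with pseudonyms.
    rows = conn.execute(f"SELECT rowid, {column} FROM {table}").fetchall()
    conn.executemany(f"UPDATE {table} SET {column} = ? WHERE rowid = ?",
                     [(mask(value), rowid) for rowid, value in rows])
    conn.commit()

Deterministic masking preserves referential consistency across tables, but, as noted above, a masked clone can still miss important test scenarios, which is why controlled synthetic test databases are also needed.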
4.2.3.1.3 Production
This phase has only one sub-phase, called Operation. The production phase deals with loading the solution and getting it ready for the live environment. Smoke testing (5) is performed to verify that the release has been deployed properly. The solution, at the end, is finally made live.
(4) In order to understand these testing types, we refer the reader to [57]. Information on testing using statistical functions can be found at [90].
(5) Information on smoke testing is available at [84].
4.2.3.1.4 Evaluation
In the evaluation phase, the clients or customers of Utvärdering use the solution. In case of any defects or change requests, the requirements specification phase of the new release is initiated.
Table 12 summarizes the testing methods, tools etc. that are used at Utvärdering during different phases of DW testing.
Table 12: Summary of testing techniques and tools used by Utvärdering
Phase Test Techniques Tools Used Testing Database Size
4.3.1 Purpose of interviews
or need to be modified. Five interviewees also stated that data source profiling is a mandatory step in the DW development lifecycle, as the design phase depends upon its results. The same interviewees stated that data source profiling is a tedious process which requires time. However, the management is usually unwilling to give more time.
Once the data sources are profiled, documented requirements are revisited for any
changes or modifications. Joint Application Development (JAD) meetings are held
with all the stakeholders. According to one interviewee, sometimes when the data sources are unable to satisfy the requirements, modifications are made in the data sources. As an example, consider a report that requires a certain attribute which is currently not available in any of the data sources. In that case, the data sources are modified to implement the attribute.
After the verification of requirements, conceptual schema is designed. The concep-
tual schema is created in the form of Entity-Relationship (ER) diagrams or fact
diagrams etc.
One interviewee has reported that most of the defects in the DW releases arise due to incorrect requirements gathering. Three interviewees have stated that the users and clients of a DW project are never knowledgeable enough to give complete requirements, which causes the requirements to change. This can result in huge costs, as all the steps in the requirements phase need to be performed again.
2. Design phase
After the requirements phase, the logical schema is designed. The logical schema re-
sults in the creation of relational schema in normalized or dimensional form. Three
interviewees reported that the logical schema is reviewed by the architects and the
designers.
The physical schema involves creation of indexes, integrity constraints etc. In some
cases, different vendors offer their pre-designed schemas as templates. In such cases,
only the mapping of requirements is performed. Sometimes, customization of such
templates is performed to match the requirements.
One interviewee, who has the experience of working in a number of large organiza-
tions, reported that in many organizations database reviews are skipped or are
performed by the same people who designed the databases. In such cases, there is a
possibility that some defects are missed and are detected at the later stages of devel-
opment lifecycle.
One interviewee has reported that in many DW projects no naming conventions are used. This creates a problem in the later stages, as different names for the same attribute may create misunderstandings for the developers and testers. This leads to delays and low productivity of the team.
Once the schema has been designed, source-to-target mapping documents are creat-
ed. All interviewees have reported the need for highly detailed documentation that
states explicitly the source tables, target tables and the enforced transformation rules.
The source-to-target mapping documents go through thorough reviews, as stated by two
interviewees. The review teams include business and system analysts, database de-
signers, DW architects etc. These documents are used by testers and developers in
the later stages of DW development. The documents need to explicitly mention each
and every business transformation and mapping in as much detail as possible.
3. Implementation phase
The developers implement the requirements and create the ETL scripts, reports etc.
The developers are responsible for conducting the unit testing. Therefore, the devel-
opers require test databases. In this regard, while one interviewee complained
that organizations do not provide them with large enough test databases, two other
interviewees stressed the need for having small, carefully created test da-
tabases with rich test scenarios for developers and testers. They stated that large
databases pose disk cost issues and decrease developers' and testers' productivity.
The ETL scripts are tested for completeness, consistency and accuracy of data. The
reports are tested for correct display of records, summation of records, categorization
of records etc. An interviewee reported that sometimes testers are invited to re-
views of ETL scripts and other deliverables so that proper test cases can be
generated for the next phase. Doing so lets the testers get accustomed to the
deliverables.
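To make the unit testing of ETL code more concrete, the following minimal sketch (in Python) shows the kind of test a developer might write for a single, hypothetical row-level transformation rule. The function name, the rule and the sample values are our own illustrative assumptions and are not taken from the studied organizations.

# Minimal sketch of a unit test for one hypothetical ETL transformation rule.
# transform_row(), the rule and the sample row are illustrative assumptions only.

def transform_row(row):
    """Assumed rule: trim and upper-case the name, convert the amount to cents,
    and map a missing country code to 'UNKNOWN'."""
    return {
        "customer_name": row["customer_name"].strip().upper(),
        "amount_cents": int(round(row["amount"] * 100)),
        "country": row.get("country") or "UNKNOWN",
    }

def test_transform_row():
    source = {"customer_name": "  smith ", "amount": 12.5, "country": None}
    target = transform_row(source)
    assert target["customer_name"] == "SMITH"   # trimming and casing rule applied
    assert target["amount_cents"] == 1250       # numeric conversion is accurate
    assert target["country"] == "UNKNOWN"       # null handling rule enforced

if __name__ == "__main__":
    test_transform_row()
    print("unit test passed")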
4. Testing and deployment phase
The DW development moves towards the testing phase once the implementation
phase has ended. An interviewee has stressed that, for testing the system from differ-
ent perspectives, the team of testers should be different from the persons who
designed and developed the products.
One interviewee has reported that the process for planning the test cases starts as ear-
ly as the DW design phase. Test cases are created on the basis of source-to-target
documents and the requirements documents.
All interviewees have stated that the DW testing phase goes through integration testing
and system testing. During these testing phases, the DW is tested for the following:
I. Data Quality
Data quality testing checks if the data has been correctly transferred from the
source database tables or data sources to the target database tables or files.
Various checks are performed on the system in order to ensure data quality.
Examples of such checks include verification of column integrity constraints,
enforcement of business and transformation rules, and no loss of data with re-
spect to data length, data type, data value, null values etc. (A minimal sketch of
such checks is given after this list.)
II. ETL script operations
All ETL scripts must run as expected. The ETL scripts should be executed as
scheduled.
III. Stress testing
The whole DW system should be able to perform according to performance
requirements under the expected workload. For example, the reports should
be able to display the data while ETL operations are being performed.
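The minimal sketch referred to in the data quality item above is given here. It runs a few typical source-to-target checks (record counts, an aggregated measure and a null check on a mandatory column) against an in-memory SQLite database; the table and column names are illustrative assumptions rather than the schemas of the studied systems.

# Minimal sketch of source-to-target data quality checks; table and column
# names are assumptions made for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src_orders (order_id INTEGER, amount REAL, customer TEXT)")
cur.execute("CREATE TABLE dw_orders  (order_id INTEGER, amount REAL, customer TEXT)")
cur.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                [(1, 10.0, "A"), (2, 20.0, "B"), (3, 30.0, None)])
# Pretend an ETL run has loaded the target table from the source table.
cur.execute("INSERT INTO dw_orders SELECT * FROM src_orders")

def scalar(sql):
    return cur.execute(sql).fetchone()[0]

# Completeness: no records lost between source and target.
assert scalar("SELECT COUNT(*) FROM src_orders") == scalar("SELECT COUNT(*) FROM dw_orders")

# Accuracy: an aggregated measure must match between source and target.
assert scalar("SELECT SUM(amount) FROM src_orders") == scalar("SELECT SUM(amount) FROM dw_orders")

# Column integrity: an assumed business rule forbids NULL customers in the target.
violations = scalar("SELECT COUNT(*) FROM dw_orders WHERE customer IS NULL")
print("rows violating the NOT NULL business rule:", violations)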
All interviewees have reported that the above testing phases are followed by ac-
ceptance testing. Acceptance testing is performed in order to verify the requirements,
especially the non-functional requirements. Non-functional requirements may in-
clude performance, usability of the reports and other components.
Two interviewees have reported the use of various logs, created by different compo-
nents in the DW system, for testing purposes. The logs are checked for complete
documentation of errors and various other problems.
If any component of system is modified at any stage, regression testing is performed.
Regression testing ensures that old functionality of the system that has not been
modified is intact. Two interviewees have reported that they face difficulty in select-
ing test cases for regression testing. They reported that either the test cases are too
many or the impact of the code modification is not very clear.
One interviewee reported a tool, TOAD, for database tuning and SQL development.
According to the interviewee, the tool is used for writing queries and verifying
schemas, integrity constraints etc. Five interviewees have reported that custom-made
tools are used in their organizations for performing integration testing.
Two interviewees have stated the use of statistical testing, using Chi-square statisti-
cal methods, and of smoke testing just after the deployment of releases. In smoke
testing, only a basic review of the system is performed after the deployment of the re-
lease in the production environment. The review checks if all essential components
have been deployed and are in working condition.
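As an illustration of the Chi-square based statistical testing mentioned by the interviewees, the sketch below compares the category distribution counted in a sample of the loaded DW against the distribution counted in the source. The category counts are invented for the example and the SciPy library is assumed to be available; this is one possible way to apply such a test, not the exact procedure used by the interviewed organizations.

# Minimal sketch of statistical testing with a Chi-square goodness-of-fit test.
# The counts are invented; in practice they would be row counts per category
# sampled from the source systems and from the loaded DW.
from scipy.stats import chisquare

source_counts = [480, 310, 210]   # e.g. orders per region counted in the source sample
target_counts = [470, 325, 205]   # the same categories counted in the DW sample

# Scale the expected frequencies so that both samples have the same total.
scale = sum(target_counts) / sum(source_counts)
expected = [count * scale for count in source_counts]

statistic, p_value = chisquare(f_obs=target_counts, f_exp=expected)
print(f"chi-square = {statistic:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("distributions differ significantly: raise a defect for investigation")
else:
    print("no significant distribution shift detected")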
The lack of automated testing tools for integration testing of the DW system has
been cited by four interviewees. According to the interviewees, such tools either do
not exist or are not mature enough with respect to stability and functionality.
Three interviewees reported the use of exploratory testing approach after the ac-
ceptance testing. One of the three interviewees reported that in order to get the
testing done from different perspectives, weekend testing or Friday testing was in-
troduced in their organization. Every Friday, the testers were required to test
anything in the system according to their will, but based on special themes of testing.
The themes were based on various features and functionalities of the DW sys-
tem. For example, a theme of testing can be the usability of the reports of the DW,
performance testing of the DW system, data cleansing of the DW system etc. Two of
the three interviewees, however, did not state any specific method of exploratory
testing that was used in their organizations.
In response to the question about test data usage, five interviewees have stated the use
of masked data or data subsetting as test data. Two interviewees reported the use of
artificial test data, but according to them there is a need for good test data generation
tools. One such tool, Grid Tools Data Maker, has been suggested by one of the two
interviewees.
One interviewee has stated a different strategy for creating artificial test data. Ac-
cording to the interviewee, the source systems are used to create the test data.
Thus, realistic but artificial source data is created.
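The sketch below illustrates, under our own assumptions about the record layout, how masked test data can be derived from a subset of production-like data: personally identifiable columns are replaced with deterministic pseudonyms, so that the real values are hidden while joins between tables still work.

# Minimal sketch of masking a subset of production-like data for testing.
# The record layout and field names are assumptions made for illustration only.
import hashlib

def pseudonym(value, prefix):
    """Deterministic, irreversible replacement that still joins consistently."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{digest}"

def mask_customer(row):
    masked = dict(row)
    masked["name"] = pseudonym(row["name"], "CUST")
    masked["ssn"] = pseudonym(row["ssn"], "SSN")
    return masked

production_subset = [
    {"customer_id": 1, "name": "Alice Example", "ssn": "19800101-1234", "segment": "B2C"},
    {"customer_id": 2, "name": "Bob Example",   "ssn": "19750505-5678", "segment": "B2B"},
]

test_data = [mask_customer(row) for row in production_subset]
for row in test_data:
    print(row)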
Two interviewees reported that high disk-size related cost issues arise due to large test
databases. In order to overcome this challenge, one of the two interviewees suggest-
ed the use of carefully controlled test databases with rich scenarios.
Lack of communication among the developers, customers, testers, architects etc. has
been reported by two interviewees.
5. Support phase
The support phase of the DW development takes place during the actual usage of the DW sys-
tem and its components by the customers or users. Four interviewees have stated the
change of requirements as a challenge. According to the interviewees, when users
come in contact with the DW system, they find various possibilities of viewing the
data. In many cases, these possibilities cannot be fulfilled by the developed DW sys-
tem. Thus, change requests are raised. However, it is a highly expensive process to
deal with the changes in requirements. The interviewees recommend the presence of
a highly skilled data analyst in the requirements phase. One of the four interviewees
recommended that, due to the high cost and effort required, changes in the system
should be treated as a new release development rather than as change requests. Besides,
the customers or clients are usually reluctant to give enough time for these change
requests to be implemented.
The challenges of bad data source quality, inability to use real data, lack of maturity of testing
tools, unlimited testing scenarios and the gap between stakeholders and the DW team were con-
firmed by the interviewees. According to the interviewees, none of them came across the
other challenges identified as a result of the case study and the SLR. The interviewees were of the
opinion that all challenges stated in the literature were practical and could be faced in different
environments, but such challenges were not faced by them. The interviewees were also un-
aware of the various software metrics proposed by the studies in the SLR.
Interestingly, all three interviewees provided additional challenges and testing techniques.
One interviewee stated the challenge of skilled employee turnover. It is possible that a tester
leaves the organization. If a new employee is hired, trainings and business knowledge need
to be provided again.
According to another interviewee, there are difficulties in performing data conversions or
transformations of historical records or data sources. It is quite common that DW projects
come after all of the data source systems have become stable. It is also highly likely that old
systems are not properly documented; thus, it gets very difficult to come up with proper
transformations and conversions. The interviewee also stated that, due to the involvement of
different business partners and a variety of different users, it gets difficult to keep the scope
limited and to determine the data ownership of various entities. A high number of stakeholders
also leads to delays in agreements on requirements and creates lagging communication.
According to an interviewee, the presence of legacy systems and different data sources also
leads to conflicting standards. As a DW project can involve different sub-organizations in an
enterprise, or different independent organizations in the case of public organizations, there is no
central quality control authority. Thus, neither the data source quality is easily improved, nor is
the delay in communication between different organizations reduced.
One interviewee also reported that a large-scale testing infrastructure is required, which leads
to high costs. Organizations are usually reluctant to provide a high budget for testing environ-
ments. The testers should be skilled as well. One interviewee reported that testers sometimes
ignore the complexity of the DW infrastructure and benchmark the performance of the sys-
tem with queries that are logically correct, but that do not consider various other factors such
as the test data volume, the server load on the testers' machines, the number of concurrently
running processes etc.
Interviewees have also reported that various testing-related tools for DW systems are vendor-
specific or data-type specific. This either increases the cost of the testing infrastructure or makes
the management decline to buy such tools.
One interviewee has reported that, many times, the implementation of error conditions and excep-
tion handling procedures in ETL scripts is ignored by the developers. This makes it difficult
for the testers to identify defects, as the error logs may not reveal them.
While the interviewees pointed out the lack of software metrics, one interviewee sug-
gested some metrics that can be used for measuring the quality of the DW system and the
project. These metrics can be found at [31].
Because of all these challenges and difficulties, an interviewee reported that it gets very dif-
ficult to effectively estimate the budget and effort requirements for a specific DW project.
Limited time for testing: Assigned time for testing is less than required.
Lack of guidelines for conducting reviews of physical or logical schema: There are no strict guidelines regarding how to conduct reviews of logical or physical schema designs.
Lack of richness in test data: The test data is insufficient to cover all the test scenarios.
Disk-size cost issue: Large databases for testing purposes require large disk size, which can be costly.
Inability to use real data for testing purposes: Due to privacy regulations it is not possible to use real data for testing purposes.
Lack of maturity of end-to-end testing tools: The testing tools are not mature right now; that is, either they do not provide the requested functionality or they have a high number of defects.
Unlimited combination of test cases: The user has the ability to check any kind of views of the stored data. This leads to the creation of an unlimited combination of test cases.
Lack of testers' skills: Testers should be skilled in writing good SQL queries, thinking about different scenarios for testing, understanding the designs etc.
Difficult selection of test cases for regression testing: From the unlimited testing scenarios, it is difficult to select the most important and limited test cases for regression testing.
Lack of test data generation tools: Good test data generation tools are not available right now.
Gap between stakeholders and DW team: Lack of communication between stakeholders and the DW team causes misunderstanding of requirements.
Gap between different roles of DW team: Lack of communication between different roles, such as developer, tester, system analyst etc., causes misunderstanding of requirements and creates delays.
Knowledge worker turnover issues: Knowledge of the tester or other employee should be retained in case the employee leaves the organization.
High number of stakeholders: A high number of stakeholders leads to difficulty in creating test cases handling the different perspectives of stakeholders. Different stakeholders may have different requirements. This leads to ambiguous requirements and lagging communication.
Responsibility issues: The presence of different roles in the DW team and different stakeholders, including different organizations, makes it difficult to find the persons responsible for a certain issue.
Bad data source quality: The quality of data from the sources is not assured.
Conflicting standards for data source organizations: Many defects are found due to conflicting standards between the data source organizations and the DW organization. It is difficult to identify if an issue should be raised as a change request to the data source organization or the DW organization, or if it should be raised as a defect to the data source organization or the DW organization.
Need for large and costly testing infrastructure: There is a need for a costly and large testing infrastructure, which requires tools, trainings etc.
Lack of customizability of test data generation tools: The testing tools are very specialized and cannot be reused in a different environment.
Difficulty in estimating the DW testing budget: Due to the numerous challenges mentioned above, and the cost of tools, trainings, testing efforts etc., it is difficult to estimate the DW testing budget.
Table 15: Testing techniques and strategies as suggested by interviewees
Testing technique or strategy: Description
Reviews: Reviewing of documents and software code using certain guidelines.
JAD sessions: Joint Application Development sessions are used to clarify and verify the requirements.
Data quality testing: Data quality testing checks if the data has been correctly transferred from the source database tables or data sources to the target database tables or files. Various checks are performed on the system in order to ensure data quality. Examples of such checks include verification of column integrity constraints, enforcement of business and transformation rules, and no loss of data with respect to data length, data type, data value, null values etc.
ETL script operations: All ETL scripts must run as expected. The ETL scripts should be executed as scheduled. If scripts are scheduled to run every month, this should be tested accordingly.
Stress testing: The whole DW system should be able to perform according to the performance requirements under the expected workload. For example, the reports should be able to display the data while ETL operations are being performed.
Smoke testing: A basic review of the system is performed after the deployment of the release in the production environment. The review checks if all essential components have been deployed and are in working condition.
Exploratory testing: A style of software testing that emphasizes the personal freedom and responsibility of the individual tester, to continually optimize the quality of his/her work by treating test-related learning, test design, test execution, and test result interpretation as mutually supportive activities that run in parallel throughout the project.
Weekend testing: Testing is performed on the last day of the week, based on certain themes. Themes of testing can be testing for functionality, usability, or performance etc. Testers are free to select any theme and perform testing in an exploratory way without using any test cases.
Statistical testing: A sample of data is selected and the Chi-square method is used to verify if the distribution of the data is correct.
Acceptance testing: Testing is performed to verify that the system works according to the requirements. It is usually performed manually and only supporting tools, for example query browsers or Toad, are used.
Usability testing: Testing of reports, scripts and other components that can be used by the users. The purpose is to check the user-friendliness of the objects.
Performance testing: All components, reports, ETL operations etc. should complete within the required time limits.
Logs testing: ETL operation logs, system logs or any other logs created by the system are tested to ensure that proper logging is performed. Logging of failures as well as of successful runs should be tested.
data quality testing, ETL operations and scheduling testing, report testing, statistical testing,
system performance testing etc. in different testing phases, such as unit testing, integra-
tion testing, system testing, acceptance testing and smoke testing. It was found that
exploratory testing was also being used in a number of organizations for testing the DW
systems.
Even with highly rigorous testing approaches, methods and various tools, it was found that a
number of challenges are being faced by the testers of DW teams. These challenges vary
from DW project’s testing budget estimation issues to test data and communication issues.
It was also found that data source organizations can be located across geographical boundaries as
well. This makes the DW project essentially a global software engineering project. This is a
highly interesting observation, as no research study was found to investigate the impact
of global software engineering challenges on DW quality.
Figure 11: Identified challenges with respect to research methods
The three main classes are as follows.
1. Processes
Processes are a set of software-related activities.
2. Products
Products are any artifacts, deliverables or documents that are produced as a result of
some process activity.
3. Resources
Resources are the entities that are required by a process activity. Resources include
personnel, tools, hardware etc. Sometimes, resources include products that have
been created by some other process activity. For example, a requirements specifica-
tion document can be used as a resource in testing phase.
Although the above categorization was made for software measurement activities, we found
it suitable for our categorization of challenges.
1. Products: This category contains all the challenges that are related to artifacts and de-
liverables. The documents that are produced during the system life cycle are also
included here.
The specialized classes of Products are as follows.
a. Documents: all the challenges which are faced during the testing of different
documents produced during the DW development life cycle, or challenges
caused by badly written or badly designed documents, such as requirements
specifications, guidelines, designs, development or coding standards, naming
conventions etc. There are two sub-categories of this class.
i. Development Standards: Documents that are certified by the organiza-
tion and considered as standards or guidelines. Examples of
development standards or guidelines include coding standards, de-
velopment models, design standards etc. Examples of challenges
in this category include incomplete steps for database normalization,
different coding standards in different organizations etc.
ii. Requirements Specification: Documents that include the list of system
requirements agreed upon by the stakeholders and the development team.
Requirements specification documents, UML diagrams and use
cases are a few examples of this category. An example of a challenge
in this category is lack of clarity in requirements or incomplete re-
quirements.
b. Testing Data: Challenges related to the data prepared for testing.
c. Executable Code: Challenges related to testing of ETL scripts, SQL scripts or
any kind of executable code for DW.
d. Test Cases: Challenges related to test case design.
2. Process: This category contains all the challenges that are related to software-related
activities, which are normally associated with a time factor.
a. Process Management: Challenges related to management of different pro-
cesses of testing.
i. Ownership assignment: Challenges that are faced due to unclear
roles in the execution of different processes.
ii. Effort estimation: Challenges that are encountered during the activi-
ty of effort estimation.
iii. Communication management: All challenges related to communica-
tion between the members involved.
iv. Budget management: Challenges related to estimating budget for
DW testing activities, phase etc.
b. Measurements: Challenges related to use of metrics while testing.
c. Data Tracking and Tracing: Problems while monitoring the data movement
from the source to the target units, and while tracking it back. These problems are re-
lated to the control of activities in the data source organizations.
3. Resources: This category contains all the challenges that are related to items used
by processes in order to produce their outputs. We exclude the artifacts and delivera-
bles that are produced during the lifecycle of the project.
a. Human resource management: Challenges related to managing the staff
members.
b. Training and skill-set building management: Challenges related to skills of
the staff.
c. Technology management: Challenges related to use of different tools or
hardware for testing.
i. Validation tools: tools for validating the results and making sure of
their correctness and completeness
ii. Test data generation tools: tools for generating data for testing, ei-
ther synthesized data (i.e. created from scratch and made to meet
certain scenarios) or masked data, that is, data that is modified to
hide the real values.
iii. Defect management tools: tools for managing the reported defects,
such as recording the appearance, location and time, and sorting the
defects according to the priority.
Figure 12: Classification of DW testing challenges
4.5 Validity threats
The validity threats of the study are described in this section. We have separated the validity
threats for each research method that was used in this study.
studies. We assumed that the authors of peer-reviewed studies used the non-peer re-
viewed studies by considering the threat of publication bias.
6. Agreement between the authors of this study
There is a possibility that differences of opinion emerge regarding the inclusion or
exclusion of a certain study. This usually happens when the researchers have different
levels of understanding. In order to avoid potential bias and to reduce the
impact of such a threat, we conducted a pilot study. The purpose of the pilot study was
to develop a consistent understanding [47].
We selected a sample of 10 articles and concurrently reviewed them. The findings
were documented. By selecting and rejecting the articles together we were able to
develop a thorough understanding of inclusion and exclusion of studies. In order to
further mitigate this threat, we performed all of the reviews together in meeting
rooms.
http://translate.google.com. However, we found that the translation service was una-
ble to provide us with noise-free translations. Such data can lead to incorrect
understanding. It was difficult to arrange any personnel from FK IT who could
translate the documents for us, due to their busy schedules.
The risk was mitigated by conducting interviews with different roles in the DW
team, as well as with a member of the third-party organization. The results of the findings
were documented and were verified by the person who was in charge of the testing
team.
2. Interviews could not be recorded
During the interviews with the different persons in FK IT, it was not possible to rec-
ord the interviews. The interviewees were found to be uncomfortable with the
recording. Therefore, in order to avoid missing information, notes were taken. In
case of any confusion with the results, follow-up interviews were held with the con-
cerned interviewee.
5 IMPROVEMENT SUGGESTIONS FOR DW TESTING
PROCESS
In this chapter we first provide recommendations for improving the various aspects of DW
testing and then provide the mapping between the recommendations and the categories of
challenges from the last chapter. We believe that, by doing so, we are able to suggest im-
provements that reduce the impact of such challenges.
5.1 Recommendations
While most of the following recommendations come from the results of the SLR, the case study
and the interviews, there are some challenges which were not addressed anywhere in our data collec-
tion methods. Such challenges included communication issues, task assignment issues,
budgeting issues etc.
We leave such challenges for future study. However, for the sake of understandability and
guidance for future researchers, we give our own recommendations for addressing
such challenges. These recommendations were made on the basis of the analysis of challenges as
well as literature studies lying outside the domain of DW testing. Recommendations 1, 4, 5
and 6 were made on the basis of knowledge from interviews, our own understanding and
literature studies outside the domain of DW testing.
Rec 1. Use of universal team
As DW projects aim to consolidate the data from various sister organizations or
departments of the parent organization, we suggest that a universal team should be
created that comprises different roles from the data source organizations as well as the
customers or clients of the DW project.
The universal team should comprise at least the following roles from each or-
ganization.
• Roles from data source organization
A tester, requirements analyst, database architect etc.
• Roles from DW organization
Tester, requirements analyst, DW architect etc.
It is highly possible that data source organizations lie across geographical
boundaries, that is, in a different country, city etc. Therefore, it is difficult to en-
sure the physical presence of the team members within the team. In such a scenario, a
protocol should be designed so that the DW organization and the data source or-
ganization representatives are aware of any modification requests to the source
systems or the DW systems. This can be achieved by simply creating a web por-
tal that can be viewed strictly by the universal team. Any requests for modification
should be placed on the web portal. In case issues are detected, the respective
members can place comments on the request. For any meetings, video con-
ference systems can be used to simulate the physical presence of the team
members.
We motivate our suggestion by saying that once the DW has been created, the data
source organizations are not at liberty to modify the schema or the design of the
components that are related to the DW. For example, a data source organization
may have various database tables that help to populate the DW. Once the DW has
been created, it is not feasible to perform any modifications to the source tables
without communicating with the DW organization.
We believe that, by the creation of such a team, the challenges of data tracking or tracing
and ownership assignment can be easily handled, as it will be easy to find the per-
sonnel responsible for a certain issue. Furthermore, such a team would be able to
support the requirements phase by bringing requirements clarity to the clients, us-
ers or customers of the DW project. Questions such as "what reports
can be developed?" or "what cannot be done with the existing data?" can be
answered easily and quickly by using this approach.
Rec 2. Training of testers
Interviewees have recommended that the testers should be provided with regular,
monthly or bi-annual, trainings related to the organization’s business understand-
ing, development methodologies, database design, query design and other
topics related to DW development.
The testers should also be trained to think out of the box and to use an exploratory
testing approach. Testers should also be encouraged to give presentations to other
testers. Such a practice can help in distributing knowledge to other testers as well
as in helping them find solutions for some unique challenges.
Rec 3. Use of software metrics
Software metrics assist in characterizing different deliverables or units of software
engineering projects, processes and the efforts of professionals. These units or de-
liverables can be analysis documents, design of the system, software code, test
cases, productivity of an individual etc. [18]. If used in a proper manner, software
metrics can help in predicting the success or the failure of a project, release or
product [32].
Rec 4. Use of exploratory testing approach
Software testing is a difficult art that requires the tester to have broad creative
thinking and an exploratory nature. Different testing methods are used during a
software project. All testing methods and approaches are designed to test the prod-
uct for defects [56]. A common approach to testing involves the creation of test cases
followed by their execution. Such an approach is referred to as scripted testing or test
case based testing (TCBT) [9,64].
The purpose of using different forms of testing, discussed in Section 4.1.2.2.2.4.1
and Section 4.3.4.3, is that different testing methods target the software application
from different perspectives. Similarly, another testing approach, called exploratory
testing, has been suggested by three interviewees. Exploratory testing is "a style of
software testing that emphasizes the personal freedom and responsibility of the in-
dividual tester, to continually optimize the quality of his/her work by treating test-
related learning, test design, test execution, and test result interpretation as mutu-
ally supportive activities that run in parallel throughout the project” [42].
Exploratory testing helps to maximize the utilization of the tester’s creativity dur-
ing test execution. The benefits come in two forms. Firstly, exploratory testing
helps to find more critical defects in comparison to TCBT [64]. Secondly, when a
tester works with the application, he learns the application and enjoys his work, and
out of curiosity he comes across many scenarios which could have been
missed by the scripted test cases. In TCBT there is a higher possibility that some
scenarios might get missed during the test case generation phase. However, ex-
ploratory testing addresses this issue by giving the tester an opportunity to think
out of the box while using the application. It would be wrong to say that one technique
is better than the other. Both approaches have their pros and cons. However, if TCBT
is combined with exploratory testing, the negative aspects of both approaches can
be reduced. Testers in industry use both TCBT and exploratory testing side by side
[64].
As exploratory testing is a way of thinking, it can be applied in any form of test-
ing. For example, while performing integration testing in a DW, the tester can
investigate different test scenarios while using the system. The DW architect can
apply an exploratory way of thinking while checking the database schema. He may
find missing constraints in a table. Likewise, the possibilities of applying the explorato-
ry testing approach are numerous. All testers and developers should be trained to have
an exploratory way of thinking. It can be used in different forms and in different
ways.
According to an interviewee, weekend testing was performed at one organization.
In weekend testing, testers were allowed to select any charter, or theme of testing,
as they liked. The testers performed exploratory testing by playing with the soft-
ware based upon the chosen charter or theme of testing.
Exploratory testing can address the problems arising due to requirements defects.
After the scripted testing or TCBT phase has been completed, a one-day bug bash
can be performed. In a bug bash, selected representatives from the analysts, the marketing
department, DW architects, BI managers, users, developers, testers etc. sit together
in one hall. The charter of testing is decided and the testing is executed.
The benefits of using exploratory testing with a bug bash are huge. Firstly, all can-
didates test the system from their domain's perspective. Secondly, if any defects
are found, the issue is easily resolved, as requirements become clearer with candi-
dates from all concerned departments sitting together. One of the challenges of
DW testing was the lack of involvement of the customer. A bug bash with explor-
atory testing can easily address this challenge.
Exploratory testing can also be used for creating test data. Various scenarios can
be created by using an exploratory approach for generating test data from various
tools or data sources.
Rec 5. Development of communities-of-practice for effective communication
It is very difficult to transfer the tacit knowledge of testers to other testers by doc-
umentation. By encouraging informal communication among team members, introducing
tea breaks or holding evening get-togethers, the employees will be
able to communicate with each other. This leads to trust building and experience
sharing among different testers and the DW team.
Such activities help in building communities-of-practice and encourage
knowledge-sharing by the team members within the whole organization [14].
When the knowledge is distributed within the organization, the impact of employ-
ees leaving the organization can be drastically reduced.
Rec 6. Use of defect management tools
Defect management tools can provide graphical reports related to defects per re-
lease, with the size of the release and the severity of the defects. By using this information,
testers can track the requirement change requests, defect resolutions etc. Such tools can also
assist in process improvement of DW testing, as the trend of rising or de-
creasing defects can be used as an indicator for identifying critical error-prone or
risky modules, interfaces, features etc.
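As a sketch of the kind of report such a tool can provide, the snippet below aggregates an invented list of defect records per release and severity, so that a rising or falling trend becomes visible; a real defect management tool would of course provide such reports out of the box.

# Minimal sketch of a defect trend report; the defect records are invented.
from collections import Counter

defects = [
    {"release": "R1", "severity": "high"},
    {"release": "R1", "severity": "low"},
    {"release": "R2", "severity": "high"},
    {"release": "R2", "severity": "high"},
    {"release": "R2", "severity": "medium"},
]

per_release = Counter(d["release"] for d in defects)
per_release_and_severity = Counter((d["release"], d["severity"]) for d in defects)

for release in sorted(per_release):
    print(release, "total defects:", per_release[release])
    for (rel, severity), count in sorted(per_release_and_severity.items()):
        if rel == release:
            print("   ", severity, count)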
Rec 7. Use of validation tools
By automating the testing procedures, a lot of time can be saved. In Section
4.1.2.2.2.4.2 and Section 4.3.4.3, various tools have been presented as suggested
by the researchers and industry practitioners, respectively. These tools can be used
for SQL development, database tuning, performing regression testing, automated
ETL testing etc.
Rec 8. Test data creation strategies
One of the interviewees stated that the data sources are used to create the test data.
We propose that the creation of testing data at the data sources can be made more
effective by using an exploratory approach while generating the test data. The testers
at data source organizations can create data based on various scenarios while ex-
ploring the data source systems. Data created this way will be highly rich in
scenarios, will be meaningful, and will maintain all of the integrity and business
constraints.
It is possible that such a practice will require a lot of time. In such cases,
first a small amount of testing data can be created using the approach just men-
tioned. In the second step, the created test data can be exploded by using various
test data generation tools. Commercial as well as open source tools proposed
by researchers and industry practitioners, in Section 4.1.2.2.2.4.2 and Section
4.3.4.3, can be used to create large amounts of high quality testing data.
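A minimal sketch of this two-step strategy is given below, under the assumption of a small, hand-crafted seed data set: each seed row is multiplied into many slightly varied rows, so that the data volume grows while the carefully designed scenarios are preserved. Real projects would use the generation tools referenced above rather than this toy generator.

# Minimal sketch of "exploding" a small, scenario-rich seed data set into a
# larger test data set. The seed rows and their fields are illustrative assumptions.
import random

random.seed(42)  # reproducible test data

seed_rows = [
    {"order_id": 1, "amount": 99.50, "status": "SHIPPED",   "country": "SE"},
    {"order_id": 2, "amount": 0.0,   "status": "CANCELLED", "country": "SE"},  # edge case
    {"order_id": 3, "amount": 1e6,   "status": "PENDING",   "country": "NO"},  # extreme value
]

def explode(rows, copies_per_row):
    """Create many variants of each seed row, varying keys and amounts slightly."""
    next_id = max(row["order_id"] for row in rows) + 1
    generated = []
    for row in rows:
        for _ in range(copies_per_row):
            variant = dict(row)
            variant["order_id"] = next_id
            variant["amount"] = round(row["amount"] * random.uniform(0.9, 1.1), 2)
            generated.append(variant)
            next_id += 1
    return rows + generated

test_data = explode(seed_rows, copies_per_row=1000)
print(len(test_data), "rows generated from", len(seed_rows), "seed scenarios")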
Rec 9. Use of carefully controlled databases for testing purposes
Some identified challenges were related to the time wasted during the development of
SQL queries. If written improperly, a simple query involving a few OUTER JOINs on
DW tables can take from a few seconds to even 30 minutes to show the results. This
is the case when database sizes are quite big, that is, in gigabytes or terabytes.
Such scenarios lead to a decrease in the tester's productivity and make the tester lose
focus on testing.
An interviewee suggested the use of carefully controlled databases for different
testing methods and techniques. It should be noted that different test phases and
testing techniques focus on testing different aspects of the product. For example,
unit testing focuses on testing the smallest piece of the software application, isolated
from other pieces and tested to identify the code behavior. It is not necessary to
have a very large test database for unit testing.
Based on the interviewee’s suggestions, we state the sizes of databases for various
purposes in Table 17.
Table 17: Controlled databases for various types of testing
Type of testing: Suggested database size
Unit testing: Few hundred megabytes
Module testing: Few hundred megabytes
Integration testing: Few hundred gigabytes
5.2 Mapping of challenges classes with recommendations
Table 18 provides the mapping of challenge classes and recommendations that can help to
reduce their impact.
Table 18: Mapping of challenges and recommendations
Category name: Recommendations for improvement
Development standards: Rec 10
Requirements standards: Rec 10
Testing data: Rec 8, Rec 9
Executable code: Rec 2, Rec 3, Rec 4
Test cases: Rec 4
Ownership assignment: Rec 1, Rec 10
Effort estimation: Rec 12
Measurements: Rec 3
Data tracking and tracing: Rec 1, Rec 10
Communication management: Rec 1, Rec 5
Human resource management: Rec 5
Training and skill-set management: Rec 2, Rec 4
Budget management: By using all of the recommendations, the impact of the challenges can be
reduced, which leads to effective budget management.
Validation tools: Rec 7
Test data generation tools: Rec 8, Rec 9
Defect management tools: Rec 6
6 CONCLUSION
In this study we consolidated the research in the field of DW testing by conducting an SLR
followed by SST. By conducting interviews and a case study we were able to gather more
practices of DW testing, a collection of tools and additional challenges in DW testing.
The research questions were answered as follows:
RQ2. What are the improvement opportunities for improving the DW testing process?
Based on the categories of classification we provide improvement suggestions in Chap-
ter 5. The recommendations can help to reduce the impact of identified challenges.
Most of the challenges found in the study were related to test data creation and to the
need for tools for various purposes of DW testing. The rising trend of DW systems requires a
standardized testing approach and tools that can help to save time by automating the testing
process. While such tools are available commercially as well as from the open source com-
munity for operational software testing, there is a lack of such tools for DW testing.
Moreover, as a result of the industry interviews, the study found management-related challenges
which impact the DW testing process. Such challenges included communication issues, task
assignment issues, budgeting issues etc. These challenges were not reported by any of the
literature studies. This opens up a new horizon and direction for future research.
We conclude that there is a need for empirical studies that can propose a comprehensive
framework for testing the DW systems and studies that propose various tools that can help to
automate the testing tasks. We also propose studies that investigate the DW testing process
from the management perspective.
7 FUTURE WORK
In this study we identified a number of challenges, testing techniques, software metrics for
DW projects and various tools that can support DW testing activities. Based on the results of
this study it was evident that no standardized testing framework is being used in the industry.
Moreover, the industry is not aware of the various testing tools and software metrics related
to DW projects. With this study we have tried to increase the awareness of these tech-
niques.
After the consolidation of research in DW testing, we now plan to develop a standardized
framework for testing the DW systems. Moreover, by developing or extending the identified
open source tools, we plan to enhance the quality of testing tools and increase the efficiency
and effectiveness of test data generation tools. The future study will also involve an evalua-
tion of the identified testing tools to observe their performance, features and various other
attributes.
The study has identified that the testing process is hindered not just by issues with testing tech-
niques, measures, or tools, but by management activities as well. Management-related issues
include lack of communication, inability to estimate the testing infrastructure budget, task assignment etc. It
was also seen that data source organizations can be in more than one country, which may
lead to the challenges similar to the ones faced by Global Software Engineering (GSE).
Thus, the future work will also include a thorough investigation of management factors that
impact the quality of the DW system. In addition, the impact of GSE environment on DW
quality needs to be studied.
REFERENCES
1. Allan, G. A Critique of using Grounded Theory as a Research Method. Electronic Journal of
Business Research Methods 2, 1 (2003).
2. Artz, J.M. How good is that data in the warehouse? Data Base for Advances in Information
Systems 28, 3 (1997), 21-31.
3. Babar, M.A. and Zhang, H. Systematic literature reviews in software engineering: Preliminary
results from interviews with researchers. Empirical Software Engineering and Measurement,
2009. ESEM 2009. 3rd International Symposium on, (2009), 346-355.
4. Bai, X. Testing the Performance of an SSAS Cube Using VSTS. Information Technology:
New Generations (ITNG), 2010 Seventh International Conference on, (2010), 986-991.
5. Bailey, K. Methods of social research. Free Press, 2007.
6. Bailey, K.D. Typologies and taxonomies: An introduction to classification techniques. Sage
Publications, Inc, 1994.
7. Berenguer, G., Romero, R., Trujillo, J., Serrano, M., and Piattini, M. A Set of Quality Indica-
tors and Their Corresponding Metrics for Conceptual Models of Data Warehouses. In A. Tjoa
and J. Trujillo, eds., Data Warehousing and Knowledge Discovery. Springer Berlin / Heidel-
berg, 2005, 95-104.
8. Berson, A. and Smith, S.J. Components of a Data Warehouse. THE DATA
ADMINISTRATION NEWSLETTER - TDAN.com. http://www.tdan.com/view-articles/4213.
9. Bhatti, K. and Ghazi, A.N. Effectiveness of Exploratory Testing An empirical scrutiny of the
challenges and factors. 2010.
10. Biolchini, J., Mian, P.G., Natali, A.C.C., and Travassos, G.H. Systematic review in software
engineering. System Engineering and Computer Science Department COPPE/UFRJ, Tech-
nical Report ES 679, 05 (2005).
11. Bourque, P. and Dupuis, R. Guide to the Software Engineering Body of Knowledge 2004
Version. Guide to the Software Engineering Body of Knowledge, 2004. SWEBOK, (2004).
12. Brahmkshatriya, K. Data Warehouse Testing. Stickyminds.com, 2007.
http://www.stickyminds.com/s.asp?F=S12340_ART_2.
13. Brereton, P., Kitchenham, B., Budgen, D., Turner, M., and Khalil, M. Lessons from applying
the systematic literature review process within the software engineering domain. Journal of
Systems and Software 80, 4 (2007), 571-583.
14. Brown, J.S. and Duguid, P. Organizational learning and communities-of-practice: Toward a
unified view of working, learning, and innovation. Organization science, (1991).
15. Calero, C., Piattini, M., Pascual, C., and Manuel, A.S. Towards data warehouse quality met-
rics. Proceedings of the International Workshop on Design and Management of Data
Warehouses, (2001), 1-10.
16. Chays, D., Dan, S., Frankl, P.G., Vokolos, F.I., and Weber, E.J. A framework for testing data-
base applications. SIGSOFT Softw. Eng. Notes 25, 5 (2000), 147-157.
17. Chellamuthu, S. and Punithavalli, D.M. Detecting Redundancy in Biological Databases? An
Efficient Approach. Global Journal of Computer Science and Technology 9, 4 (2009).
18. Chhabra, J.K. and Gupta, V. A survey of dynamic software metrics. J. Comput. Sci. Technol.
25, 5 (2010), 1016-1029.
19. Ciszak, L. Application of clustering and association methods in data cleaning. International
Multiconference on Computer Science and Information Technology, 2008. IMCSIT 2008.,
IEEE (2008), 97–103.
20. Cooper, R. and Arbuckle, S. How to Thoroughly Test a Data Warehouse. STAREAST Orlan-
do, (2002).
21. Creswell, J.W. Research design: Qualitative, quantitative, and mixed methods approaches.
Sage Publications, Inc, 2009.
22. Dell’Aquila, C., Tria, F. Di, Lefons, E., and Tangorra, F. Logic programming for data ware-
house conceptual schema validation. DaWaK’10 Proceedings of the 12th international
conference on Data warehousing and knowledge discovery, Springer-Verlag Berlin, Heidel-
berg (2010), 1–12.
23. Eno, J. and Thompson, C.W. Generating Synthetic Data to Match Data Mining Patterns. In-
ternet Computing, IEEE 12, 3 (2008), 78-82.
24. Farré, C., Rull, G., Teniente, E., and Urpí, T. SVTe: a tool to validate database schemas
giving explanations. Proceedings of the 1st international workshop on Testing database sys-
tems, ACM (2008), 1–6.
25. Fenton, N. and Pfleeger, S.L. Software Metrics: A Rigorous and Practical Approach. PWS
Publishing Co., 1997.
26. Galhardas, H. Data Cleaning and Transformation Using the AJAX Framework. Generative
and Transformational Techniques in Software Engineering, (2006), 327--343.
27. Ganapathi, A., Kuno, H., Dayal, U., et al. Predicting multiple metrics for queries: Better deci-
sions enabled by machine learning. IEEE International Conference on Data Engineering,
IEEE (2009), 592–603.
28. Golfarelli, M. and Rizzi, S. A comprehensive approach to data warehouse testing. Interna-
tional Conference on Information and Knowledge Management, Proceedings, (2009), 17-24.
29. Haertzen, D. Testing the data warehouse.
http://infogoal.com/datawarehousing/testing_the_data_warehouse.htm.
30. Haraty, R.A., Mansour, N., and Daou, B. Regression testing of database applications. Pro-
ceedings of the 2001 ACM symposium on Applied computing, ACM (2001), 285-289.
31. Hay, J. Blog by Justin Hay. http://justinhay.wordpress.com/, 2011.
http://justinhay.wordpress.com/.
32. Henderson-Sellers, B. Object-oriented metrics: measures of complexity. Prentice-Hall, Inc.,
1996.
33. Hoag, J.E. and Thompson, C.W. A parallel general-purpose synthetic data generator. ACM
SIGMOD Record 36, 1 (2007), 19-24.
34. Houkjær, K., Torp, K., and Wind, R. Simple and realistic data generation. Proceedings of the
32nd international conference on Very large data bases (VLDB ’06), VLDB Endowment
(2006), 1243--1246.
35. Huynh, T., Nguyen, B., Schiefer, J., and Tjoa, A. BEDAWA-A Tool for Generating Sample
Data for Data Warehouses. Data Warehousing and Knowledge Discovery, (2000), 83–93.
36. Inmon, W.H. Building the data warehouse. John Wiley, 2005.
37. Inmon, W.H. Building the data warehouse. Wiley-India, 2009.
38. Jarke, M. Architecture and quality in data warehouses: An extended repository approach.
Information Systems 24, 3 (1999), 229-253.
39. Jeske, D., Lin, P., Rendon, C., Xiao, R., and Samadi, B. Synthetic data generation capabilities
for testing data mining tools. MILCOM’06 Proceedings of the 2006 IEEE conference on Mili-
tary communications, IEEE Press (2006), 1-6.
40. Jeske, D.R., Samadi, B., Lin, P.J., et al. Generation of synthetic data sets for evaluating the
accuracy of knowledge discovery systems. Proceedings of the ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, (2005), 756-762.
41. Kaner, C. and Bach, J. Exploratory testing in pairs. Software Testing Analysis & Review Con-
ference (STAR) West, (2001).
42. Kaner, C. A Tutorial in Exploratory Testing. 2008.
43. Kimball, R. and Caserta, J. The Data Warehouse ETL Toolkit: Practical Techniques for Ex-
tracting, Cleanin. John Wiley & Sons, 2004.
44. Kimball, R. and Caserta, J. The data warehouse ETL toolkit. Wiley-India, 2009.
45. Kimball, R. and Ross, M. The Kimball Group Reader: Relentlessly Practical Tools for Data
Warehousing and Business Intelligence. Wiley Publishing, 2010.
46. Kimball, R., Ross, M., Thorthwaite, W., Becker, B., and Mundy, J. The data warehouse
lifecycle toolkit. Wiley-India, 2009.
47. Kitchenham, B. and Charters, S. Guidelines for performing systematic literature reviews in
software engineering. Engineering 2, EBSE 2007-001 (2007).
48. Kitchenham, B. Procedures for performing systematic reviews. Keele, UK, Keele University
33, (2004).
49. Kitchenham, B.A., Dyba, T., and Jorgensen, M. Evidence-Based Software Engineering. Pro-
ceedings of the 26th International Conference on Software Engineering, IEEE Computer
Society (2004), 273-281.
50. Madsen, M. A 50% Data Warehouse Failure Rate Is Nothing New. Toolbox.com.
http://it.toolbox.com/blogs/bounded-rationality/a-50-data-warehouse-failure-rate-is-nothing-
new-4669.
51. Mathen, M.P. Data Warehouse Testing. www.Infosys.com, 2010.
http://www.infosys.com/offerings/IT-services/independent-validation-testing-services/white-
papers/Documents/data-warehouse-testing.pdf.
52. Mendes, E. and Kitchenham, B. Protocol for systematic review. (2004).
53. Mookerjea, A. and Malisetty, P. Data Warehouse / ETL Testing: Best Practices. Test India,
(2008).
54. Muñoz, L., Mazón, J.-N., and Trujillo, J. A family of experiments to validate measures for
UML activity diagrams of ETL processes in data warehouses. Information and Software
Technology 52, 11 (2010), 1188-1203.
55. Muñoz, L., Mazón, J.N., and Trujillo, J. Measures for ETL processes models in data ware-
houses. MoSE+DQS ’09; Proceeding of the first international workshop on Model driven
service engineering and data quality and security, ACM Hong Kong, China (2009), 33–36.
56. Myers, G.J. The Art of Software Testing. Wiley Publishing, 2004.
57. Rainardi, V. Building a data warehouse: with examples in SQL Server. A1bazaar, 2008.
58. Rakitin, S.R. Software verification and validation for practitioners and managers. Artech
House, 2001.
59. Ratcliff, D. 15 Methods of Data Analysis in Qualitative Research. qualita-
tiveresearch.ratcliffs.net. qualitativeresearch.ratcliffs.net/15methods.pdf.
60. Runeson, P. and Höst, M. Guidelines for conducting and reporting case study research in
software engineering. Empirical Software Engineering 14, 2 (2009), 131-164.
61. Seidel, J.V. Qualitative Data Analysis. Qualitative Data Analysis, in The Ethnograph v5.0: A
Users Guide, Appendix E, 1998, Colorado Springs, Colorado: Qualis Research, 1998.
www.qualisresearch.com.
62. Serrano, M., Calero, C., Trujillo, J., Luján-Mora, S., and Piattini, M. Empirical Validation of
Metrics for Conceptual Models of Data Warehouses. In A. Persson and J. Stirna, eds., Ad-
vanced Information Systems Engineering. Springer Berlin / Heidelberg, 2004, 493-510.
63. Serrano, M., Trujillo, J., Calero, C., and Piattini, M. Metrics for data warehouse conceptual
models understandability. Information and Software Technology 49, 8 (2007), 851-870.
64. Shah, S.M.A. and Alvi, U.S. A Mix Testing Process Integrating Two Manual Testing Ap-
proaches: Exploratory Testing and Test Case Based Testing. 2010.
65. Simon, A.R. Data warehousing for dummies. John Wiley & Sons, Inc., 1997.
66. Singh, J. and Singh, K. Designing a Customized Test Data Generator for Effective Testing of
a Large Database. International Conference on Advanced Computer Theory and Engineering,
2008. ICACTE’08, IEEE Computer Society (2008), 84 - 88.
67. Singh, J. and Singh, K. Statistically Analyzing the Impact of Automated ETL Testing on the
Data Quality of a Data Warehouse. International Journal of Computer and Electrical Engi-
neering (IJCEE) 1, 4 (2009), 488-495.
68. Sneed, H.M. Testing a Datawarehouse - An Industrial Challenge. Testing: Academic and
Industrial Conference - Practice And Research Techniques, 2006. TAIC PART 2006. Proceed-
ings, (2006), 203-210.
69. Spillner, A., Linz, T., and Schaefer, H. Software Testing Foundations: A Study Guide for the
Certified Tester Exam. Rocky Nook, 2007.
70. Tanuška, P., Moravčík, O., Važan, P., and Miksa, F. The Proposal of the Essential Strategies
of DataWarehouse Testing. Central European Conference on Information and Intelligent Sys-
tems 19 th International Conference 2008 Proceedings, Faculty of Organization and
Informatics, Pavlinska 2, Varazadin, 42000, Croatia (2008).
71. Tanuška, P., Moravčík, O., Važan, P., and Miksa, F. The Proposal of Data Warehouse Testing
Activities. Proceedings of the 20th Central European Conference on Information and Intelli-
gent Systems, (2009), 7-11.
72. Theobald, J. Strategies for Testing Data Warehouse Applications. Information Management
Magazine, 2007. http://www.information-management.com/issues/20070601/1086005-1.html.
73. Thomsen, C. and Pedersen, T.B. ETLDiff: a semi-automatic framework for regression test of
ETL software. Data Warehousing and Knowledge Discovery. 8th International Conference,
DaWaK 2006. Proceedings (Lecture Notes in Computer Science Vol. 4081), (2006), 1-12.
74. Vijayaraghavan, G. and Kaner, C. Bug taxonomies: Use them to generate better tests. Soft-
ware Testing Analysis & Review Conference (STAR) East, Citeseer (2003).
75. Willmor, D. and Embury, S.M. A safe regression test selection technique for database-driven
applications. Software Maintenance, 2005. ICSM’05. Proceedings of the 21st IEEE Interna-
tional Conference on, (2005), 421-430.
76. Willmor, D. and Embury, S.M. Testing the implementation of business rules using intensional
database tests. Proceedings of the Testing: Academic & Industrial Conference on Practice
And Research Techniques; TAIC-PART ’06, IEEE Computer Society (2006).
77. Yannikos, Y., Franke, F., Winter, C., and Schneider, M. 3LSPG: Forensic Tool Evaluation by
Three Layer Stochastic Process-Based Generation of Data. Proceedings of the 4th interna-
tional conference on Computational forensics, IWCF’10, Springer Berlin / Heidelberg (2010),
200-211.
78. Yin, R.K. Case Study Research: Design and Methods. Sage Publications, Inc, 2002.
79. International Standard ISO/IEC 9126-1: Software engineering - Product quality - Quality
model. 2001.
80. TECHNICAL REPORT ISO / IEC TR 9126-3: Software engineering - Product quality - Inter-
nal metrics. 2003.
81. TECHNICAL REPORT ISO / IEC TR 9126-2: Software engineering - Product quality - Exter-
nal metrics. 2003.
82. TECHNICAL REPORT ISO / IEC TR 9126-4: Software Engineering - Product Quality - Part
4: Quality in use metrics. 2004.
83. TECHNICAL REPORT ISO / IEC TR 25021: Software engineering : Software Product Qual-
ity Requirements and Evaluation (SQuaRE) - Quality measure elements. 2007.
84. Smoke Testing. www.toolbox.com, 2008.
http://it.toolbox.com/wiki/index.php/Sanity_and_Smoke_Testing.
85. Certified Tester Expert Level Syllabus Improving the Testing Process, V 1.0.2. 2010.
86. Definition of the word Problem. Cambridge Dictionaries Online.
http://dictionary.cambridge.org/dictionary/british/problem?q=Problem.
87. Definition of the word Challenge. Cambridge Dictionaries Online.
http://dictionary.cambridge.org/dictionary/british/challenge_1?q=challenge.
88. Data Warehouse Components and Framework. Execution-Mih.
http://www.executionmih.com/data-warehouse/components-framework.php.
89. Grey Literature. University of Ottawa. http://www.biblio.uottawa.ca/content-
page.php?g=en&s=rgn&c=src-litgris.
90. Data warehouse concepts. www.oracle.com.
http://download.oracle.com/docs/cd/B28359_01/server.111/b28313/concept.htm.
Appendix A: Search Strings
For Abstract:
Owner:GUIDE AND (
("Abstract":"data warehouse" OR "Abstract":"data warehouses" OR "Ab-
stract":"data mart" OR "Abstract":"data marts" OR "Abstract":"Business
Intelligence" OR "Abstract":"ETL" OR(("Abstract":"extract" OR "Ab-
stract":"extraction") AND ("Abstract":"transform" OR "Abstract":"transformation"
OR "Abstract":"transforming") AND ("Abstract":"loading" OR "Ab-
stract":"load")) OR "Abstract":"large database" OR "Abstract":"large databases"
OR "Abstract":"OLAP" OR ("Abstract":"online" AND "Abstract":"analytical"
AND "Abstract":"processing")
) AND (
"Abstract":"testing" OR "Abstract":"test" OR "Abstract":"quality assurance" OR
"Abstract":"quality control" OR "Abstract":"validation" OR "Ab-
stract":"verification"
) AND (
("Abstract":"challenge" OR "Abstract":"challenges" OR "Abstract":"problem" OR
"Abstract":"problems" OR "Abstract":"difficulties" OR "Abstract":"difficulty" OR
"Abstract":"issues" OR "Abstract":"hardships" OR "Ab-
stract":"hardship")OR("Abstract":"tool" OR "Abstract":"tools" OR
"Abstract":"approach" OR "Abstract":"approaches" OR "Abstract":"technique"
OR "Abstract":"techniques" OR "Abstract":"strategy" OR"Abstract":"strategies"
OR "Abstract":"process" OR "Abstract":"processes" OR "Abstract":"framework"
OR "Abstract":"frameworks" OR "Abstract":"automatic"
64
OR"Abstract":"automation" OR "Abstract":"automate" OR "Ab-
stract":"automated" OR "Abstract":"automating" OR "Abstract":method*))
)
For Title:
Owner:GUIDE AND (
("Title":"data warehouse" OR "Title":"data warehouses" OR "Title":"data mart"
OR "Title":"data marts" OR "Title":"Business Intelligence" OR "Title":"ETL"
OR(("Title":"extract" OR "Title":"extraction") AND ("Title":"transform" OR "Ti-
tle":"transformation" OR "Title":"transforming") AND ("Title":"loading" OR
"Title":"load")) OR "Title":"large database" OR "Title":"large databases" OR "Ti-
tle":"OLAP" OR ("Title":"online" AND "Title":"analytical" AND
"Title":"processing")
) AND (
"Title":"testing" OR "Title":"test" OR "Title":"quality assurance" OR "Ti-
tle":"quality control" OR "Title":"validation" OR "Title":"verification"
) AND (
("Title":"challenge" OR "Title":"challenges" OR "Title":"problem" OR "Ti-
tle":"problems" OR "Title":"difficulties" OR "Title":"difficulty" OR
"Title":"issues" OR "Title":"hardships" OR "Title":"hardship")OR("Title":"tool"
OR "Title":"tools" OR "Title":"approach" OR "Title":"approaches" OR "Ti-
tle":"technique" OR "Title":"techniques" OR "Title":"strategy"
OR"Title":"strategies" OR "Title":"process" OR "Title":"processes" OR "Ti-
tle":"framework" OR "Title":"frameworks" OR "Title":"automatic"
OR"Title":"automation" OR "Title":"automate" OR "Title":"automated" OR "Ti-
tle":"automating" OR "Title":method*))
)
For Keywords:
Owner:GUIDE AND (
("Keywords":"data warehouse" OR "Keywords":"data warehouses" OR "Key-
words":"data mart" OR "Keywords":"data marts" OR "Keywords":"Business
Intelligence" OR "Keywords":"ETL" OR(("Keywords":"extract" OR "Key-
words":"extraction") AND ("Keywords":"transform" OR
"Keywords":"transformation" OR "Keywords":"transforming") AND ("Key-
words":"loading" OR "Keywords":"load")) OR "Keywords":"large database" OR
"Keywords":"large databases" OR "Keywords":"OLAP" OR ("Key-
words":"online" AND "Keywords":"analytical" AND "Keywords":"processing")
) AND (
"Keywords":"testing" OR "Keywords":"test" OR "Keywords":"quality assurance"
OR "Keywords":"quality control" OR "Keywords":"validation" OR "Key-
words":"verification"
) AND (
("Keywords":"challenge" OR "Keywords":"challenges" OR "Key-
words":"problem" OR "Keywords":"problems" OR "Keywords":"difficulties" OR
"Keywords":"difficulty" OR "Keywords":"issues" OR "Keywords":"hardships"
OR "Keywords":"hardship")OR("Keywords":"tool" OR "Keywords":"tools" OR
"Keywords":"approach" OR "Keywords":"approaches" OR "Key-
words":"technique" OR "Keywords":"techniques" OR "Keywords":"strategy"
OR"Keywords":"strategies" OR "Keywords":"process" OR "Key-
words":"processes" OR "Keywords":"framework" OR "Keywords":"frameworks"
OR "Keywords":"automatic" OR"Keywords":"automation" OR "Key-
65
words":"automate" OR "Keywords":"automated" OR "Keywords":"automating"
OR "Keywords":method*))
)
form" OR "transformation" OR "transforming") AND ("loading" OR "load")) OR
"large database" OR "large databases" OR "OLAP" OR ("online" AND "analytical"
AND "processing")
) AND (
"testing" OR "test" OR "quality assurance" OR "quality control" OR "validation"
OR "verification"
) AND (
( "challenge" OR "challenges" OR "problem" OR "problems" OR "difficulties" OR
"difficulty" OR "issues" OR "hardships" OR "hardship" ) OR ( "tool" OR "tools" OR
"approach" OR "approaches" OR "technique" OR "techniques" OR "strategy" OR
"strategies" OR "process" OR "processes" OR "framework" OR "frameworks" OR
"automatic" OR "automation" OR "automate" OR "automated" OR "automating" OR
method* )))
)
OR abstract:(
(( "data warehouse" OR "data warehouses" OR "data mart" OR "data marts" OR
"Business Intelligence" OR "ETL" OR (("extract" OR "extraction") AND ("trans-
form" OR "transformation" OR "transforming") AND ("loading" OR "load")) OR
"large database" OR "large databases" OR "OLAP" OR ("online" AND "analytical"
AND "processing")
) AND (
"testing" OR "test" OR "quality assurance" OR "quality control" OR "validation"
OR "verification"
) AND (
( "challenge" OR "challenges" OR "problem" OR "problems" OR "difficulties" OR
"difficulty" OR "issues" OR "hardships" OR "hardship" ) OR ( "tool" OR "tools" OR
"approach" OR "approaches" OR "technique" OR "techniques" OR "strategy" OR
"strategies" OR "process" OR "processes" OR "framework" OR "frameworks" OR
"automatic" OR "automation" OR "automate" OR "automated" OR "automating" OR
method* )))
)
OR keyword:(
(( "data warehouse" OR "data warehouses" OR "data mart" OR "data marts" OR
"Business Intelligence" OR "ETL" OR (("extract" OR "extraction") AND ("trans-
form" OR "transformation" OR "transforming") AND ("loading" OR "load")) OR
"large database" OR "large databases" OR "OLAP" OR ("online" AND "analytical"
AND "processing")
) AND (
"testing" OR "test" OR "quality assurance" OR "quality control" OR "validation"
OR "verification") AND (( "challenge" OR "challenges" OR "problem" OR "prob-
lems" OR "difficulties" OR "difficulty" OR "issues" OR "hardships" OR "hardship"
) OR ( "tool" OR "tools" OR "approach" OR "approaches" OR "technique" OR
"techniques" OR "strategy" OR "strategies" OR "process" OR "processes" OR
"framework" OR "frameworks" OR "automatic" OR "automation" OR "automate"
OR "automated" OR "automating" OR method* )))
)
6. Research database: Engineering Village
(
( "data warehouse" OR "data warehouses" OR "data mart" OR "data marts" OR
"Business Intelligence" OR "ETL" OR (("extract" OR "extraction") AND ("trans-
form" OR "transformation" OR "transforming") AND ("loading" OR "load")) OR
"large database" OR "large databases" OR "OLAP" OR ("online" AND "analytical"
AND "processing")
) AND (
"testing" OR "test" OR "quality assurance" OR "quality control" OR "validation"
OR "verification"
) AND (
( "challenge" OR "challenges" OR "problem" OR "problems" OR "difficulties" OR
"difficulty" OR "issues" OR "hardships" OR "hardship" ) OR ( "tool" OR "tools" OR
"approach" OR "approaches" OR "technique" OR "techniques" OR "strategy" OR
"strategies" OR "process" OR "processes" OR "framework" OR "frameworks" OR
"automatic" OR "automation" OR "automate" OR "automated" OR "automating" OR
method* ))
) wn KY
7. Research database: CECIIS
Due to space limitations and the limited capability of the search system, only the first part
of the string was used. It was applied to All Categories.
Appendix B: SLR Data Extraction Form
Reviewer’s Name:
Assessment of eligibility (if the answer is yes, continue; if no, exclude):
• Is the domain of the article related to IT?
• Is the title related to DW, ETL, data marts, OLAP, large databases, or business intelligence?
• Does the abstract show that the article is related to quality management in general for the
DW or related systems and their parts?
Appendix C: Interview questions for case study
1. Questions related to introduction of the interviewee
a. What is your name?
b. What is your role in the organization?
c. How much experience do you have in the data warehouse (DW) development
and testing field?
d. In what way are you involved in the DW development and testing area?
i. Ease of use
ii. Performance
iii. Cost
iv. Features
c. Are there any tools being used in your organization that can automate the testing?
If yes, what phases of testing (regression testing, module testing etc.) are automated
and what tools are used? (Examples of such tools include the open-source DBUnit and
ETLDiff.)
d. How do you generate the test data? Which tools do you use for generating test
data?
i. Is real data, masked/subset data, or purely artificial data being used for
testing? (A minimal sketch of one possible masking approach is given after this
question list.)
5. Challenges
a. What are the challenges faced by you in testing and developing the DW?
b. In your opinion, what are the most difficult aspects of DW testing?
6. Process improvement
a. What are your suggestions for process improvement? How can the testing of a DW
be improved in your organization?
b. Which tools, in your opinion, are good for the different phases and parts of a DW?
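Note to question 4.d: below is a minimal, illustrative Java sketch of one way masking could be
done, namely copying real rows into a test schema while replacing a sensitive column with a
one-way hash. The connection URLs, the customer table and its columns are hypothetical
placeholders and are not taken from any of the interviewed organizations. Hashing rather than
randomizing keeps equal source values equal, so joins and distribution-dependent tests still
behave much like they would on real data.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Copies rows from a (hypothetical) production CUSTOMER table into a test schema,
// replacing the personal-number column with a one-way hash so that realistic data
// can be used for DW testing without exposing identities.
public class CustomerMasker {

    public static void main(String[] args) throws Exception {
        // Connection URLs, users and passwords are placeholders.
        try (Connection src = DriverManager.getConnection("jdbc:postgresql://prod-host/dw", "reader", "secret");
             Connection dst = DriverManager.getConnection("jdbc:postgresql://test-host/dw", "tester", "secret");
             Statement read = src.createStatement();
             ResultSet rs = read.executeQuery("SELECT customer_id, personal_number, city FROM customer");
             PreparedStatement write = dst.prepareStatement(
                     "INSERT INTO customer (customer_id, personal_number, city) VALUES (?, ?, ?)")) {

            while (rs.next()) {
                write.setLong(1, rs.getLong("customer_id"));               // surrogate key kept as-is
                write.setString(2, mask(rs.getString("personal_number"))); // identity masked
                write.setString(3, rs.getString("city"));                  // non-sensitive attribute kept
                write.addBatch();
            }
            write.executeBatch();
        }
    }

    // One-way hash: equal inputs stay equal (joins still work), but values are unreadable.
    private static String mask(String value) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(value.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.substring(0, 12); // truncated to fit the column; acceptable for test data
    }
}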
Appendix D: Interview questions for industrial interviews
Interviewee name: Designation:
1. Please describe your company briefly; what products or services do you deliver?
2. In what way are you involved in the DW development and testing area?
A few assumptions:
• For this report, we consider testing as any form of verification or validation activity. Ex-
amples of testing include requirements reviews, schema design review, use of software
metrics, unit testing, integration testing etc.
• The data warehouse development can be performed in an iterative fashion or by using a
sequential waterfall approach. The whole development lifecycle or a single iteration (in
case of iterative development) goes through the following phases. We have identified some
of the activities, but projects may follow any number of activities.
3. Various testing techniques, inspections, reviews etc. can be applied in different
phases and to the activities within a phase. Please state how you test each phase and its
activities. If you do not perform a certain activity, or if you perform some other
activity, please state so.
a. Requirements phase
Activity | Testing technique used to verify activity
Requirements elicitation and documentation
Data sources profiling
Workload refinement
b. Design phase
Activity | Testing technique used to verify activity
Logical schema design
Physical design
c. Implementation phase
Activity | Testing technique used to verify activity
ETL scripts development
Integration testing
System testing
Acceptance testing
Statistical testing
e. Support phase
Activity | Testing technique used to verify activity
Regression testing
ETL monitoring
4. Modifications to code are sometimes made for various reasons, for example defect fixing
or change requests. How do you select the test cases for regression testing?
5. Do you use any tools for automating the testing techniques? For example, ETLDiff may be
used for regression testing, and DBUnit can also be scripted to regression test ETL
procedures in an automated fashion. (A minimal sketch of such a DBUnit test is given after
this question list.)
6. For various testing purposes, sometimes real data, synthetic data (data created from
scratch) or masked/encrypted data (real data modified to address data privacy) is used.
What kind of test data do you use? If you use different kinds of test data for various
purposes, please state so.
7. Sometimes, various test database sizes are used for different forms of testing (as
identified in question 6). What test database size(s) do you use?
10. What difficulties do you face while testing the implementation phase?
11. What difficulties do you face while testing the system or in the testing phase?
12. What difficulties do you face while monitoring the system while the customers or
users are accessing the system?
13. Do you face any other difficulties anytime during the project related to DW testing?
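Note to question 5: the following is a minimal, hedged sketch of how DBUnit could be scripted
to regression test the output of an ETL procedure by comparing a loaded table against a
known-good snapshot. The JDBC URL, the SALES_FACT table and the expected/sales_fact.xml
snapshot file are illustrative assumptions only, not part of any interviewed organization's
setup. ETLDiff [73] automates a similar comparison of ETL results between runs at the
framework level.

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;

import org.dbunit.Assertion;
import org.dbunit.database.DatabaseConnection;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.ITable;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import org.junit.Test;

// Minimal sketch of a DBUnit-based regression test for an ETL procedure: the table
// produced by the current ETL run is compared against a snapshot accepted in an
// earlier release. All names and paths are placeholders.
public class EtlRegressionTest {

    @Test
    public void salesFactMatchesKnownGoodSnapshot() throws Exception {
        try (Connection jdbc = DriverManager.getConnection("jdbc:h2:mem:dwtest", "sa", "")) {
            IDatabaseConnection connection = new DatabaseConnection(jdbc);

            // Expected rows: a snapshot exported after the previous, accepted ETL run.
            IDataSet expected = new FlatXmlDataSetBuilder().build(new File("expected/sales_fact.xml"));
            ITable expectedTable = expected.getTable("SALES_FACT");

            // Actual rows: whatever the current ETL run loaded, ordered for a stable comparison.
            ITable actualTable = connection.createQueryTable(
                    "SALES_FACT", "SELECT * FROM SALES_FACT ORDER BY sale_id");

            // Fails the test if any column value differs between the two result sets.
            Assertion.assertEquals(expectedTable, actualTable);
        }
    }
}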
The following question was asked in phase 2 of the industrial interviews.
14. The following challenges are faced by different DW testing practitioners across the
globe. Please select the challenges that you have faced during the validation and
verification activities of a DW.
Challenge name and description | Briefly indicate how you overcome the challenge

• There are no strict rules or steps which can help the designer in verifying whether the
degree of normalization is good for the design or not.
• Performance issues with queries design: poorly designed queries for reports or OLAP may
perform very badly in a high volume database environment. It is difficult to predict how the
queries of the report will perform in the production environment.
• Lack of time for designing, planning and creating test cases: due to the high number of
test cases and the unlimited testing scenarios, time is never enough to properly design, plan
and execute the test cases.
• Lack of fact to dimension ratio in created test data: the test data created may not contain
a sufficient number of records for a certain fact table and its dimensions. This makes the
data meaningless. (A minimal automated check for this is sketched after the table.)
• The tools that are able to create the test data are very specialized and cannot be reused
in a different environment.
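Note on the "lack of fact to dimension ratio" challenge above: a simple automated check can
flag generated test data in which dimension members have too few (or no) related fact rows.
The Java sketch below assumes a hypothetical PRODUCT_DIM / SALES_FACT star schema and an
arbitrary threshold of ten fact rows per dimension member; both are illustrative only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sanity check on generated test data: report how many fact rows each dimension
// member has, so that test data with too few (or zero) facts per dimension member
// is caught before it is used for testing. Schema names are placeholders.
public class TestDataRatioCheck {

    public static void main(String[] args) throws Exception {
        String sql =
            "SELECT d.product_id, COUNT(f.sale_id) AS fact_rows " +
            "FROM product_dim d LEFT JOIN sales_fact f ON f.product_id = d.product_id " +
            "GROUP BY d.product_id";

        try (Connection c = DriverManager.getConnection("jdbc:h2:mem:dwtest", "sa", "");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(sql)) {

            while (rs.next()) {
                long factRows = rs.getLong("fact_rows");
                if (factRows < 10) { // threshold is arbitrary; choose per test scenario
                    System.out.printf("Dimension member %d has only %d fact rows%n",
                            rs.getLong("product_id"), factRows);
                }
            }
        }
    }
}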
Appendix E: SLR Template
For this study, the following SLR template, as described by Biolchini [10], was used. Only
the headings that were applicable to our study are described.
1 REVIEW PLANNING
In this phase, the research objectives and the way the review will be executed must be
defined, which includes formulating the research questions and planning how the selection of
sources and studies will be carried out. The sections of the protocol template that guide the
planning phase are shown below.
1.2.3 Sources Identification
This item aims at selecting sources for the review execution.
• Sources Search Methods: describes how to execute the search for primary studies (for
instance, manual search, search through web search engines).
• Search String: in case one of the selected search methods includes using keywords in
search engines, it is necessary to create search strings to be run on such engines. This
item presents a set of logical expressions that combine keywords and their synonyms,
arranged in a way that the highest number of relevant studies is obtained from the search
engines.
• Sources List: the initial list of sources on which the systematic review will be executed.
• Studies Quality Evaluation: the procedures for study selection are applied to all obtained
articles in order to verify whether the studies fit the inclusion and exclusion criteria.
Moreover, it must be checked whether the studies belong to the types selected during the
planning phase. The objective of this section is to register the results of this evaluation.
• Selection Review: the study selection must be reviewed to guarantee that the quality
evaluation does not eliminate relevant articles. Here, independent reviewers may be useful.
The results of the review must be recorded in this item.
2 REVIEW EXECUTION
2.2 Resolution of divergences among reviewers
If reviewers don’t agree on the information extracted from the studies, the divergences must
be recorded. The reviewers must reach a consensus on this matter and register it in this
section.
3 RESULTS SUMMARIZATION
This systematic review protocol section aims to present the data resulting from the selected
studies.