Analyzing Dependability Data with AMBER
Analyzing Dependability Data with AMBER
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 3 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 4
DEPEND 2009 1
Marco Vieira, University of Coimbra, Portugal
DEPEND 2009 2
Marco Vieira, University of Coimbra, Portugal
• What happened?
• Repositories
Past – Data Warehouse
• What is happening?
• Why did it happen? Present • Analytical tools
• What will happen? – Reporting and querying
Future
– OLAP
• What do I want to happen?
– Data mining
• Information retrieval
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 13 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 14
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 15 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 16
Users
DEPEND 2009 3
Marco Vieira, University of Coimbra, Portugal
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 19 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 20
• Target oriented
• A temporal reference must be associated to all
• Data integration and consistency data in the database
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 21 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 22
• The data in the DW is never updated • The data warehouse must only store data
relevant for decision support
• The DW stores historic data (historic memory)
collected from the operational databases • Many operational data (needed for everyday
management) is not relevant for the DW
• After being load (from the operational
databases) there is only one operation:
– Queries
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 23 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 24
DEPEND 2009 4
Marco Vieira, University of Coimbra, Portugal
Multidimensional view
Partial denormalization
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 25 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 26
Product
Oil 5
• Data Warehouses built over complex E/R Sugar 3
models never succeed Coffee
Jan Feb Mar Apr
Date
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 27 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 28
DEPEND 2009 5
Marco Vieira, University of Coimbra, Portugal
• Relationships M:1 with the business • Represent a entry point for the analysis of the
dimensions facts
• Contains normally a large number of records • Represent different point-of-views for the
analysis of the facts
• Represents typically 95% of the space used
by the DW
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 31 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 32
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 33 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 34
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 35 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 36
DEPEND 2009 6
Marco Vieira, University of Coimbra, Portugal
Drill-Down Roll-up
Most generic category
Sales by time and
product Sales by store and
brand
Intermediate category
Full Detail
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 37 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 38
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 39 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 40
• Set of stores belonging to the same enterprise • Where to collect the data?
– POS - point of sales
• Goal: Analysis of sales
– Operational database
• Each store has several departments (food, • What to measure?
hygiene and cleaning, etc) – Sales
• Sells thousands of products • Goals?
– Maximize the profit
• Products are identified using a unique number – Maximum sales price possible
– Lower costs – More clients
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 41 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 42
DEPEND 2009 7
Marco Vieira, University of Coimbra, Portugal
ID_product
ID_product • Example: record the daily sales for all products
ID_time
description
full_description
ID_store
ID_promotion
ID_store
name
– Analyze in detail (price, quantity, etc) the products
SKU_number
package_size units_sold store_number
store_street_address
sold every day, in each store, …
brand purchase_cost city
subcategory
category
sale_value
num_Clients
store_county
store_state • Retail sales granularity:
department store_zip
package_type
diet_type
ID_time sales_district
sales_region
– Products x Store x Promotion x Day
weight date ID_promotion
store_manager
weight_unit_of_measure
units_per_retail_case
units_per_shipping_case
day_of_week
day_number_in_month
day_number_overall
number
name
type_price_red
store_phone
store_FAX • The granularity defines the detail of the DW and
floor_plan_type
cases_per_pallet
shelf_width_cm
week_number_in_year
week_number_overall
type_advertisement
type_poster
photo_processing_type
finance_services_type
has a strong impact in the size
shelf_height_cm Month Type_coupons first_opened_date
shelf_depth_cm
……...
quarter
fiscal_period
promotion_cost
start_date
last_remodel_date
store_sqft • The granularity must be adjusted to the
year end_date grocery_sqft
holiday_flag
……….
……... frozen_sqft
meat_sqft
analysis requirements
……...
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 45 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 46
ID_product
• Mandatory dimension that
ID_product
ID_product
ID_product
ID_time ID_time
description ID_store represents the DW temporal
ID_store description ID_store ID_store
full_description
SKU_number
ID_promotion
dependency name full_description
SKU_number
• Must characterize
ID_promotion the products name
store_number store_number
package_size units_sold
purchase_cost
store_street_address package_size as seenunits_sold
by the business
purchase_cost
store_street_address
brand
subcategory
• Must describe time city
sale_value
as seen by
store_county
brand
subcategory management
sale_value
city
store_county
category
department
the business management
num_Clients store_state category
department
num_Clients store_state
package_type
store_zip
sales_district package_type • Must contain the attributes that store_zip
sales_district
diet_type
ID_time
• IsID_promotion
typically generated in a
sales_region diet_type
ID_time
are relevant forID_promotion
posterior queries sales_region
weight date weight date
weight_unit_of_measure day_of_week synthetic
number manner store_manager
store_phone weight_unit_of_measure day_of_week number
store_manager
store_phone
units_per_retail_case
units_per_shipping_case
day_number_in_month
day_number_overall
name store_FAX units_per_retail_case • It is a strongly denormalized
day_number_in_month
day_number_overall
name store_FAX
DEPEND 2009 8
Marco Vieira, University of Coimbra, Portugal
ID_product
ID_product •ID_product
Characterizes the existing promotions
ID_product
• Must characterize the stores as
ID_time
ID_store
ID_time
ID_store
description
full_description
ID_store
seen by the business management
ID_promotion name
•description
In this example there is only ID_store
full_description one dimension related to name
ID_promotion
promotions
SKU_number store_number SKU_number store_number
package_size units_sold •package_size
Represents a very importantunits_sold
dimension
brand • Must contain the attributes
purchase_costthat are
store_street_address
city brand purchase_cost
store_street_address
subcategory sale_value • Managers want to know the impact
subcategory of promotions in the salescity
sale_value in order to
category relevant for posterior queries
num_Clients
store_county
store_state categorytarget new promotions to specificnum_Clients
store_county
products, stores and time store_state
department store_zip department store_zip
package_type
diet_type
• Includes
ID_time geographical attributes sales_district package_type
diet_type
ID_time sales_district
sales_region sales_region
weight
weight_unit_of_measure
(localization)
date
day_of_week
ID_promotion
number
store_manager weight
weight_unit_of_measure
date
day_of_week
ID_promotion
number
store_manager
store_phone store_phone
units_per_retail_case day_number_in_month name store_FAX units_per_retail_case day_number_in_month name store_FAX
• Includes
units_per_shipping_case time attributestype_price_red
day_number_overall floor_plan_type units_per_shipping_case day_number_overall type_price_red floor_plan_type
cases_per_pallet week_number_in_year type_advertisement photo_processing_type cases_per_pallet week_number_in_year type_advertisement photo_processing_type
shelf_width_cm (opening date,…).
week_number_overall type_poster finance_services_type shelf_width_cm week_number_overall type_poster finance_services_type
shelf_height_cm Month Type_coupons first_opened_date shelf_height_cm Month Type_coupons first_opened_date
shelf_depth_cm quarter promotion_cost last_remodel_date shelf_depth_cm quarter promotion_cost last_remodel_date
……... fiscal_period start_date store_sqft ……... fiscal_period start_date store_sqft
year end_date grocery_sqft year end_date grocery_sqft
holiday_flag ……... frozen_sqft holiday_flag ……... frozen_sqft
………. meat_sqft ………. meat_sqft
……... ……...
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 49 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 50
Questions
? 3. Using DW to analyze
dependability data
DEPEND 2009 9
Marco Vieira, University of Coimbra, Portugal
Field
Exp. dataN
System
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 55 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 56
?
Management System Exp. control data
Data Faults definition Target System
Warehouse •Net
•Statistical Readouts
Exp. Setup N •Reporting
(impact of faults)
Two types of data:
• General approach to store results from dependability
evaluation experiments • Measures collected from the target system (FACTS)
– For example, raw data representing error detection efficiency, recovery
• Data from different experiments can be compared/cross- time, failure modes, etc
exploit (only if it makes What’s
sense to compare)
inside? • Features of the target system and experimental setup
• Raw data is available (not only the final results)
that have impact on the measures (DIMENSIONS)
• Results can be analyzed and shared world wide by using – For example, attributes describing the target systems, the different
web-enabled versions of OLAP tools configurations, the workload, the faultload, etc
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 57 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 58
System B
ts
ge
System A
r
Ta
Faultload
Workload
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 59 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 60
DEPEND 2009 10
Marco Vieira, University of Coimbra, Portugal
The experimental setups are used as they are. You can use your Loading applications
favorite dependability evaluation tool and do the experiments • General purpose loading applications
in the usual way. It’s necessary…
• Some transformations in the data are normally necessary for
• To know the format of the raw results consistency
• To have access to the results
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 61 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 62
Data warehouse
Analysis
• Raw data is available in a standard star schema (facts + dimensions)
• Results from different experiments are compatible and can be compared/ • Commercial OLAP tools are used to analyze the raw data and
analyzed together, then they are stored in the same star schema (or in compute the measures. These tools are designed to be used by
scheme that share at least one dimension) managers: very easy to use :-)
• If results are from different unrelated experiments then they are stored in a • Just need an internet browser to analyze the data
separated schema
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 63 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 64
DEPEND 2009 11
Marco Vieira, University of Coimbra, Portugal
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 67 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 68
Test
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 69 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 70
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 71 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 72
DEPEND 2009 12
Marco Vieira, University of Coimbra, Portugal
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 73 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 74
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 75 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 76
ETL
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 77 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 78
DEPEND 2009 13
Marco Vieira, University of Coimbra, Portugal
?
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 79 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 80
• Vision
− Become a worldwide repository for
dependability related data
DEPEND 2009 14
Marco Vieira, University of Coimbra, Portugal
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 87 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 88
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 89 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 90
DEPEND 2009 15
Marco Vieira, University of Coimbra, Portugal
• Data extraction
− SQL scripts to extract data from the CSV files to a temporary
database schema (data staging area)
• Data transformation
− SQL scripts transform the data into an adequate format
• Data load
− SQL scripts to load the transformed data into the data
warehouse
• Executing the loading plans created before • Data ownership policies of ADR are divided in two main
groups
• If new data becomes available we just need to rerun
− Private data
the plans
− Proprietary data
− e.g., if the benchmark is executed in other systems
− Collaborative data
• The documentation of the DBench-OLTP includes
• For the DBench-OLTP data we have decided to use a
papers and technical reports
collaborative approach
− This is considered as part of the DBench-OLTP data
− Allows other potential users of the benchmark to compare
− It is loaded to the repository and made available to the their results with the ones available in the ADR
potential readers of the data
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 93 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 94
• On-line Analytical Processing (OLAP) tools • Selection of query type (crosstab or table) and
− Support the analysis in a very flexible way characteristics (title, graph, text area, etc)
− Provide high query performance and easy, intuitive data
navigation • Selection of measures and dimensional attributes
• Oracle Business Intelligence Discoverer Plus (ODP) • Setting the query layout
− Commercial tool included in Oracle Business Intelligence • Selection of the fields to be used to sort the results
package
− Widely used by industry Used freely for research purposes • Creation of parameters used to filter data
under an Oracle Academy Agreement
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 95 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 96
DEPEND 2009 16
Marco Vieira, University of Coimbra, Portugal
• Murphy's law…
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 97 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 98
http://www.amber-project.eu Questions
• Ralph Kimbal, Margy Ross, “The Data • Madeira, H., Costa, J., Vieira, M. , "The OLAP and Data
Warehousing Approaches for Analysis and Sharing of Results
Warehouse Toolkit: The Complete Guide to from Dependability Evaluation Experiments", International
Dimensional Modeling” (Second Edition), Ed. Conference on Dependable Systems and Networks, DSN-
J. Wiley & Sons, Inc, 2002. DCC 2003, San Francisco, CA, USA, June 2003
• Pintér, G., Madeira, H., Vieira, M., Pataricza, A., Majzik, I. , "A
• Ralph Kimbal, “The Data Warehouse Lifecycle Data Mining Approach to Identify Key Factors in Dependability
Toolkit”, Ed. J. Wiley & Sons, Inc, 2001. Experiments", Fifth European Dependable Computing
Conference (EDCC-5), Budapest, Hungary, April 2005
DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 101 DEPEND 2009, Athens/Glyfada, Greece, June 18,2009 102
DEPEND 2009 17
Marco Vieira, University of Coimbra, Portugal
ADR bibliography
DEPEND 2009 18