Database Concepts
Ninth Edition
Chapter 7
Data Warehouses, Business
Intelligence Systems, and Big Data
Slide in this Presentation Contain Hyperlinks. JAWS users should be
able to get a list of links by using INSERT+F7
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Learning Objectives
2
• Learn the basic concepts of data warehouses and data marts
• Learn the basic concepts of dimensional databases
• Learn the basic concepts of business intelligence (BI) systems
• Learn the basic concepts of online analytical processing (OLAP)
• Learn the basic concepts of virtualization and virtual machines
• Learn the basic concepts of cloud computing
• Learn the basic concepts of Big Data, structured storage, and the
MapReduce process
• Understand the limitations and tradeoffs of replicated, partitioned
stores as indicated by the CAP (Consistency, Availability, Partition
Tolerance) theorem
• Learn the basic concepts of JavaScript Object Notation (JSON) as a
way of structuring nonrelational data
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Big Data
3
Learn the basic concepts of data warehouses
and data marts
• Big Data is the current term for the enormous datasets
generated by applications such as search tools like Google and
Bing
• Web 2.0 social networks include:
– Facebook
– LinkedIn
– Twitter
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.1 Storage Capacity Terms
4
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 5.27 Heather Sweeney
5
Designs: Database Design
Note: From Chapter 5 showing the final database design for Heather Sweeney
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.2 The HSD Database
6
Diagram
Oracle MySQL Community Server 8.0, Oracle Corporation
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Business Intelligence Systems
7
Learn the basic concepts of business
intelligence (BI) systems
• Business intelligence (BI) systems are information systems that:
– assist managers and other professionals in the analysis of current and
past activities and in the prediction of future events
– do not support operational activities, such as the recording and
processing of orders
§ these are supported by transaction processing systems
– support management assessment, analysis, planning and control
• BI systems fall into two broad categories:
– reporting systems that sort, filter, group, and make elementary
calculations on operational data
– data mining applications that perform sophisticated analyses on data;
analyses that usually involve complex statistical and mathematical
processing
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.4 Characteristics of
8
Business Intelligence Applications
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
BI Terms
9
Learn the basic concepts of business
intelligence (BI) systems
• Online analytical processing (OLAP) is the processing of non-
operational data used in data mining and other applications
• Data lake is a repository that includes all data relevant to a business
(re: photos, documents, other files of any type)
• A data warehouse is a database system that has data, programs,
and personnel that specialize in the preparation of data for BI
processing.
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.3 The Relationship Between
10
Operational and BI Applications
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.5 Components of a Data
11
Warehouse
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Problems with Operational Data
12
Learn the basic concepts of business
intelligence (BI) systems
• “Dirty data,” examples include:
– “G” for gender
– “213” for age
• Missing values
• Inconsistent data
– data that has changed (ex: customer’s phone number)
• Nonintegrated data (data from multiple sources)
• Incorrect format (ex: too many or not enough digits
• Too much data (ex: an excess number of columns)
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
ETL Data Transformation
13
Learn the basic concepts of business
intelligence (BI) systems
• Data may need to be transformed for use in a data warehouse.
For example:
– {CountryCode → CountryName}
– “US” → “United States”
– Email address to Email domain
§
[email protected] → “somewhere.com”
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.6 Data Warehouses and
14
Data Marts
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Data Warehouses Versus Data Marts
15
Learn the basic concepts of data warehouses
and data marts
• A data mart is a collection of data that is smaller than the data
warehouse that addresses a specific component
• An enterprise data warehouse (EDW) architecture is when a
data mart is combined with the data warehouse architecture.
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.7 Characteristics of
16
Operational and Dimensional
Databases
• A dimensional database is a design used for efficient analysis
and efficient queries for data warehouses.
• Uses a star-schema with a fact table at the center (fully
normalized) and a dimensional tables radiating out from the
center (may be non-normalized), In a snowflake schema, each
dimension table is normalized.
• May use slowly changing dimensions to track data such as a
date or time dimension.
Operational Database Dimensional Database
Used for structured transaction data processing Used for unstructured analytical data
Current data are used Current and historical data are used
Data are inserted, updated, and deleted by Data are loaded and updated
users systematically, not by users
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.8 A Star Schema
17
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.9 The HSD–DW Star Schema
18
Oracle MySQL Community Server 8.0, Oracle Corporation
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
19
Figure 7.10 The HSD–DW SQL Create Table
Statements
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.11 The HSD–DW Table Data
20
Oracle MySQL Community Server 8.0, Oracle Corporation
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
A Query to Summarize Products Sold
21
by Customer and Product
Learn the basic concepts of business
intelligence (BI) systems
• The following SQL code is used to summarize products sold by
Customer and Product
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
22
Figure 7.12 The HSD–DW Query Results: Summarize Product Units
Sold by Customer and Product
Oracle MySQL Community Server 8.0, Oracle Corporation
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.13 The Two-Dimensional
23
ProductNumber-CustomerID Matrix
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.14 The Three-Dimensional
24
Time-ProductNumber-CustomerID
Cube
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.15 The HSD–DW Star
25
Schema Extended for RFM Analysis
Oracle MySQL Community Server 8.0, Oracle Corporation
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Online Analytical Processing (OLAP)
26
Learn the basic concepts of online analytical
processing (OLAP)
• Online Analytical Processing (OLAP) is a technique for
dynamically examining database data:
– OLAP uses arithmetic functions such as Sum and Average
– OLAP uses the dimensional database model
• OLAP systems produce OLAP reports (also called OLAP cube)
– An OLAP report uses inputs call dimensions
– An OLAP report calculates outputs called measures
– Excel PivotTables can be used to create OLAP reports
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
SQL Query for OLAP Data
27
Learn the basic concepts of online analytical
processing (OLAP)
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
SQL View for O LAP Data
28
Learn the basic concepts of online analytical
processing (OLAP)
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
29
Figure 7.16 The HSD-DW Query for OLAP
Results: Time-Product-Customer Cube
Oracle MySQL Community Server 8.0, Oracle Corporation
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.17 The HSD-DW OLAP
30
ProductNumber by City Report
Excel 2019, Windows 10, Microsoft Corporation
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.18 The HSD–DW OLAP
31
ProductNumber by City Report:
CustomerName and Year Dimensions
Added
Excel 2019, Windows 10, Microsoft Corporation
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.19 The HSD–DW OLAP City
32
by ProductNumber, CustomerName,
and Year Report
Excel 2019, Windows 10, Microsoft Corporation
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Distributed Database Processing
33
Learn the basic concepts of virtualization and
virtual machines
• A distributed database is a database that is stored and
processed on more than one computer.
• A service cluster is a group of associated servers where the
database shared between them is called a distributed
database.
• A database is distributed when it is:
– partitioned
– replicated
– both partitioned and replicated
• Databases are distributed for two major reasons – performance
and control.
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.20 Types of Distributed
34
Databases
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Object-Relational Databases
35
Learn the basic concepts of virtualization and
virtual machines
• Object-oriented programming (OOP) is a methodology for
designing and writing computer programs.
• Some object-oriented programming languages are Java,
Python, C++, C#, and Visual Basic.NET
• Objects are data structures that have both methods
(programs that perform some task with the object) and
properties (data items particular to an object).
• Object persistence is the permanent storing of values of
properties of an object in secondary memory (usually disk)
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
OODBMS
36
Learn the basic concepts of virtualization and
virtual machines
• Object-oriented DBMSs (OODBMSs) are special-purpose
DBMS products for storing data developed by several vendors
in the early 1990s. These never achieved commercial success.
– too expensive to transfer existing data from traditional
legacy databases
– not cost justifiable
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Virtualization
37
Learn the basic concepts of virtualization and
virtual machines
• Virtualization is using hardware and software to simulate
another hardware resource.
• Virtualization is done by having one physical computer host
one or more virtual computers, also known as virtual
machines.
• The host machine runs a special program known as a virtual
machine manager or hypervisor.
• There are two ways to implement hypervisors:
– “bare metal” or type 1 (used in large data centers)
– “hosted” or type 2 (used by students and others)
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.21 Type 1 and Type 2
38
Hypervisors versus Containers
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Cloud Computing
39
Learn the basic concepts of cloud computing
• Cloud computing services are ultimately provided by large data
centers.
• Storage area networks (SANs) have dedicated network paths from
servers to disk arrays, where several physical disks are combined
together to act as a single, larger disk.
• Redundant arrays of independent disks (RAID) can be configured
for maximum access speed or for reliability.
• Three basic ways to lease cloud services:
– Software as a service (SaaS), Ex: Salesforce.com
– Platform as a service (PaaS), Ex: operating systems, software
development tools & system programs provided
– Infrastructure as a service (IaaS), Ex: Only hardware provided
and users manage their own software
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Not Only SQL
40
Learn the basic concepts of Big Data,
structured storage, and the MapReduce
process
• The Not only SQL (previously NoSQL) movement is a movement to
use non-relational databases.
• These databases are often a distributed, replicated database to
support large datasets.
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
The CAP Theorem
41
Understand the limitations and tradeoffs of
replicated, partitioned stores as indicated by
the CAP (Consistency, Availability, Partition
Tolerance) theorem
• The CAP theorem defines three properties of distributed database
systems:
– Consistency means that all database replicas see the same data
at any given point in time
– Availability means that every request received by a server will
result in a response, as long as the network is available
– Partition tolerance means that the distributed database can
continue to operate even when the cluster is partitioned by
network failures into two or more disconnected sections
(partitions)
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.22 The CAP Theorem—You
42
Can’t Have All Three at the Same
Time!
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Categories of NoSQL Database
43
Management Systems
Learn the basic concepts of Big Data,
structured storage, and the MapReduce
process
• Four categories of NoSQL databases include:
– Key-Value database
§ Dynamo, MemcacheDB, Redis
– Document database
§ Couchbase, ArangoDB, Mongo DB
– Column family database
§ Vertica, Apache Cassandra, Hbase
– Graph database
§ Neo4j, AllegroGraph, Titan
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.23 An ArangoDB AQL Query
44
and JSON Results
Courtesy of ArangoDB
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
MapReduce and Hadoop
45
Learn the basic concepts of Big Data,
structured storage, and the MapReduce
process
• MapReduce is a technique to analyze extremely large datasets
by breaking down tasks and assigning them to a cluster of
computers and then combine them into the final product of the
original task.
• Hadoop Distributed File System (HDFS) provides standard
file services to clustered servers so their file systems can
function as one distributed, replicated file system that can
support large-scale MapReduce processing.
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.24 MapReduce
46
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Figure 7.25 Example of Apache Tez
47
Directed Acyclic Graph
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved
Copyright
48
This work is protected by United States copyright laws and is
provided solely for the use of instructors in teaching their
courses and assessing student learning. Dissemination or sale of
any part of this work (including on the World Wide Web) will
destroy the integrity of the work and is not permitted. The work
and materials from it should never be made available to students
except by instructors using the accompanying text in their
classes. All recipients of this work are expected to abide by these
restrictions and to honor the intended pedagogical purposes and
the needs of other instructors who rely on these materials.
Copyright © 2020, 2017, 2015 Pearson Education, Inc. All Rights Reserved