DWDM QB
Sourcing: Collects data from diverse operational sources (databases, flat files, APIs).
Acquisition: Moves data into staging areas or directly into the warehouse.
● Top-Down Approach (Inmon): Build a central enterprise warehouse first, then create
data marts for departments. Ensures integration and consistency.
● Bottom-Up Approach (Kimball): Develop data marts for individual departments first,
then integrate them into a warehouse. Faster to implement and more flexible.
● Hybrid Approach: Combines both methods to leverage their strengths—fast
development and integrated architecture.
Top-Down Approach: This method starts with designing the overall data warehouse
architecture first and then creating individual data marts.
Bottom-Up Approach: In this method, data marts are built first to meet specific business needs,
and later integrated into a central data warehouse.
Before diving deep into these approaches, we will first discuss the components of data
warehouse architecture.
External Sources: External sources are where data originates. These sources provide a variety
of data types, such as structured data (databases, spreadsheets), semi-structured data (XML,
JSON), and unstructured data (emails, images).
Staging Area: The staging area is a temporary space where raw data from external sources is
validated and prepared before entering the data warehouse. This process ensures that the data
is consistent and usable. To handle this preparation effectively, ETL (Extract, Transform,
Load) tools are used.
Extract (E): Pulls raw data from external sources.
Transform (T): Converts raw data into a standard, uniform format.
Load (L): Loads the transformed data into the data warehouse for further processing.
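As an illustration, here is a minimal ETL sketch in plain Python. The file name, column names, and cleaning rules are hypothetical; a real warehouse load would normally use dedicated ETL tooling.

    import csv

    # Extract: pull raw rows from an external source (a hypothetical CSV export).
    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    # Transform: standardize formats and drop rows that fail validation (illustrative rules).
    def transform(rows):
        cleaned = []
        for row in rows:
            try:
                cleaned.append({
                    "customer": row["customer"].strip().title(),
                    "amount": round(float(row["amount"]), 2),
                    "date": row["date"][:10],   # keep only the YYYY-MM-DD part
                })
            except (KeyError, ValueError):
                continue                        # skip malformed rows
        return cleaned

    # Load: move the cleansed rows into the warehouse (simulated here by a list).
    def load(rows, warehouse):
        warehouse.extend(rows)

    warehouse = []
    load(transform(extract("sales_raw.csv")), warehouse)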
Data Warehouse: The data warehouse acts as the central repository for storing cleansed and
organized data. It contains metadata and raw data. The data warehouse serves as the
foundation for advanced analysis, reporting, and decision-making.
Data Marts: A data mart is a subset of a data warehouse that stores data for a specific team or
purpose, like sales or marketing. It helps users quickly access the information they need for
their work.
Data Mining: Data mining is the process of analyzing large datasets stored in the data
warehouse to uncover meaningful patterns, trends, and insights. The insights gained can
support decision-making, identify hidden opportunities, and improve operational efficiency.
Top-Down Approach
The Top-Down Approach, introduced by Bill Inmon, is a method for designing data
warehouses that starts by building a centralized, company-wide data warehouse. This central
repository acts as the single source of truth for managing and analyzing data across the
organization. It ensures data consistency and provides a strong foundation for decision-
making.
Working of Top-Down Approach
Central Data Warehouse: The process begins with creating a comprehensive data warehouse
where data from various sources is collected, integrated, and stored. This involves the ETL
(Extract, Transform, Load) process to clean and transform the data.
Specialized Data Marts: Once the central warehouse is established, smaller, department-
specific data marts (e.g., for finance or marketing) are built. These data marts pull information
from the main data warehouse, ensuring consistency across departments.
Advantages of the Top-Down Approach
2. Improved Data Consistency: By sourcing all data marts from a single data warehouse, the
approach promotes standardization. This reduces the risk of errors and inconsistencies in
reporting, leading to more reliable business insights.
4. Better Scalability: The approach is highly scalable, allowing organizations to add new data
marts seamlessly as their needs grow or evolve. This is particularly beneficial for businesses
experiencing rapid expansion or shifting demands.
6. Reduced Data Duplication: Storing data only once in the central warehouse minimizes
duplication, saving storage space and reducing inconsistencies caused by redundant data.
7. Improved Reporting: A consistent view of data across all data marts enables more accurate
and timely reporting. This enhances decision-making and helps drive better business
outcomes.
8. Better Data Integration: With all data marts being sourced from a single warehouse,
integrating data from multiple sources becomes easier. This provides a more comprehensive
view of organizational data and improves overall analytics capabilities.
Disadvantages of the Top-Down Approach
3. Lack of Flexibility: Since the data warehouse and data marts are designed in advance,
adapting to new or changing business requirements can be difficult. This lack of flexibility
may not suit organizations that require dynamic and agile data reporting capabilities.
4. Limited User Involvement: The Top-Down Approach is often led by IT departments, which
can result in limited involvement from business users. This may lead to data marts that fail to
address the specific needs of end-users, reducing their overall effectiveness.
5. Data Latency: When data is sourced from multiple systems, the Top-Down Approach may
introduce delays in data processing and availability. This latency can affect the timeliness and
accuracy of reporting and analysis.
6. Data Ownership Challenges: Centralizing data in the data warehouse can create ambiguity
around data ownership and responsibilities. It may be unclear who is accountable for
maintaining and updating the data, leading to potential governance issues.
7. Integration Challenges: Integrating data from diverse sources with different formats or
structures can be difficult in the Top-Down Approach. These challenges may result in
inconsistencies and inaccuracies in the data warehouse.
8. Not Ideal for Smaller Organizations: Due to its high cost and resource requirements, the
Top-Down Approach is less suitable for smaller organizations or those with limited budgets
and simpler data needs.
Bottom-Up Approach
The Bottom-Up Approach, popularized by Ralph Kimball, takes a more flexible and
incremental path to designing data warehouses. Instead of starting with a central data
warehouse, it begins by building small, department-specific data marts that cater to the
immediate needs of individual teams, such as sales or finance. These data marts are later
integrated to form a larger, unified data warehouse.
Advantages of the Bottom-Up Approach
3. User Involvement: The Bottom-Up Approach encourages active involvement from business
users during the design and implementation process. Users can provide feedback on data marts
and reports, ensuring the solution meets their specific needs.
4. Flexibility: This approach is highly flexible, as data marts are designed based on the unique
requirements of specific business functions. It is particularly beneficial for organizations that
require dynamic and customizable reporting and analysis.
5. Faster Time to Value: With quicker implementation compared to the Top-Down Approach,
the Bottom-Up Approach delivers faster time to value. This is especially useful for smaller
organizations with limited resources or businesses looking for immediate results.
6. Reduced Risk: By creating and refining individual data marts before integrating them into a
larger data warehouse, this approach reduces the risk of failure. It also helps identify and
resolve data quality issues early in the process.
7. Scalability: The Bottom-Up Approach is scalable, allowing organizations to add new data
marts as needed. This makes it an ideal choice for businesses experiencing growth or
undergoing significant change.
8. Clarified Data Ownership: Each data mart is typically owned and managed by a specific
business unit, which helps clarify data ownership and accountability. This ensures data
accuracy, consistency, and proper usage across the organization.
9. Lower Cost and Time Investment: Compared to the Top-Down Approach, the Bottom-Up
Approach requires less upfront cost and time to design and implement. This makes it an
attractive option for organizations with budgetary or time constraints.
Disadvantages of the Bottom-Up Approach
2. Data Silos: This approach can result in the creation of data silos, where different business
units develop their own data marts independently. This lack of coordination may cause
redundancies, data inconsistencies, and difficulties in integrating data across the organization.
3. Integration Challenges: Integrating multiple data marts into a unified data warehouse can be
challenging. Differences in data structures, formats, and granularity may lead to issues with
data quality, accuracy, and consistency.
5. Lack of Enterprise-Wide View: Since data marts are typically designed to meet the needs of
specific departments, this approach may not provide a comprehensive, enterprise-wide view of
data. This limitation can hinder strategic decision-making and limit an organization’s ability to
analyze data holistically.
6. Complexity in Management: Managing and maintaining multiple data marts with varying
complexities and granularities can be more challenging compared to a centralized data
warehouse. This can lead to higher maintenance efforts and potential difficulties in ensuring
long-term scalability.
7. Risk of Inconsistency: The decentralized nature of the Bottom-Up Approach increases the
risk of data inconsistency. Differences in data structures and definitions across data marts can
make it difficult to compare or combine data, reducing the reliability of reports and analyses.
Relational OLAP
ROLAP servers are placed between the relational back-end server and the client front-end tools.
To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines to store and manage warehouse data,
which enables fast access to precomputed, summarized data.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of
ROLAP and the faster computation of MOLAP. HOLAP servers allow large volumes of detailed
data to be stored, while the aggregations are kept in a separate MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only environment.
OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP
operations on multidimensional data. The basic operations are:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube, either by climbing up a concept hierarchy for a
dimension or by dimension reduction.
For example, suppose the concept hierarchy for the dimension location is "street < city <
province < country". On rolling up, the data is aggregated by ascending the location hierarchy
from the level of city to the level of country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from
the data cube.
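A minimal sketch of roll-up using pandas, assuming a toy sales table whose location hierarchy maps each city to its country (all names and figures are illustrative):

    import pandas as pd

    sales = pd.DataFrame({
        "country": ["Canada", "Canada", "USA", "USA"],
        "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
        "amount":  [1000, 600, 1200, 1500],
    })

    # Roll-up: ascend the location hierarchy from city to country by aggregating amounts.
    by_country = sales.groupby("country", as_index=False)["amount"].sum()
    print(by_country)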
Drill-down
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data,
either by stepping down a concept hierarchy for a dimension or by introducing additional
dimensions.
For example, suppose the concept hierarchy for the dimension time is "day < month < quarter <
year". On drilling down, the time dimension is descended from the level of quarter to the level
of month.
When drill-down is performed by adding dimensions, one or more new dimensions are added to
the data cube.
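The reverse direction can be sketched the same way, assuming a toy table in which each month is tagged with its quarter (illustrative values):

    import pandas as pd

    sales = pd.DataFrame({
        "quarter": ["Q1", "Q1", "Q1", "Q2"],
        "month":   ["Jan", "Feb", "Mar", "Apr"],
        "amount":  [200, 250, 300, 220],
    })

    # Quarter-level view (less detailed).
    by_quarter = sales.groupby("quarter", as_index=False)["amount"].sum()

    # Drill-down: descend the time hierarchy from quarter to month (more detailed).
    by_month = sales.groupby(["quarter", "month"], as_index=False)["amount"].sum()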
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
Here the slice is performed for the dimension "time" using the criterion time = "Q1", forming a
new sub-cube from the selected data.
Dice
The dice operation selects two or more dimensions from a given cube and provides a new
sub-cube. Consider the following diagram that shows the dice operation.
Here the dice operation is based on selection criteria that involve three dimensions.
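Both operations can be sketched as filters on a toy fact table in pandas (the dimension values below are hypothetical):

    import pandas as pd

    cube = pd.DataFrame({
        "time":     ["Q1", "Q1", "Q2", "Q2"],
        "location": ["Toronto", "Chicago", "Toronto", "Chicago"],
        "item":     ["Mobile", "Modem", "Mobile", "Phone"],
        "amount":   [605, 825, 680, 512],
    })

    # Slice: fix a single value on one dimension, e.g. time = "Q1".
    slice_q1 = cube[cube["time"] == "Q1"]

    # Dice: apply selection criteria on two or more dimensions to get a smaller sub-cube.
    dice = cube[
        cube["time"].isin(["Q1", "Q2"])
        & cube["location"].isin(["Toronto", "Chicago"])
        & cube["item"].isin(["Mobile", "Modem"])
    ]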
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows the
pivot operation.
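As a sketch, pivoting corresponds to swapping the row and column dimensions of a 2-D view (pandas, illustrative data):

    import pandas as pd

    cube = pd.DataFrame({
        "location": ["Toronto", "Toronto", "Chicago", "Chicago"],
        "item":     ["Mobile", "Modem", "Mobile", "Modem"],
        "amount":   [605, 825, 680, 512],
    })

    # View with item on the rows and location on the columns...
    view1 = cube.pivot(index="item", columns="location", values="amount")

    # ...then rotate (pivot) the axes so location is on the rows and item on the columns.
    view2 = view1.T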
OLAP vs OLTP
OLAP systems are used by knowledge workers such as executives, managers, and analysts,
whereas OLTP systems are used by clerks, DBAs, or database professionals.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube: roll-up,
drill-down, slice, dice, and pivot.
13. Describe data mining techniques and their issues and applications.
4) Active learning:
In this approach, the user is asked to label an example.
The model quality is optimized based on the knowledge obtained from human users.
2.3.3 Database Systems and Data Warehouses
The aim is to create, maintain, and use databases for organizations and end users.
Various techniques, such as query languages, query processing and optimization methods, data
storage, and indexing and access methods, are used to facilitate this.
A data warehouse integrates data from various sources and timeframes.
2.3.4 Information Retrieval
This technique searches for documents or for information within documents.
The system assumes that -
1) The data to be searched is unstructured.
2) The queries are formed by keywords.
A language model is a probability distribution used to generate the bag of words in a document.
A topic in a text document can also be modeled as a probability distribution over the
vocabulary; this is known as a topic model.
[Figure legend: noise/outliers, positive examples, negative examples, unlabeled examples]
To handle such varied data, effective and efficient data mining tools are needed.
2. Mining dynamic, networked, and global data repositories:
Knowledge discovery from structured, semi-structured, or unstructured interconnected data is
very challenging.
Web mining and multisource data mining techniques are needed to handle this data.
2.4.5 Data Mining and Society
There are many issues which need to be addressed while using data mining in day-to-day life.
Some of these are:
1. Social impact of data mining
2. Privacy preserving data mining
3. Invisible data mining
2.5 Applications of Data Mining
Data mining has many applications in various domains.
In this section we discuss two successful applications of data mining:
1. Business intelligence
2. Search engines
2.5.1 Business Intelligence
The term Business Intelligence (BI) refers to technologies, applications and
practices for the collection, integration, analysis and presentation of business
information.
The purpose of BI is to support better business decision making.
Data mining is required in BI to perform effective market analysis, to compare customer
feedback on similar products, to discover the strengths and weaknesses of competitors, to retain
highly valuable customers, and to make smart business decisions.
Data warehousing and multidimensional data mining are used in online analytical processing
(OLAP) tools.
Classification and prediction techniques are used in predictive analytics.
Clustering is used in customer relationship management.
Mean
The first measure is the mean, i.e., the average.
To calculate the mean, add together all of the numbers in the data set and divide that sum by the
number of values.
Let x1, x2, ..., xN be a set of N values or observations of some numeric attribute X. The mean of
this data is
\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i
Weighted Arithmetic Mean or the Weighted Average
Sometimes each value xi in a set may be associated with a weight wi, for i = 1, ..., N.
The weights reflect the significance, importance, or occurrence frequency attached to their
respective values.
In this case, the weighted arithmetic mean or weighted average is calculated as
\bar{x} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}
For the given data set of items, the quantity of each item is taken as its weight w, and the
weighted arithmetic mean is calculated accordingly.
To offset the effect caused by a small number of extreme values, one can use the trimmed mean.
The trimmed mean is obtained after chopping off values at the high and low
extremes.
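A short sketch of these three measures in Python; the cost values and weights are hypothetical:

    import statistics

    costs = [40, 45, 52, 56, 60, 70, 250]     # illustrative values; 250 is an extreme value
    weights = [2, 1, 3, 1, 2, 1, 1]           # hypothetical item quantities used as weights

    mean = statistics.mean(costs)
    weighted_mean = sum(w * x for w, x in zip(weights, costs)) / sum(weights)

    # Trimmed mean: chop off values at the high and low extremes (here, one from each end).
    trimmed_mean = statistics.mean(sorted(costs)[1:-1])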
Median
Another measure of central tendency is the median, which is the middle number when the
values are listed in order from least to greatest.
For skewed (asymmetric) data a better measure of the center of data is the median.
It is the middle value in a set of ordered data values.
The median is the value that separates the higher half of a data set from the lower
half
To calculate median
Suppose that a given data set of N values for an attribute X is sorted in
increasing order.
If N is odd then the median is the middle value of the ordered set.
If N is even then the median is not unique; it is the two middlemost values and
any value in between.
If X is a numeric attribute, then by convention the median is taken as the average of the two
middlemost values.
For the sample dataset given, we can find the median of the ‘Cost of items’ attribute, which has
a total of 12 values. Since 12 is an even number, the median is taken as the average of the two
middlemost values, i.e., the average of 52 and 56. Hence the median is 54.
The median is expensive to compute when we have a large data set. In that case, the data can be
grouped into intervals and the median approximated by interpolation:
\text{median} \approx L_1 + \left( \frac{N/2 - (\sum freq)_l}{freq_{median}} \right) \times width
where
L_1 is the lower boundary of the median interval,
N is the number of values in the entire data set,
(\sum freq)_l is the sum of the frequencies of all of the intervals that are lower than the median
interval,
freq_{median} is the frequency of the median interval, and
width is the width of the median interval.
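A small sketch of the interpolation formula on hypothetical grouped data, where each interval is given as (lower bound, upper bound, frequency):

    # Hypothetical frequency table.
    intervals = [(0, 10, 4), (10, 20, 7), (20, 30, 9), (30, 40, 5)]
    N = sum(freq for _, _, freq in intervals)

    cum = 0                                    # sum of frequencies below the current interval
    for low, high, freq in intervals:
        if cum + freq >= N / 2:                # this interval contains the median
            L1, width, freq_median, freq_l = low, high - low, freq, cum
            break
        cum += freq

    median_approx = L1 + ((N / 2 - freq_l) / freq_median) * width
    print(median_approx)                       # ~21.7 for the data above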
Mode
The mode is another measure of central tendency.
The mode for a set of data is the value that occurs most frequently in the set.
Therefore, it can be determined for qualitative and quantitative attributes.
It is possible for the greatest frequency to correspond to several different values,
which results in more than one mode.
Data sets with one, two, or three modes are respectively called unimodal, bimodal,
and trimodal.
In general, a data set with two or more modes is multimodal. At the other extreme,
if each data value occurs only once, then there is no mode.
For unimodal numeric data that are moderately skewed (asymmetrical), we have the following
empirical relation:
\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})
The midrange can also be used to assess the central tendency of a numeric data set.
It is the average of the largest and smallest values in the set: midrange = (max + min) / 2.
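A quick sketch of both measures with Python's statistics module (illustrative values):

    import statistics

    values = [25, 30, 30, 35, 40, 40, 45]

    modes = statistics.multimode(values)          # all most-frequent values; handles multimodal data
    midrange = (max(values) + min(values)) / 2    # average of the largest and smallest values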
In a unimodal frequency curve with perfect symmetric data distribution, the mean,
median, and mode are all at the same center value as shown in Fig. 2.7.1
Data in most real applications are not symmetric. They may instead be either
positively skewed, where the mode occurs at a value that is smaller than the
median, or negatively skewed, where the mode occurs at a value greater than the
median (Refer Fig. 2.7.1).
(a) Symmetric data (b) Positively skewed data (c) Negatively skewed data
Fig. 2.7.1 Mean, median, and mode of symmetric versus positively and negatively skewed data
Measuring the Dispersion of Data
The measures of data dispersion include quantiles, quartiles, percentiles, the interquartile range
(IQR), and the variance and standard deviation.
The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.
The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets. The median, quartiles, and
percentiles are the most widely used forms of quantiles.
Interquartile Range
The quartiles give an indication of a distribution’s center, spread, and shape. The
first quartile, denoted by Q1, is the 25th percentile.
It cuts off the lowest 25 % of the data. The third quartile, denoted by Q3, is the 75th
percentile - it cuts off the lowest 75 % (or highest 25 %) of the data.
The second quartile is the 50th percentile. As the median, it gives the center of the
data distribution.
The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as IQR = Q3 − Q1.
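A minimal sketch of quartiles and the IQR with NumPy (illustrative values):

    import numpy as np

    values = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

    q1, q2, q3 = np.percentile(values, [25, 50, 75])   # first quartile, median, third quartile
    iqr = q3 - q1                                       # range covered by the middle half of the data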
Outliers
In statistics, an outlier is a data point that differs significantly from other observations or
patterns (refer Fig. 2.7.3).
An outlier is an observation that lies an abnormal distance from other values in a random
sample from a population.
A common rule of thumb is to flag as suspected outliers the values falling at least 1.5 × IQR
above the third quartile or below the first quartile.
Variance and Standard Deviation
The variance of N observations x1, x2, ..., xN for a numeric attribute X is
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2
where \bar{x} is the mean of the observations. The standard deviation, \sigma, is the square root
of the variance.
\sigma measures spread about the mean and should be considered only when the mean is chosen
as the measure of center.
\sigma = 0 only when there is no spread, that is, when all observations have the same value;
otherwise, \sigma > 0.
The standard deviation is a good indicator of the spread of a data set.
The computation of the variance and standard deviation is scalable in large
databases.
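For example, with NumPy (population variance, i.e. dividing by N):

    import numpy as np

    values = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

    variance = np.var(values)     # mean squared deviation from the mean
    std_dev = np.std(values)      # standard deviation = square root of the variance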
Graphic Displays of Basic Statistical Descriptions of Data
Graphs are helpful for the visual inspection of data, which is useful for data
preprocessing.
These include quantile plots, quantile-quantile (q-q) plots, histograms, and scatter plots.
The first three of these show univariate distributions (i.e., data for one attribute), while scatter
plots show bivariate distributions (i.e., involving two attributes).
Quantile Plots
A quantile plot is a simple and effective way to have a first look at a univariate data
distribution.
First, it displays all of the data for the given attribute (allowing the user to assess
both the overall behavior and unusual occurrences).
Second, it plots quantile information.
Let xi, for i = 1 to N, be the data sorted in increasing order, so that x1 is the smallest observation
and xN is the largest, for some ordinal or numeric attribute X.
Each observation xi is paired with a percentage fi, which indicates that approximately 100·fi %
of the data are below the value xi.
Note that the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median, and the
0.75 quantile is Q3.
The fi values increase in equal steps of 1/N, ranging from 1/(2N) (which is slightly above 0) to
1 − 1/(2N) (which is slightly below 1).
On a quantile plot, xi is graphed against fi. This allows us to compare different distributions
based on their quantiles.
Quantile-Quantile (q-q) Plots
A quantile-quantile (q-q) plot graphs the quantiles of one univariate distribution against the
corresponding quantiles of another, allowing the user to view whether there is a shift from one
distribution to the other.
Suppose we have two sets of observations for the same attribute, for example data collected at
two different branches.
Let x1, ..., xN be the data from the first branch, and y1,..., yM be the data from the
second, where each data set is sorted in increasing order.
If M = N (i.e., the number of points in each set is the same), then we simply plot yi
against xi , where yi and xi are both (i - 0.5)/N quantiles of their respective data
sets.
If M < N (i.e., the second branch has fewer observations than the first), there can be
only M points on the q-q plot. Here, yi is the (i - 0.5)/M quantile of the y data,
which is plotted against the (i - 0.5)/M quantile of the x data. This computation
typically involves interpolation.
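A rough sketch of both plots with NumPy and matplotlib; the two samples are synthetic stand-ins for data from two branches:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = np.sort(rng.normal(50, 10, 100))   # first data set, sorted in increasing order
    y = np.sort(rng.normal(55, 12, 80))    # second, smaller data set

    # Quantile plot: graph each sorted value x_i against f_i = (i - 0.5) / N.
    f = (np.arange(1, len(x) + 1) - 0.5) / len(x)
    plt.plot(f, x, ".")

    # q-q plot: with M < N, interpolate the larger set's quantiles at the
    # (i - 0.5) / M points of the smaller set and plot the pairs against each other.
    f_y = (np.arange(1, len(y) + 1) - 0.5) / len(y)
    plt.figure()
    plt.plot(np.quantile(x, f_y), y, ".")
    plt.show()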
Histograms
Histograms (or frequency histograms) are at least a century old and are widely used.
“Histos” means pole or mast, and “gram” means chart, so a histogram is a chart of
poles.
Plotting histograms is a graphical method for summarizing the distribution of a
given attribute, X.
If X is nominal, such as automobile model or item type, then a pole or vertical bar is
drawn for each known value of X.
The height of the bar indicates the frequency (i.e., count) of that X value. The
resulting graph is more commonly known as a bar chart.
If X is numeric, the term histogram is preferred.
The range of values for X is partitioned into disjoint consecutive subranges.
The subranges, referred to as buckets or bins, are disjoint subsets of the data
distribution for X.
The range of a bucket is known as the width.
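A minimal histogram sketch with matplotlib; the price values are synthetic:

    import numpy as np
    import matplotlib.pyplot as plt

    prices = np.random.default_rng(0).normal(100, 20, 500)   # illustrative numeric attribute

    # Partition the value range into disjoint, equal-width buckets (bins)
    # and draw one bar per bucket, with height equal to the bucket's frequency.
    plt.hist(prices, bins=10, edgecolor="black")
    plt.xlabel("price")
    plt.ylabel("frequency")
    plt.show()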
Scatter Plots
A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.
If the pattern of plotted points slopes from upper left to lower right, the values of X increase as
the values of Y decrease, suggesting a negative correlation (Fig. 2.7.9 (b)).
A line of best fit can be drawn to study the correlation between the variables.
Fig. 2.7.10 shows three cases for which there is no correlation between the two attributes in
each of the given data sets.
Conclusion
Basic data descriptions (e.g., measures of central tendency and measures of
dispersion) and graphic statistical displays (e.g., quantile plots, histograms, and
scatter plots) provide valuable insight into the overall behavior of your data. By
helping to identify noise and outliers, these techniques are also useful for data cleaning.