
Unit 1 - Data Warehouse Introduction

Prepared by: Varun Rao (Dean, Data Science & AI)


For: Data Science - 1st years

An idea on Data Warehouse


A Data Warehouse is a relational database management system (RDBMS) that is constructed to support query and analysis rather than transaction processing. It can be loosely described as any centralized data repository which can be queried for business benefit. A data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data warehouse is the core of the BI system, which is built for data analysis and reporting.

A data warehouse system is also known by the following names:


● Decision Support System (DSS)
● Executive Information System
● Management Information System
● Business Intelligence Solution
● Analytic Application
● Data Warehouse

A Data Warehouse can be viewed as a data system with the following attributes:

● It is a database designed for investigative tasks, using data from various applications.
● It supports a relatively small number of clients with relatively long interactions.
● It includes current and historical data to provide a historical perspective of information.
● Its usage is read-intensive.
● It contains a few large tables.

Steps in Data Warehousing
The following steps are involved in the process of data warehousing:

1. Extraction of data – A large amount of data is gathered from various sources.
2. Cleaning of data – Once the data is compiled, it goes through a cleaning process. The data is scanned for errors, and any error found is either corrected or excluded.
3. Conversion of data – After being cleaned, the format is changed from the database format to a warehouse format.
4. Storing in a warehouse – Once converted to the warehouse format, the data stored in the warehouse goes through processes such as consolidation and summarization to make it easier and more coordinated to use. As sources get updated over time, more data is added to the warehouse.
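
As a rough illustration of the four steps above, here is a minimal extract-transform-load (ETL) sketch in Python. The file name sales.csv, the SQLite file warehouse.db, and the column names are hypothetical placeholders, and the cleaning and summarization rules are deliberately simple.

```python
# Minimal ETL sketch: extract from a CSV source, clean, convert, and load
# into a warehouse table. File and column names are illustrative only.
import sqlite3
import pandas as pd

# 1. Extraction: gather raw data from a source system (here, a CSV export).
raw = pd.read_csv("sales.csv")          # hypothetical source file

# 2. Cleaning: drop rows with errors (missing keys, negative amounts).
clean = raw.dropna(subset=["order_id", "amount"])
clean = clean[clean["amount"] >= 0]

# 3. Conversion: reshape into the warehouse format (typed, summarized).
clean["order_date"] = pd.to_datetime(clean["order_date"])
daily_summary = (clean
                 .groupby(clean["order_date"].dt.date)["amount"]
                 .sum()
                 .reset_index(name="total_amount"))

# 4. Storing: append the summarized data to the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    daily_summary.to_sql("daily_sales", conn, if_exists="append", index=False)
```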

Goals of Data Warehousing


● To help reporting as well as analysis

● Maintain the organization's historical information

● Be the foundation for decision making.

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to process enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and query.
4. Queries that would be complex in many normalized databases can be easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method of managing demand for lots of information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.

Difference between KDD and Data Mining

KDD (Knowledge Discovery in Databases) is a field of computer science that includes the tools and theories to help humans extract useful and previously unknown information (i.e. knowledge) from large collections of digitized data. KDD consists of several steps, and Data Mining is one of them. Data Mining is the application of specific algorithms to extract patterns from data.

KDD is a computer science field specializing in extracting previously unknown and interesting information from raw data. KDD is the whole process of trying to make sense of data by developing appropriate methods or techniques. This process deals with mapping low-level data into other forms that are more compact, abstract, and useful. This is achieved by creating short reports, modeling the process that generated the data, and developing predictive models that can predict future cases.

What is Data Mining?

As mentioned above, Data Mining is only a step within the overall KDD process. There are two major data mining goals, defined by the goal of the application: verification and discovery. Verification is verifying the user's hypothesis about the data, while discovery is automatically finding interesting patterns. There are four major data mining tasks: clustering, classification, regression, and association (summarization). Clustering is identifying similar groups from unstructured data. Classification is learning rules that can be applied to new data. Regression is finding functions with minimal error to model the data. Association is looking for relationships between variables.

KDD Process Steps


Knowledge discovery in the database process includes the following steps, such as:

1. Goal identification: Develop an understanding of the application domain and the relevant prior knowledge, and identify the goal of the KDD process from the customer's perspective.

2. Creating a target data set: Selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

3. Data cleaning and preprocessing: Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.

4. Data reduction and projection: Finding useful features to represent the data, depending on the purpose of the task. The effective number of variables under consideration may be reduced through dimensionality reduction or transformation methods, or invariant representations for the data can be found.

5. Matching process objectives: Matching the goals of the KDD process (from step 1) to a particular data mining method, for example summarization, classification, regression, clustering, and others.

6. Modeling and exploratory analysis and hypothesis selection: Choosing the data mining algorithm(s) and selecting the method(s) to be used for searching for data patterns. This process includes deciding which models and parameters may be appropriate (e.g., models for categorical data are different from models for vectors over the reals) and matching a particular data mining method with the overall approach of the KDD process (for example, the end-user might be more interested in understanding the model than in its predictive capabilities).

7. Data Mining: Searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data mining method by correctly performing the preceding steps.

8. Presentation and evaluation: Interpreting the mined patterns, possibly returning to some of the steps between steps 1 and 7 for additional iterations. This step may also involve visualization of the extracted patterns and models, or visualization of the data given the extracted models.

9. Taking action on the discovered knowledge: Using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to stakeholders. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
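
As a rough, non-prescriptive illustration, the sketch below strings several of these steps together on a synthetic numeric dataset using scikit-learn. The choices of imputation, scaling, PCA for reduction, clustering as the mining method, and the silhouette score for evaluation are assumptions made for the example, not requirements of the KDD framework.

```python
# Sketch of a KDD-style pipeline on synthetic data: create a target data set,
# clean/preprocess it, reduce dimensionality, mine patterns, and evaluate them.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Step 2: create a target data set (synthetic, with some missing values).
X = rng.normal(size=(200, 6))
X[rng.random(X.shape) < 0.05] = np.nan

# Step 3: data cleaning and preprocessing (handle missing fields, scale).
X_clean = SimpleImputer(strategy="mean").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_clean)

# Step 4: data reduction and projection (dimensionality reduction).
X_reduced = PCA(n_components=2).fit_transform(X_scaled)

# Steps 5-7: choose a mining method (clustering here) and search for patterns.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)

# Step 8: evaluate the discovered patterns (silhouette as a simple measure).
print("silhouette:", silhouette_score(X_reduced, labels))
```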

Stages of the Data Mining Process

Data Mining is a process of discovering various models, summaries, and derived values from a given collection of data. The general experimental procedure adapted to data-mining problems involves the following steps:

1. State the problem and formulate the hypothesis – In this step, a modeler usually specifies a set of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. There may also be several hypotheses formulated for a single problem at this stage. This first step requires the combined expertise of an application domain and a data-mining model. In practice, it usually means close interaction between the data-mining expert and the application expert. In successful data-mining applications, this cooperation does not stop in the initial phase; it continues during the whole data-mining process.
2. Collect the data – This step is concerned with how the data are generated and collected. In general, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler); this approach is known as a designed experiment. The second possibility is when the expert cannot influence the data-generation process; this is known as the observational approach. An observational setting, namely random data generation, is assumed in most data-mining applications. Typically, the sampling distribution is completely unknown after the data are collected, or it is partially and implicitly given in the data-collection procedure. It is important, however, to understand how data collection affects the theoretical distribution of the data, since such prior knowledge is often useful for modeling and, later, for the final interpretation of results. It is also important to make sure that the data used for estimating a model and the data used later for testing and applying the model come from the same unknown sampling distribution. If this is not the case, the estimated model cannot be successfully used in the final application of the results.
3. Data preprocessing – In the observational setting, data is usually "collected" from existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks (a small preprocessing sketch follows this list):
○ (i) Outlier detection (and removal): Outliers are unusual data values that are not consistent with most observations. Commonly, outliers result from measurement errors and coding and recording errors, and, sometimes, they are natural, abnormal values. Such non-representative samples can seriously affect the models produced later. There are two strategies for dealing with outliers: detect and eventually remove outliers as a part of the preprocessing phase, or develop robust modeling methods that are insensitive to outliers.
○ (ii) Scaling, encoding, and selecting features: Data preprocessing includes several steps such as variable scaling and different types of encoding. For example, one feature with a range of [0, 1] and another with a range of [100, 1000] will not have the same weight in the applied technique; they will also influence the final data-mining results differently. Therefore, it is recommended to scale them and bring both features to the same weight for further analysis. Also, application-specific encoding methods usually achieve dimensionality reduction by providing a smaller number of informative features for subsequent data modeling.
4. Estimate the model – The selection and implementation of the appropriate data-mining technique is the main task in this phase. This process is not straightforward; usually, in practice, the implementation is based on several models, and selecting the best one is an additional task.
5. Interpret the model and draw conclusions – In most cases, data-mining models should help in decision making. Hence, such models need to be interpretable in order to be useful, because humans are not likely to base their decisions on complex "black-box" models. Note that the goals of accuracy of the model and accuracy of its interpretation are somewhat contradictory. Usually, simple models are more interpretable, but they are also less accurate. Modern data-mining methods are expected to yield highly accurate results using high-dimensional models. The problem of interpreting these models, which is also very important, is considered a separate task, with specific techniques to validate the results.
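
The following is a minimal NumPy sketch of the two preprocessing tasks described in step 3: outlier removal using an assumed z-score cutoff of 3, followed by min-max scaling to [0, 1]. The data and thresholds are invented for illustration only.

```python
# Preprocessing sketch: z-score based outlier removal followed by min-max
# scaling, so features with very different ranges get comparable weight.
import numpy as np

rng = np.random.default_rng(1)
# Two features with very different ranges, as in the example above.
data = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(100, 1000, 100)])
data[0] = [5.0, 9000.0]   # inject one obvious outlier row

# (i) Outlier detection and removal: drop rows with any |z-score| > 3.
z = (data - data.mean(axis=0)) / data.std(axis=0)
inliers = data[(np.abs(z) <= 3).all(axis=1)]

# (ii) Scaling: bring every feature into the range [0, 1].
mins, maxs = inliers.min(axis=0), inliers.max(axis=0)
scaled = (inliers - mins) / (maxs - mins)

print(scaled.min(axis=0), scaled.max(axis=0))   # each column now spans [0, 1]
```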

Data Mining Techniques

1. Association

Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used for market basket or transaction data analysis. Association rule mining is a significant and exceptionally active area of data mining research. One method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered.
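
Below is a small, self-contained sketch of the idea behind Apriori-style association mining on a toy basket data set: count the support of item pairs and keep rules whose support and confidence exceed assumed thresholds. The transactions and thresholds are invented for illustration, and the sketch only goes up to 2-itemsets rather than implementing the full level-wise Apriori algorithm.

```python
# Toy market-basket sketch: frequent item pairs and simple association rules.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support, min_confidence = 0.4, 0.6
n = len(transactions)

# Support counts for single items and for item pairs.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions
                      for pair in combinations(sorted(t), 2))

# Frequent pairs: support = count / number of transactions.
frequent_pairs = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}

# Rules A -> B with confidence = support(A, B) / support(A).
for (a, b), support in frequent_pairs.items():
    for lhs, rhs in [(a, b), (b, a)]:
        confidence = support / (item_counts[lhs] / n)
        if confidence >= min_confidence:
            print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```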

2. Classification

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e. data objects whose class label is known).
Data mining uses different types of classifiers (a short example follows this list):
● Decision Tree
● SVM(Support Vector Machine)
● Generalized Linear Models
● Bayesian Classification
● Classification by Backpropagation
● K-NN Classifier (K-nearest neighbor)
● Rule-Based Classification
● Frequent-Pattern Based Classification
● Rough set theory
● Fuzzy Logic
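
As a brief example of one classifier from the list above (a decision tree), the scikit-learn sketch below trains on the bundled Iris data set and predicts class labels for unseen samples. The split ratio and tree depth are arbitrary choices made for the example.

```python
# Decision tree classification sketch: learn a model from labeled training
# data and use it to predict class labels for unseen samples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)               # learn rules from the training data
predictions = model.predict(X_test)       # apply them to new, unseen data

print("accuracy:", accuracy_score(y_test, predictions))
```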

3. Prediction

Data prediction is a two-step process, similar to that of data classification. However, for prediction, we do not use the term "class label attribute" because the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). The attribute can be referred to simply as the predicted attribute. Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled object, or to assess the value or value ranges of an attribute that a given object is likely to have.
4. Clustering

Unlike classification and prediction, which analyze class-labeled data objects or attributes, clustering analyzes data objects without consulting a known class label. In general, the class labels do not exist in the training data simply because they are not known to begin with. Clustering can be used to generate these labels. The objects are clustered based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. That is, clusters of objects are created so that objects inside a cluster have high similarity to each other, but are very dissimilar to objects in other clusters. Each cluster that is generated can be seen as a class of objects, from which rules can be inferred. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.
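
A short sketch of clustering unlabeled data follows, using agglomerative (hierarchical) clustering from scikit-learn on synthetic blobs; the number of clusters and the synthetic data are assumptions made for illustration.

```python
# Clustering sketch: group unlabeled objects so that objects within a cluster
# are similar to each other and dissimilar to objects in other clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Synthetic, unlabeled data (the generated labels are discarded).
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Hierarchical (agglomerative) clustering with an assumed cluster count.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print("cluster sizes:", [list(labels).count(k) for k in range(3)])
```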

5. Regression

Regression can be defined as a statistical modeling method in which previously obtained data is used to predict a continuous quantity for new observations. This classifier is also known as the Continuous Value Classifier. There are two types of regression models: linear regression and multiple linear regression models.
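
Here is a minimal multiple linear regression sketch with scikit-learn on synthetic data; the coefficients used to generate the data are arbitrary, and the fitted model simply recovers them approximately.

```python
# Multiple linear regression sketch: fit a continuous target from two
# predictor variables and predict a value for a new observation.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
# Synthetic target: y = 3*x1 + 2*x2 + 5 plus a little noise.
y = 3 * X[:, 0] + 2 * X[:, 1] + 5 + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("prediction for [4, 7]:", model.predict([[4.0, 7.0]]))
```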

6. Artificial Neural network (ANN) Classifier Method

An artificial neural network (ANN), also referred to simply as a "neural network" (NN), is a computational model based on biological neural networks. It consists of an interconnected collection of artificial neurons. A neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input samples. Neural network learning is also referred to as connectionist learning due to the connections between units. Neural networks involve long training times and are therefore more appropriate for applications where this is feasible.
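
To illustrate how a network "learns by adjusting the weights", here is a tiny single-neuron (perceptron) sketch in NumPy trained on the logical AND function; the learning rate and epoch count are arbitrary choices, and a real ANN would have many such units arranged in layers.

```python
# Single-neuron sketch: each input connection has a weight, and learning
# adjusts those weights until the unit predicts the correct class label.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
y = np.array([0, 0, 0, 1])                        # AND target labels

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.1

for epoch in range(20):
    for xi, target in zip(X, y):
        prediction = 1 if xi @ weights + bias > 0 else 0
        error = target - prediction
        # Connectionist learning: nudge each weight by its contribution.
        weights += learning_rate * error * xi
        bias += learning_rate * error

print("learned weights:", weights, "bias:", bias)
print("predictions:", [(1 if xi @ weights + bias > 0 else 0) for xi in X])
```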

7. Outlier Detection

A database may contain data objects that do not comply with the general
behavior or model of the data. These data objects are Outliers. The investigation
of OUTLIER data is known as OUTLIER MINING. An outlier may be detected
using statistical tests which assume a distribution or probability model for the
data, or using distance measures where objects having a small fraction of “close”
neighbors in space are considered outliers.
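
The following small NumPy sketch shows the distance-based idea mentioned above: a point is flagged as an outlier when it has too few neighbors within a chosen radius. The radius and neighbor threshold are arbitrary assumptions for the example.

```python
# Distance-based outlier detection sketch: flag points that have fewer than
# `min_neighbors` other points within distance `radius`.
import numpy as np

rng = np.random.default_rng(2)
points = rng.normal(0, 1, size=(50, 2))
points = np.vstack([points, [[8.0, 8.0]]])   # one far-away, obvious outlier

radius, min_neighbors = 1.5, 3

# Pairwise Euclidean distances between all points.
diffs = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Count neighbors within the radius (excluding the point itself).
neighbor_counts = (distances <= radius).sum(axis=1) - 1
outliers = np.where(neighbor_counts < min_neighbors)[0]

print("outlier indices:", outliers)
```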

Knowledge Representation

Knowledge representation is the presentation of knowledge to the user for visualization in terms of trees, tables, rules, graphs, charts, matrices, etc.
For example: Histograms

Histograms
● A histogram provides a representation of the distribution of values of a single attribute.
● It consists of a set of rectangles that reflects the counts or frequencies of
the classes present in the given data.
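
A quick sketch of building a histogram over a single attribute with NumPy follows; the data and number of bins are arbitrary choices for the example.

```python
# Histogram sketch: counts of a single attribute's values across equal-width bins.
import numpy as np

rng = np.random.default_rng(3)
ages = rng.integers(18, 65, size=200)     # one attribute, e.g. customer age

counts, bin_edges = np.histogram(ages, bins=5)
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"[{left:5.1f}, {right:5.1f})  {'#' * (count // 2)}")
```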

Geometric projection visualization techniques

Techniques used for geometric projection visualization are:

i. Scatter-plot matrices
It consists of scatter plots of all possible pairs of variables in a dataset.

ii. Hyper slice

It is an extension of scatter-plot matrices. It represents a multidimensional function as a matrix of orthogonal two-dimensional slices.

iii. Parallel coordinates

● The axes are defined by equally spaced parallel vertical lines.
● A point in Cartesian coordinates corresponds to a polyline in parallel coordinates.
Icon-based visualization techniques
● Icon-based visualization techniques are also known as iconic display
techniques.
● Each multidimensional data item is mapped to an icon.
● This technique allows visualization of large amounts of data.
● The most commonly used technique is Chernoff faces.

Some of the visualization techniques are:

i. Dimensional stacking

● In dimensional stacking, an n-dimensional attribute space is partitioned into 2-dimensional subspaces.
● Attribute values are partitioned into various classes.
● Each element is displayed in two-dimensional space in the form of an x-y plot.
● The most important attributes are used on the outer levels.

ii. Mosaic plot

● A mosaic plot gives a graphical representation of successive decompositions.
● Rectangles are used to represent the counts of categorical data, and at every stage the rectangles are split in parallel.

iii. Worlds within worlds

● Worlds within worlds is useful for generating an interactive hierarchy of displays.
● The innermost world must have a function and the two most important parameters.
● The remaining parameters are fixed at constant values.
● Through this, an N-dimensional view of the data is possible using devices such as data gloves and stereo displays, including rotation, scaling (inner) and translation (inner/outer).
● Using queries, static interaction is possible.

iv. Tree maps


● Tree map visualization techniques are well suited for displaying large amounts of hierarchically structured data.
● The visualization space is divided into multiple rectangles that are ordered according to a quantitative variable.
● The levels in the hierarchy are shown as rectangles containing other rectangles.
● Each set of rectangles on the same level in the hierarchy represents a category, a column, or an expression in a data set.

v. Visualizing complex data and relations

● This technique is used to visualize non-numeric data, for example text, pictures, blog entries, and product reviews.
● A tag cloud is a visualization method that helps to understand the information in user-generated tags.
● It is also possible to arrange the tags alphabetically or according to user preferences, with different font sizes and colors.
