DATA MINING
• It is the process of analyzing data from different
perspectives and summarizing it into useful
information - information that can be used to
increase revenue, cut costs, or both
(http://www.anderson.ucla.edu).
• Also defined as the process of extracting valid,
previously unknown, comprehensible, and actionable
information from large databases and using it to
make crucial business decisions (Connolly & Begg,
2005).
• Technically, it is a process of discovering
meaningful patterns and relationships that lie
hidden within very large databases (Seidman,
2001).
• Refers to the mining or discovery of new
information in terms of patterns or rules from
vast amounts of data.
• Keyword here is patterns:
So what is a pattern?
• A set of events that occur with enough frequency
in the dataset to reveal a relationship between
them. Revealing the relationship is usually an
inductive reasoning process.
THE MATHEMATICS OF DATA MINING
• Mathematicians have provided an ideal
framework within which to conduct data mining,
called the “EUCLIDEAN SPACE”, and the
mathematical theory describing it is known as
linear algebra.
• So what is the Euclidean space?
GOALS OF DATA MINING
• Prediction
• Classification
• Optimization
• Identification
STYLES OF DATA MINING
• Directed data mining: takes the form of predictive
modelling where we know exactly what we want to
predict
• It classifies data for use in making predictions or
estimates with the goal of deriving target values
• E.g., banks may use it to predict defaulters on loans;
businesses may use it to decide whom to market their
products to
• Uses popular data mining algorithms such as
decision trees (which will be discussed later on in detail)
• Undirected data mining: finds patterns
in the data and leaves it up to the user to
determine whether or not these patterns are
important
• Data is placed in a format that makes it easier
for us to make sense of it
• Most commonly used algorithm is clustering,
which clumps data together in groups based on
common characteristics (to be discussed later in detail)
• One can then take one of the derived clusters
and apply the decision tree algorithm to it so
that they focus on a particular segment of the
cluster
DATA MINING METHODOLOGY
DATA MINING ALGORITHMS
• A data mining algorithm is a well-defined
procedure that takes data as input and produces
models or patterns as output
DECISION TREES
• This algorithm analyzes the data and creates a
repeating series of branches until no more
relevant branches can be made
• The end result is a binary tree structure where the
splits in the branches can be followed along
specific criteria to find the most desired result
• Decision Tree (DT):
– Tree where the root and each internal node is labeled
with a question.
– The arcs represent each possible answer to the
associated question.
– Each leaf node represents a prediction of a solution to
the problem.
• Popular technique for classification; leaf node
indicates class to which the corresponding tuple
belongs.
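The tree structure described above (questions at the root and internal nodes, arcs for the answers, class labels at the leaves) can be sketched in Python. This is only an illustration of how a finished tree classifies a tuple; it does not learn the tree from data. The loan attributes, thresholds, and class labels are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # class predicted for any tuple that reaches this leaf


@dataclass
class Node:
    question: Callable[[dict], bool]  # question asked at this node
    yes: Union["Node", Leaf]          # arc followed when the answer is True
    no: Union["Node", Leaf]           # arc followed when the answer is False


def classify(tree, record):
    """Follow the arcs from the root until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(record) else tree.no
    return tree.label


# Hypothetical hand-built binary tree for a loan-default example.
tree = Node(
    question=lambda r: r["income"] >= 30000,
    yes=Node(
        question=lambda r: r["missed_payments"] > 2,
        yes=Leaf("default"),
        no=Leaf("repay"),
    ),
    no=Leaf("default"),
)

print(classify(tree, {"income": 45000, "missed_payments": 0}))  # repay
print(classify(tree, {"income": 20000, "missed_payments": 0}))  # default
```

A real decision tree algorithm would choose the questions and thresholds itself by analyzing training data; here they are fixed by hand to keep the sketch short.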
CLUSTERING
• This algorithm groups data into clusters
• The goal of clustering is to place records into
groups, such that records in a group are similar
to each other and dissimilar to records in other
groups
• An important facet of clustering is the
similarity function that is used
• The Euclidean distance (the ordinary or straight
line distance between two points) can be used
to measure similarity
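As a minimal sketch of how the Euclidean distance can act as the similarity function, the Python below assigns each record to the nearest of a set of cluster centers (a smaller distance means more similar). The sample points, the two centers, and the helper names `euclidean` and `assign_to_clusters` are assumptions made for illustration; a full clustering algorithm would also have to find the centers.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_clusters(records, centers):
    """Place each record in the group whose center is closest to it."""
    clusters = {i: [] for i in range(len(centers))}
    for rec in records:
        nearest = min(range(len(centers)),
                      key=lambda i: euclidean(rec, centers[i]))
        clusters[nearest].append(rec)
    return clusters


# Hypothetical two-dimensional records and cluster centers.
records = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.0, 1.5), (8.5, 9.0)]

clusters = assign_to_clusters(records, centers)
print(euclidean((0, 0), (3, 4)))  # 5.0
print(clusters)
```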
ASSOCIATION RULE MINING
• It is an important data mining model initially
used for Market Basket Analysis to find how
items purchased by customers are related
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence,
not causality!
DEFINITION: ASSOCIATION RULE
● Association Rule
– An implication expression of the form
X → Y, where X and Y are itemsets
– Example:
{Milk, Diaper} → {Beer}
● Rule Evaluation Metrics
– Support (s)
◆ Fraction of transactions that contain
both X and Y
– Confidence (c)
◆ Measures how often items in Y
appear in transactions that
contain X
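The two metrics can also be written as formulas. The notation is an assumption introduced here and does not appear in the slides: σ(Z) stands for the number of transactions that contain every item of itemset Z, and |T| for the total number of transactions.

```latex
s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|},
\qquad
c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
```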
MINING ASSOCIATION RULES
Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
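The support and confidence values listed above can be checked with a short Python sketch. The market-basket table itself is not reproduced in the text, so the five transactions below are an assumption, chosen because they yield exactly the reported values; the function names `support` and `confidence` are likewise illustrative.

```python
# Assumed data: the original transaction table is missing from the text.
# These five baskets are an illustrative guess consistent with the rules above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y):
    """How often the items in y appear in transactions that contain x."""
    return support(x | y) / support(x)


# All six rules are binary partitions of {Milk, Diaper, Beer}.
rules = [
    ({"Milk", "Diaper"}, {"Beer"}),
    ({"Milk", "Beer"}, {"Diaper"}),
    ({"Diaper", "Beer"}, {"Milk"}),
    ({"Beer"}, {"Milk", "Diaper"}),
    ({"Diaper"}, {"Milk", "Beer"}),
    ({"Milk"}, {"Diaper", "Beer"}),
]

for x, y in rules:
    print(f"{sorted(x)} -> {sorted(y)}: "
          f"s={support(x | y):.2f}, c={confidence(x, y):.2f}")
```

Because every rule is built from the same itemset, `support(x | y)` is the same for all six; only the denominator `support(x)` changes, which is why the confidence values differ.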
CROSS-INDUSTRY STANDARD PROCESS FOR DATA
MINING (CRISP-DM)
CRISP-DM: OVERVIEW
• CRISP-DM is a comprehensive data mining
methodology and process model that provides
anyone, from novices to data mining experts,
with a complete blueprint for conducting a data
mining project.
• CRISP-DM breaks down the life cycle of a data
mining project into six phases.
CRISP-DM: PHASES
Business Understanding
• Understanding project objectives and
requirements; Data mining problem definition
Data Understanding
• Initial data collection and familiarization; Identify
data quality issues; Initial, obvious results
Data Preparation
• Record and attribute selection; Data cleansing
Modeling
• Run the data mining tools
Evaluation
• Determine if results meet business objectives;
Identify business issues that should have been
addressed earlier
Deployment
• Put the resulting models into practice; Set up for
continuous mining of the data