Data Mining
— Introduction —
Why Data Mining?
The explosive growth of data (abundant data): from terabytes
to petabytes
Data collection and data availability
Automated data collection tools, database systems,
Web, computerized society
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the
automated analysis of massive data sets
Defining Data Mining
Sifting through very large amounts of data for useful
information. Data mining uses artificial intelligence
techniques, neural networks, and advanced statistical tools
(such as cluster analysis) to reveal trends, patterns, and
relationships, which might otherwise have remained
undetected. In contrast to an expert system (which draws
inferences from the given data on the basis of a given set of
rules) data mining attempts to discover hidden rules
underlying the data. Also called data surfing.
Data Mining Techniques
The most commonly used techniques in data mining are:
1- Artificial neural networks: Non-linear predictive models that
learn through training and resemble biological neural networks
in structure.
2- Decision trees: Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification
of a dataset.
3- Genetic algorithms: Optimization techniques that use
processes such as genetic combination, mutation, and natural
selection in a design based on the concepts of evolution.
4- Nearest neighbor method: A technique that classifies each
record in a dataset based on a combination of the classes of
the k record(s) most similar to it in a historical dataset (where
k ≥ 1). Sometimes called the k-nearest neighbor technique.
5- Rule induction: The extraction of useful if-then rules from data
based on statistical significance.
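
As an illustration of technique 4, here is a minimal sketch of k-nearest neighbor classification. It assumes Euclidean distance as the similarity measure and a majority vote as the "combination" of classes; the function name `knn_classify` and the toy historical dataset are hypothetical and not taken from these slides.

```python
import math
from collections import Counter

def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k nearest historical records.

    `history` is a list of (features, label) pairs with numeric feature tuples.
    """
    # Order the historical records by Euclidean distance to the query.
    by_distance = sorted(history, key=lambda rec: math.dist(rec[0], query))
    # Combine the classes of the k closest records by majority vote.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset: (feature vector, class label).
history = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((4.0, 4.2), "high"),
    ((4.1, 3.9), "high"),
    ((3.8, 4.0), "high"),
]

print(knn_classify(history, (1.1, 1.0), k=3))  # low
print(knn_classify(history, (3.9, 4.1), k=3))  # high
```

Other distance functions or weighted votes are equally valid choices; the sketch only shows the basic idea of classifying by similarity to past records.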
Applications of Data Mining
There is a rapidly growing body of successful applications in
a wide range of areas as diverse as:
analysis of organic compounds
weather forecasting
predicting share of television audiences
medical diagnosis
financial forecasting
automatic abstracting
credit card fraud detection
targeted marketing
electric load prediction
toxic hazard analysis
Application examples
These are only a few of many. Some examples of applications
(potential or actual) are:
1– a supermarket chain mines its customer transaction data to
optimise targeting of high-value customers.
2– a credit card company can use its data warehouse of
customer transactions for fraud detection.
3– a major hotel chain can use survey databases to identify
attributes of a 'high-value' prospect.
4– a lender can predict the probability of default for consumer
loan applications, improving its ability to identify bad loans.
5– reducing fabrication flaws in VLSI chips.
6– data mining systems can sift through vast quantities of data
collected during the semiconductor fabrication process to
identify conditions that are causing yield problems.
7– predicting audience share for television programmers,
allowing television executives to arrange show schedules to
maximize market share and increase advertising revenues.
8– predicting the probability that a cancer patient will respond
to chemotherapy, thus reducing health-care costs without
affecting quality of care.
Knowledge Discovery in Databases (KDD) Process
The KDD process is defined as “the nontrivial process of
identifying valid, novel, potentially useful, and ultimately
understandable (comprehensible) patterns in data” [Fayyad
et al. (1996)].
Valid: are the discovered patterns representative of the data?
Novel: are the discovered patterns new to the organization?
Useful: can the organization use the discovered patterns?
Comprehensible: can we understand the discovered patterns?
Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process.
[Figure: KDD process flow: Databases → Data Cleaning and Data
Integration → Data Warehouse → Selection and Transformation →
Task-relevant Data → Data Mining → Pattern Evaluation]
KDD Process: Several Key Steps
1. Preprocessing steps:
Data cleaning (to remove noise and inconsistent data).
Data integration (where multiple data sources may be
combined).
Data transformation (where data are transformed into
forms appropriate for mining).
2. Data mining (an essential process where intelligent
methods are applied in order to extract data patterns).
3. Post-processing steps:
Pattern evaluation (to identify the truly interesting patterns).
Knowledge presentation (to present the mined knowledge to
the user as rules, tables, pie/bar charts, concept hierarchies,
trees, etc.).
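
The key steps can be sketched end to end as a small pipeline. This is only a toy illustration under stated assumptions: the two data sources, the function names (`clean`, `integrate`, `transform`, `mine`, `evaluate`), and the use of simple item counting as the "mining" step are all hypothetical, chosen to keep each stage visible.

```python
from collections import Counter

# Two hypothetical sources holding (customer_id, item) purchase records.
store_a = [(1, "milk"), (2, "bread"), (2, None), (3, "milk")]
store_b = [(4, "Milk "), (5, "beer"), (5, "beer"), (6, "milk")]

def clean(records):
    """Data cleaning: drop records with missing items and exact duplicates."""
    seen, out = set(), []
    for rec in records:
        if rec[1] is not None and rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [rec for src in sources for rec in src]

def transform(records):
    """Data transformation: normalise item names to a common form."""
    return [(cid, item.strip().lower()) for cid, item in records]

def mine(records):
    """Data mining (toy stand-in): count how often each item occurs."""
    return Counter(item for _, item in records)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep only items seen at least `min_count` times."""
    return {item: n for item, n in patterns.items() if n >= min_count}

data = transform(integrate(clean(store_a), clean(store_b)))
patterns = evaluate(mine(data))

# Knowledge presentation: show the surviving patterns as a small table.
for item, n in sorted(patterns.items()):
    print(f"{item:<6} {n}")
```

In a real project each stage would be far more involved, but the order of the stages mirrors the list above.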
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) patterns or knowledge
from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
dredging, information harvesting, etc.
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the centre of its contributing disciplines:
Database Technology, Statistics, Machine Learning, Visualization,
Pattern Recognition, Algorithm, Other Disciplines]
Data Mining Functionalities
General functionality
Descriptive data mining
Find human-interpretable patterns that describe
the data.
Predictive data mining
Use some variables to predict unknown or future
values of other variables.
Data Mining Tasks…
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Classification: Definition
Given a collection of records (training set):
Each record contains a set of attributes; one of
the attributes is the class.
Find a model for the class attribute as a
function of the values of the other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of
the model. Usually, the given data set is
divided into training and test sets, with training
set used to build the model and test set used
to validate it.
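
The train/test protocol can be sketched as follows. This is a minimal illustration, not a recommended classifier: the data are hypothetical (one numeric attribute, with class True when x >= 11), the model is a single-threshold rule, and the names `split`, `train_stump`, and `accuracy` are invented for the example.

```python
def split(records, test_every=4):
    """Hold out every `test_every`-th record as the test set."""
    train = [r for i, r in enumerate(records) if i % test_every != 0]
    test = [r for i, r in enumerate(records) if i % test_every == 0]
    return train, test

def train_stump(train):
    """Build the model: the threshold t that makes 'x >= t' fit the training set best."""
    best_threshold, best_correct = None, -1
    for threshold, _ in train:
        correct = sum((x >= threshold) == label for x, label in train)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

def accuracy(threshold, records):
    """Fraction of records whose predicted class matches the true class."""
    hits = sum((x >= threshold) == label for x, label in records)
    return hits / len(records)

# Hypothetical labelled records: (attribute value, class).
records = [(x, x >= 11) for x in range(1, 21)]

train, test = split(records)    # 15 training records, 5 test records
model = train_stump(train)      # model built from the training set only
print("threshold:", model)
print("test accuracy:", accuracy(model, test))
```

The important point is that the test records play no part in building the model, so the measured accuracy estimates performance on previously unseen records.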
Classification Example
Attribute types: Refund (categorical), Marital Status (categorical),
Taxable Income (continuous), Cheat (class).

Training set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Classifier → Model; the Model is then applied
to the Test Set.
Clustering Definition
Given a set of data points, each having a
set of attributes, and a similarity measure
among them, find clusters such that
Data points in one cluster are more similar to
one another.
Data points in separate clusters are less similar
to one another.
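
One common way to find such clusters is k-means, sketched below. The slides do not name a specific algorithm, so this is an assumed choice: Euclidean distance stands in for the similarity measure, and the points and starting centroids are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid closest to this point.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of its points (unchanged if empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical data points with two numeric attributes each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])

for cluster in clusters:
    print(cluster)
```

Points that end up in the same cluster are close to one another, while the two clusters are far apart, which matches the two conditions in the definition.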
Association Rule Discovery: Definition
Association rule mining searches for interesting relationships
among items in a given dataset.
Which items are frequently purchased by my customers?
Market basket analysis.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} → {Coke} (support = 0.6, confidence = 0.75)
{Diaper, Milk} → {Beer}
If a customer buys diapers and milk, then he is very
likely to buy beer.
So, don’t be surprised if you find six-packs stacked
next to diapers!
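
The support and confidence of the two rules can be checked directly against the five transactions. This is a small sketch; the helper names `count`, `support`, and `confidence` are hypothetical, and support is taken as the fraction of transactions containing every item in the rule.

```python
# The five market-basket transactions from the table (TID 1 to 5).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For {Milk} → {Coke} this gives support 0.60 and confidence 0.75; for {Diaper, Milk} → {Beer} it gives support 0.40 and confidence 0.67.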
Regression
Predict a value of a given continuous valued
variable based on the values of other variables,
assuming a linear or nonlinear model of
dependency.
Extensively studied in statistics and in the neural network field.
Examples:
Predicting sales amounts of a new product based
on advertising expenditure.
Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
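
For the linear case, a minimal sketch of regression is an ordinary least-squares line fitted to one predictor. The advertising and sales figures below are hypothetical and serve only to show the calculation; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: advertising expenditure and resulting sales.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(advertising, sales)
print(f"sales = {a:.2f} + {b:.2f} * advertising")
print(f"predicted sales at advertising = 6: {a + b * 6:.2f}")
```

A nonlinear dependency would need a different model form, but the workflow is the same: fit the parameters on known data, then predict the continuous value for new inputs.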
Deviation/Anomaly Detection
Detect significant deviations from normal
behavior
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
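
A very simple way to flag significant deviations is to mark values that lie more than a chosen number of standard deviations from the mean. The sketch below assumes that rule; the transaction amounts, the threshold of 2, and the name `find_deviations` are hypothetical, and real fraud or intrusion detection systems use far richer models.

```python
from statistics import mean, stdev

def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Hypothetical credit card transaction amounts; one is far from normal behavior.
amounts = [20, 25, 22, 19, 24, 21, 23, 500, 20, 22]

print(find_deviations(amounts))  # [500]
```

Because a single extreme value also inflates the mean and standard deviation, this rule can miss anomalies in small samples; it is shown only to make the idea of "deviation from normal behavior" concrete.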
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns; not all of
them are interesting.
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid
on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to
confirm.
Objective vs. subjective interestingness measures
Objective (data driven): based on statistics and structures of
patterns, e.g., support, confidence (degree of certainty), etc.
Subjective (user driven): based on the user’s belief in the data, e.g.,
unexpectedness (contradicting a user’s belief), novelty (previously
unknown), actionability (use of discovered knowledge), etc.
Pattern Interestingness Measure
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty (A → B)
e.g., confidence = #(A and B) / #(A), classification
reliability or accuracy, certainty factor, rule strength, rule
quality
Support = #(A and B) / #(Domain)
Coverage = #(A and B) / #(B)
Novelty
not previously known, surprising.