Data Mining
Data Mining Overview
• Data warehouses and OLAP (Online Analytical Processing)
• Association Rules Mining
• Clustering: Hierarchical and Partition approaches
• Classification: Decision Trees and Bayesian classifiers
• Sequential Pattern Mining
• Advanced topics: graph mining, privacy-preserving data
mining, outlier detection, spatial data mining
What is Data Mining?
• Data Mining is:
(1) The efficient discovery of previously
unknown, valid, potentially useful,
understandable patterns in large datasets
(2) The analysis of (often large) observational
data sets to find unsuspected relationships
and to summarize the data in novel ways that
are both understandable and useful to the
data owner
Overview of terms
• Data: a set of facts (items) D, usually stored in
a database
• Pattern: an expression E in a language L, that
describes a subset of facts
• Attribute: a field in an item i in D.
• Interestingness: a function I_{D,L} that maps an
expression E in L into a measure space M
Overview of terms
• The Data Mining Task:
For a given dataset D, language of facts L,
interestingness function I_{D,L} and threshold c,
find the expression E such that I_{D,L}(E) > c
efficiently.
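The task definition above can be sketched in a few lines. This is only an illustration under stated assumptions: the language L is taken to be the set of itemsets over the items in D, the interestingness function is taken to be support (the fraction of facts containing the itemset), and the names `support` and `mine` are hypothetical. The brute-force enumeration ignores the "efficiently" requirement; it only shows the threshold test.

```python
from itertools import combinations


def support(itemset, dataset):
    """Assumed interestingness function: fraction of facts containing the itemset."""
    return sum(1 for fact in dataset if itemset <= fact) / len(dataset)


def mine(dataset, threshold, max_size=2):
    """Enumerate itemsets up to max_size and keep those scoring above the threshold."""
    items = sorted(set().union(*dataset))
    results = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            score = support(frozenset(combo), dataset)
            if score > threshold:  # strict inequality, as in the definition
                results[combo] = score
    return results


# Toy dataset D: each fact is a set of items (made-up values).
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

patterns = mine(D, threshold=0.5)
print(patterns)
```

With threshold c = 0.5, only the single items "bread" and "milk" pass, since each appears in 3 of the 4 facts; "butter" sits exactly at 0.5 and fails the strict inequality.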
Knowledge Discovery
Steps of a KDD Process
• Learning the application domain
– Relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction.
• Choosing functions of data mining
– Summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
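The middle steps of the KDD process (selection, cleaning, transformation, mining) can be sketched as a chain of small functions. Everything here is a made-up illustration: the records, the field names, the age cut-off of 40, and the choice of summarization as the mining function are assumptions, not part of the process description.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 34, "city": "Boston", "bought": "yes"},
    {"id": 2, "age": None, "city": "Boston", "bought": "no"},
    {"id": 3, "age": 51, "city": "Austin", "bought": "yes"},
    {"id": 4, "age": 29, "city": "Austin", "bought": "no"},
    {"id": 5, "age": 62, "city": "Boston", "bought": "yes"},
]


def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Data transformation: discretize age into a coarse feature."""
    return [
        {"age_group": "under40" if r["age"] < 40 else "40plus", "bought": r["bought"]}
        for r in records
    ]


def summarize(records):
    """Mining step (summarization): count outcomes per age group."""
    return Counter((r["age_group"], r["bought"]) for r in records)


target = select(raw, ["age", "bought"])
summary = summarize(transform(clean(target)))
print(summary)
```

Keeping each step as a separate function makes it easy to return to an earlier step, which the process often requires once the patterns have been evaluated.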
Architecture: Typical Data Mining System
[Figure: layered architecture, listed top to bottom]
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Knowledge base
• Database or data warehouse server
• Data cleaning, data integration, and filtering
• Data sources: databases and data warehouse
Data Mining: On What Kinds of Data?
• Relational database
• Data warehouse
• Transactional database
• Advanced database and information repository
– Spatial and temporal data
– Time-series data
– Stream data
– Multimedia database
– Text databases & WWW
Examples of Large Datasets
• Government: IRS, NGA, …
• Large corporations
– Walmart: 20M transactions per day
– Mobil: 100 TB geological databases
– AT&T: 300M calls per day
– Credit card companies
• Scientific
– NASA, EOS project: 50 GB per hour
– Environmental datasets
Examples of Data Mining Applications
1. Fraud detection: credit cards, phone cards
2. Marketing: customer targeting
3. Data Warehousing: Walmart
4. Astronomy
5. Molecular biology
How Data Mining is used
1. Identify the problem
2. Use data mining techniques to transform
the data into information
3. Act on the information
4. Measure the results
The Data Mining Process
1. Understand the domain
2. Create a dataset:
– Select the interesting attributes
– Data cleaning and preprocessing
3. Choose the data mining task and the specific
algorithm
4. Interpret the results, and possibly return to 2
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
• Must address:
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining at the overlap of Statistics, AI / Machine Learning, and Database systems]
Data Mining Functionalities
• Concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics
• Association (correlation and causality)
– Diaper → Beer [0.5%, 75%]
• Classification and Prediction
– Construct models (functions) that describe and distinguish classes or
concepts for future prediction
– Presentation: decision-tree, classification rule, neural network
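For the association example above, a rule's two bracketed numbers can be computed directly from transactions. This sketch assumes the pair is read as [support, confidence], the usual convention for association rules; the function name `rule_metrics` and the five toy transactions are hypothetical and do not reproduce the 0.5% and 75% figures from any real dataset.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions containing the left-hand side of the rule.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions containing both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Toy market-basket data (made-up values).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

support, confidence = rule_metrics(transactions, {"diaper"}, {"beer"})
print(f"Diaper -> Beer [{support:.0%}, {confidence:.0%}]")
```

Here 3 of the 5 transactions contain both items (support 60%), and 3 of the 4 transactions with diapers also contain beer (confidence 75%).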
Data Mining Functionalities
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
– Maximizing intra-class similarity & minimizing inter-class similarity
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of
the data
– Useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
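As one simple instance of outlier analysis, a value can be flagged when it lies far from the mean in units of standard deviation. This is a minimal sketch, not the only or the best method: the z-score rule, the threshold of 2.0, the name `zscore_outliers`, and the sample amounts are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]

outliers = zscore_outliers(amounts)
print(outliers)
```

On a sample this small a single extreme value inflates the standard deviation, so its z-score stays below 3; that is why the sketch uses a lower threshold. More robust measures, such as ones based on the median, avoid this effect.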
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the confluence of Database Systems, Statistics, Machine Learning, Visualization, Algorithms, and Other Disciplines]