Data Mining:
Concepts and Techniques
Introduction
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific
simulation, …
Society and everyone: news, digital cameras, …
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
Evolution of Database
Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information
systems
What Is Data Mining?
Data mining (knowledge discovery from
data)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amounts of
data
Alternative name
Knowledge discovery in databases (KDD)
Watch out: Is everything “data mining”? No. For example, these are not:
Query processing
Expert systems or statistical programs
Why Data Mining?—Potential
Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship
management (CRM), market basket analysis,
market segmentation
Risk analysis and management
Forecasting, customer retention, quality
control, competitive analysis
Fraud detection and detection of unusual
patterns (outliers)
Why Data Mining?—Potential
Applications
Other Applications
Text mining (news group, email, documents)
and Web mining
Stream data mining
Bioinformatics and bio-data analysis
Market Analysis and
Management
Where does the data come from?
Credit card transactions, discount coupons,
customer complaint calls
Target marketing
Find clusters of “model” customers who share
the same characteristics: interest, income level,
spending habits, etc.
Determine customer purchasing patterns over
time
Market Analysis and
Management
Cross-market analysis
Associations and correlations between product sales,
and prediction based on such associations
Customer profiling
What types of customers buy what products
Customer requirement analysis
Identifying the best products for different
customers
Predict what factors will attract new customers
Fraud Detection & Mining Unusual
Patterns
Approaches: Clustering & model construction for frauds,
outlier analysis
Applications: Health care, retail, credit card service,
telecomm.
Medical insurance
Professional patients and rings of doctors
Unnecessary or correlated screening tests
Telecommunications:
Phone call model: destination of the call, duration, time
of day or week. Analyze patterns that deviate from an
expected norm
Retail industry
Analysts estimate that 38% of retail shrink is due to
dishonest employees
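The telecommunications bullet above describes flagging calls that deviate from an expected norm. A minimal sketch of that idea follows, assuming a simple z-score rule on call duration alone. The function name `flag_unusual_calls`, the threshold of 2.0, and the sample durations are illustrative assumptions, not part of the original slides; a real fraud model would also use destination and time of day.

```python
from statistics import mean, stdev

def flag_unusual_calls(durations, threshold=2.0):
    """Return indices of calls whose duration deviates from the mean
    by more than `threshold` sample standard deviations."""
    mu = mean(durations)
    sigma = stdev(durations)
    if sigma == 0:
        return []  # all calls identical: nothing deviates
    return [i for i, d in enumerate(durations)
            if abs(d - mu) / sigma > threshold]

# Hypothetical call durations in minutes for one subscriber.
call_minutes = [3, 4, 2, 5, 3, 4, 3, 95, 4, 2]
unusual = flag_unusual_calls(call_minutes)
print(unusual)  # index of the 95-minute call
```

With only ten observations a single extreme value also inflates the standard deviation, which is why the sketch uses a modest threshold rather than the more common value of 3.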
Other Applications
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preferences and behavior,
analyze the effectiveness of Web marketing,
improve Web site organization, etc.
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process
[Figure: KDD process pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
Learning the application domain
Relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction.
Choosing functions of data mining
Summarization, classification, regression, association,
clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
Visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
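The steps above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions, not a prescribed implementation: the record layout, the function names (`clean`, `select`, `transform`, `mine`, `evaluate`), and the 0.7 support threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, item, quantity); None marks missing data.
raw_records = [
    (1, "milk", 2), (1, "bread", 1), (2, "milk", 1),
    (2, None, 3), (3, "bread", -1), (3, "milk", 1), (3, "bread", 2),
]

def clean(records):
    """Data cleaning: drop records with a missing item or an invalid quantity."""
    return [r for r in records if r[1] is not None and r[2] > 0]

def select(records, items_of_interest):
    """Data selection: keep only the task-relevant items."""
    return [r for r in records if r[1] in items_of_interest]

def transform(records):
    """Transformation: reduce the records to one basket (set of items) per customer."""
    baskets = {}
    for customer, item, _quantity in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Mining: count how many baskets contain each item."""
    return Counter(item for basket in baskets.values() for item in basket)

def evaluate(counts, n_baskets, min_support):
    """Pattern evaluation: keep items whose support reaches the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

baskets = transform(select(clean(raw_records), {"milk", "bread"}))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.7)
print(patterns)
```

Keeping each stage as a separate function mirrors the slide's point that mining itself is only one step; most of the code, like most of the effort, goes into preparing the data.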
Architecture: Typical Data Mining
System
[Figure: layered architecture, listed top to bottom]
Graphical user interface
Pattern evaluation
Data mining engine
Database or data warehouse server
Data cleaning & data integration; filtering
Databases; data warehouse
Knowledge base (drawn beside the layers)
Data Mining: On What Kinds of
Data?
Relational database
Data warehouse
Transactional database
Advanced database and information repository
Spatial and temporal data
Time-series data
Stream data
Multimedia database
Text databases & WWW
Data Mining Functionalities
Concept description: Characterization and
discrimination
Generalize, summarize, and contrast data characteristics
Association (correlation and causality)
Diaper → Beer [support 0.5%, confidence 75%]
Classification and Prediction
Construct models (functions) that describe and distinguish
classes or concepts for future prediction
Presentation: decision-tree, classification rule, neural
network
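The two bracketed numbers attached to the Diaper/Beer association rule above are its support and confidence. A minimal sketch of how they are computed follows. The five transactions are made up for illustration, so the resulting values differ from the 0.5% and 75% quoted on the slide, and the helper names are assumptions.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    n_antecedent = count(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0  # rule never fires; treat confidence as zero
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / n_antecedent

# Hypothetical market baskets.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]
rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})
print(rule_support, rule_confidence)
```

Here three of five baskets contain both items, and three of the four baskets with diapers also contain beer.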
Data Mining Functionalities
Cluster analysis
Class label is unknown: Group data to form new classes,
e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass
similarity
Outlier analysis
Outlier: a data object that does not comply with the
general behavior of the data
Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
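As an illustration of the cluster analysis item on this slide, where data with no class label is grouped so that members of a group are similar, here is a minimal k-means sketch. K-means is only one of many clustering methods and is chosen here as an assumption; the house coordinates and the starting centroids are hypothetical.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ended up empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical house locations (x, y) forming two visible groups.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(houses, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

Fixed starting centroids keep the example deterministic; practical implementations usually pick them randomly and rerun several times.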
Are All the “Discovered” Patterns
Interesting?
Data mining may generate thousands of patterns: Not all of
them are interesting
Suggested approach: Human-centered, query-based, focused
mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid
on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to
confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty.
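Objective measures such as support and confidence can be used to prune the flood of candidate patterns. The sketch below enumerates every single-item rule A → B over a few made-up baskets and keeps only those that pass both thresholds. The function name `interesting_rules`, the data, and the threshold values are assumptions for illustration.

```python
from itertools import permutations

def interesting_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for each
    single-item rule A -> B that meets both thresholds."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    rules = []
    for a, b in permutations(items, 2):
        n_a = sum(1 for t in transactions if a in t)
        n_ab = sum(1 for t in transactions if a in t and b in t)
        if n_a == 0:
            continue  # antecedent never occurs
        sup, conf = n_ab / n, n_ab / n_a
        if sup >= min_support and conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules

# Hypothetical baskets: six candidate rules, only one survives the filter.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"jam"},
]
rules = interesting_rules(baskets, min_support=0.5, min_confidence=0.8)
print(rules)
```

Subjective measures such as unexpectedness depend on what the user already believes, so they cannot be computed from the data alone in this way.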
Data Mining: Confluence of Multiple
Disciplines
[Figure: data mining at the center, drawing on database systems, statistics, machine learning, visualization, algorithms, and other disciplines]
Data Mining: Classification
Schemes
Different views, different classifications
Kinds of data to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
Multi-Dimensional View of Data
Mining
Data to be mined
Relational, data warehouse, transactional,
stream, object-oriented/relational, active,
spatial, time-series, text, multi-media,
heterogeneous, WWW
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier
analysis, etc.
Multiple/integrated functions and mining at
multiple levels
Multi-Dimensional View of Data
Mining
Techniques utilized
Database-oriented, data warehouse (OLAP),
machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud
analysis, bio-data mining, stock market
analysis, Web mining, etc.
OLAP Mining: Integration of Data Mining and Data
Warehousing
Data mining systems, DBMS, Data warehouse
systems coupling
On-line analytical mining of data
Integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
Necessity of mining knowledge and patterns at different
levels of abstraction.
Integration of multiple mining functions
Characterized classification, first clustering and then
association
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from
diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and
scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining
methods
Integration of the discovered knowledge with
existing knowledge: knowledge fusion
Major Issues in Data Mining
User interaction
Data mining query languages and ad-hoc
mining
Expression and visualization of data mining
results
Interactive mining of knowledge at multiple
levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data
mining
Protection of data security, integrity, and
privacy
Summary
Data mining: discovering interesting patterns from large
amounts of data
A natural evolution of database technology, in great demand,
with wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis,
etc.
Data mining systems and architectures
Major issues in data mining
Where to Find References?
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
Data mining and KDD
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery, KDD Explorations
Database systems
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
Journals: Machine Learning, Artificial Intelligence, etc.
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. visualization and computer graphics, etc.