Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific
simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the
automated analysis of massive data sets
July 31, 2025 Data Mining: Concepts and Techniques 2
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
Data mining as a step in the process of knowledge
discovery
Knowledge Discovery Process
Data cleaning: removes noise and
inconsistent data
Data integration: combines multiple data
sources
Data selection: retrieves the data relevant to the
analysis task from the database
Data transformation: transforms and
consolidates data into forms
appropriate for mining by performing
summary or aggregation operations
Knowledge Discovery Process (continued)
Data mining: an essential process where
intelligent methods are applied to extract
data patterns
Pattern evaluation: identifies the truly
interesting patterns representing
knowledge, based on interestingness measures
Knowledge presentation: uses
visualization and knowledge representation
techniques to present mined
knowledge to users
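The preprocessing steps of the process (cleaning, integration, selection, transformation) can be sketched as a small pipeline. This is a minimal illustration only: the records, field names, and function names are hypothetical, and real systems perform these steps inside a database or data warehouse.

```python
from collections import defaultdict

# Hypothetical sales records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "item": "laptop", "amount": 900.0},
    {"customer": "c2", "item": "camera", "amount": None},
]
store_b = [
    {"customer": "c1", "item": "memory card", "amount": 20.0},
    {"customer": "c3", "item": "laptop", "amount": 1100.0},
]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, relevant_items):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["item"] in relevant_items]

def transform(records):
    """Data transformation: aggregate total spending per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)

def prepare(sources, relevant_items):
    """Run the four preprocessing steps in order."""
    cleaned = [clean(source) for source in sources]
    return transform(select(integrate(cleaned), relevant_items))

spending = prepare([store_a, store_b], {"laptop", "memory card"})
```

The resulting per-customer totals are the kind of consolidated input that the data mining step would then work on.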
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
What Kinds of Patterns Can Be Mined?
Data mining functionalities are:
Characterization and Discrimination
Mining of Frequent Patterns, Associations,
and Correlations
Classification and Regression for Predictive
Analysis
Cluster Analysis
Outlier Analysis
Two categories:
Descriptive
Predictive
Class/Concept Description:
Characterization and Discrimination
Data characterization: summarization of
the general characteristics or features of a
target class of data
Example: Summarize the characteristics of
customers who spend more than $5000 a
year
The result is a general profile of these
customers, such as that they are 40 to 50
years old, employed, and have excellent
credit ratings.
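A characterization query of this kind can be sketched in a few lines. The customer records, field names, and the `characterize` function below are hypothetical and serve only to illustrate how a target class is selected and then summarized.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records (not taken from any real data set).
customers = [
    {"age": 45, "employed": True, "credit": "excellent", "yearly_spend": 6200},
    {"age": 48, "employed": True, "credit": "excellent", "yearly_spend": 7500},
    {"age": 42, "employed": True, "credit": "good", "yearly_spend": 5100},
    {"age": 23, "employed": False, "credit": "fair", "yearly_spend": 800},
]

def characterize(rows, min_spend):
    """Summarize the target class: customers spending more than min_spend."""
    target = [r for r in rows if r["yearly_spend"] > min_spend]
    if not target:
        return None  # no customer belongs to the target class
    credit_counts = Counter(r["credit"] for r in target)
    return {
        "count": len(target),
        "mean_age": mean(r["age"] for r in target),
        "employed_share": sum(r["employed"] for r in target) / len(target),
        "most_common_credit": credit_counts.most_common(1)[0][0],
    }

profile = characterize(customers, 5000)
```

The returned dictionary plays the role of the general profile described above.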
Data Discrimination
It is a comparison of the general features of
the target class data objects against the
general features of objects from one or
multiple contrasting classes.
Example: Compare two groups of customers:
those who shop for computer products
regularly and those who rarely shop for
such products
The resulting description provides a general
comparative profile of these customers
Mining Frequent Patterns, Associations
and Correlations
Frequent patterns: patterns that occur
frequently in data.
Frequent itemset: set of items that often
appear together in a transactional data set.
Frequent sequential patterns: a frequently
occurring subsequence, such as the pattern
that customers tend to purchase first a
laptop, followed by a digital camera, and
then a memory card.
Frequent structured patterns
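To make the notion of a frequent itemset concrete, here is a minimal brute-force sketch that counts every itemset up to a given size. It is for illustration on tiny data only; it is not an efficient mining algorithm, and the function name, parameters, and transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return itemsets (as sorted tuples) whose relative support is at
    least min_support, by counting all itemsets up to max_size items."""
    n = len(transactions)
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicate items
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Hypothetical transactional data set.
transactions = [
    ["laptop", "camera"],
    ["laptop", "camera", "memory card"],
    ["laptop"],
    ["camera"],
]
frequent = frequent_itemsets(transactions, min_support=0.5)
```

With a minimum support of 50%, the single items laptop and camera and the pair (camera, laptop) qualify, while memory card does not.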
Mining Frequent Patterns, Associations
and Correlations (continued)
Example rule: computer ⇒ antivirus software
[support = 2%, confidence = 60%]
The rule A ⇒ B holds in the transaction set D
with support s, where s is the percentage of
transactions in D that contain A ∪ B (the union
of sets A and B, i.e., every item in
A and B). This is taken to be the probability P(A ∪ B):
Support(A ⇒ B) = P(A ∪ B)
The rule A ⇒ B has confidence c in the
transaction set D, where c is the percentage
of transactions in D containing A that also
contain B. This is taken to be the conditional
probability P(B|A):
Confidence(A ⇒ B) = P(B|A)
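The two measures can be computed directly from their definitions. The sketch below assumes each transaction is a collection of item names; the helper names and the sample data are hypothetical.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))

def support(transactions, a, b):
    """Support(A => B): fraction of transactions containing all of A and B."""
    return count_containing(transactions, set(a) | set(b)) / len(transactions)

def confidence(transactions, a, b):
    """Confidence(A => B): among transactions containing A, the fraction
    that also contain B. Returns 0.0 if no transaction contains A."""
    count_a = count_containing(transactions, a)
    if count_a == 0:
        return 0.0
    return count_containing(transactions, set(a) | set(b)) / count_a

# Hypothetical transaction set D.
D = [
    {"computer", "antivirus software"},
    {"computer", "antivirus software", "printer"},
    {"computer", "antivirus software"},
    {"computer"},
    {"printer"},
]
s = support(D, {"computer"}, {"antivirus software"})
c = confidence(D, {"computer"}, {"antivirus software"})
```

Here three of the five transactions contain both items, and three of the four transactions that contain a computer also contain antivirus software.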
Mining Frequent Patterns, Associations
and Correlations (continued)
Single-dimensional association rule:
buys(X, "computer") ⇒ buys(X, "software")
[support = 1%, confidence = 50%]
Multidimensional association rule:
age(X, "20..29") ∧ income(X, "40K..49K") ⇒
buys(X, "laptop")
[support = 2%, confidence = 60%]
Classification and Regression for Predictive
Analysis
Classification is the process of finding a
model that describes and distinguishes data
classes or concepts
The model is derived based on the
analysis of a set of training data
The model is used to predict the class label
of objects for which the class label is unknown.
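As a minimal illustration of deriving a model from training data and using it on unlabeled objects, the sketch below learns one IF-THEN rule per value of a single attribute (each value predicts its majority class). This deliberately simple one-attribute learner is an assumption chosen for brevity, not a method prescribed here; the attribute names and data are hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(rows, attribute, label):
    """Learn 'IF attribute = value THEN class' rules from training data.
    Each value predicts the majority class seen with it; unseen values
    fall back to the overall majority class."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attribute]][row[label]] += 1
    rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    default = Counter(r[label] for r in rows).most_common(1)[0][0]
    return rules, default

def predict(model, value):
    """Predict the class label for an object with the given attribute value."""
    rules, default = model
    return rules.get(value, default)

# Hypothetical training data with known class labels.
training = [
    {"income": "high", "buys_computer": "yes"},
    {"income": "high", "buys_computer": "yes"},
    {"income": "low", "buys_computer": "no"},
    {"income": "low", "buys_computer": "no"},
    {"income": "low", "buys_computer": "yes"},
    {"income": "medium", "buys_computer": "yes"},
]
model = train_one_rule(training, "income", "buys_computer")
```

Calling `predict(model, "low")` then returns the label learned for low-income customers, and a value never seen in training receives the overall majority label.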
Classification and Regression for Predictive
Analysis (continued)
The derived model may be represented in
various forms:
Decision Trees
IF-THEN Rules
Neural Networks
Decision Tree
IF-THEN Rules
Neural Networks
Regression Models
Regression is used to predict missing or
unavailable numerical data values rather
than discrete class labels
Regression analysis is a statistical
methodology that is most often used for
numerical prediction
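One common form of regression analysis is a least-squares straight-line fit. The sketch below is a minimal example under that assumption; the function names and data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.
    Assumes at least two points with distinct x values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_value(line, x):
    """Predict a numerical value for x from a fitted line."""
    slope, intercept = line
    return slope * x + intercept

# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
line = fit_line(xs, ys)
missing = predict_value(line, 5.0)  # estimate an unavailable value
```

Unlike classification, the prediction here is a number on a continuous scale rather than a discrete class label.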
Cluster Analysis
Cluster Analysis (continued)
Unsupervised learning (i.e., the class label is
unknown)
Group data to form new categories (i.e.,
clusters), e.g., cluster houses to find
distribution patterns
Principle: maximize intra-class similarity and
minimize inter-class similarity
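The grouping principle can be illustrated with a small k-means-style sketch on one-dimensional data. k-means is used here only as one well-known clustering method; the house prices and starting centroids are hypothetical, and the starting centroids are passed in explicitly so the result is deterministic.

```python
def kmeans_1d(points, initial_centroids, iterations=10):
    """Cluster 1-D points: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical house prices (in thousands); no class labels are given.
prices = [100, 110, 120, 300, 310, 320]
centroids, clusters = kmeans_1d(prices, initial_centroids=[90, 200])
```

Points within each resulting cluster lie close to one another, while the two clusters are far apart.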
Outlier Analysis
Outlier Analysis (continued)
Outlier: A data object that does not comply
with the general behavior of the data
Noise or exception? One person’s
garbage could be another person’s treasure
Useful in fraud detection and rare-event
analysis
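A simple way to flag objects that do not comply with the general behavior of the data is a z-score test. This is one statistical approach among many, shown as a minimal sketch; the threshold and the transaction amounts are hypothetical.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds
    threshold population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is unusually large.
amounts = [10, 12, 11, 13, 12, 95]
suspicious = zscore_outliers(amounts)
```

Whether the flagged amount is noise to discard or an exception worth investigating, such as possible fraud, depends on the application.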
Are All Patterns Interesting?
Data mining may generate thousands of
patterns, but not all of them are interesting.
A pattern is interesting if it is
easily understood by humans
valid on new or test data with some
degree of certainty
potentially useful
novel
A pattern is also interesting if it validates a
hypothesis that the user seeks to confirm
An interesting pattern represents
knowledge
Are All Patterns Interesting? (continued)
Objective measures:
These are based on the structure of
discovered patterns and the statistics
underlying them.
E.g., support, confidence, etc.
(Rules that do not satisfy a threshold are
considered uninteresting)
Subjective measures:
Reflect the needs and interests of a particular
user
Based on the user's beliefs about the data
Objective and subjective measures need to
be combined
Are All Patterns Interesting? (continued)
Find all the interesting
patterns: completeness
Unrealistic and inefficient
User-provided constraints and interestingness
measures should be used.
Search for only interesting patterns: an
optimization problem
Highly desirable
No need to search through the generated
patterns to identify truly interesting ones
Measures can be used to rank the
discovered patterns according to their
interestingness
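Filtering and ranking with objective measures can be sketched as follows. The rule records, the thresholds, and the choice to rank by confidence first and support second are all hypothetical; a real system would combine such objective measures with subjective, user-specific ones.

```python
def rank_rules(rules, min_support, min_confidence):
    """Drop rules below either threshold, then rank the rest by
    confidence and, for equal confidence, by support (highest first)."""
    kept = [r for r in rules
            if r["support"] >= min_support
            and r["confidence"] >= min_confidence]
    return sorted(kept,
                  key=lambda r: (r["confidence"], r["support"]),
                  reverse=True)

# Hypothetical mined rules with their objective measures.
rules = [
    {"rule": "computer => antivirus software", "support": 0.02, "confidence": 0.60},
    {"rule": "computer => software", "support": 0.01, "confidence": 0.50},
    {"rule": "laptop => memory card", "support": 0.005, "confidence": 0.90},
    {"rule": "camera => memory card", "support": 0.03, "confidence": 0.70},
]
ranked = rank_rules(rules, min_support=0.01, min_confidence=0.55)
```

Rules that miss either threshold are treated as uninteresting and never reach the user; the remaining ones are presented in order of interestingness.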
Which Technologies Are Used?
What Kinds of Applications Are Targeted?
Business intelligence
Web Search Engines
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining
methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining