5/3/2021
Introduction to Data Mining
Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining Definition
• Finding hidden information in a database
• Fit data to a model
• Similar terms
• Exploratory data analysis
• Data driven discovery
• Deductive learning
Data Mining Algorithm
• Objective: Fit Data to a Model
• Descriptive
• Predictive
• Preference – Technique to choose the best model
• Search – Technique to search the data
• “Query”
Database Processing vs. Data Mining Processing
          Database Processing       Data Mining Processing
• Query   Well defined              Poorly defined
          SQL                       No precise query language
• Data    Operational data          Not operational data
• Output  Precise                   Fuzzy
          Subset of database        Not a subset of database
Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (Classification)
– Identify customers with similar buying habits. (Clustering)
– Find all items which are frequently purchased with milk. (Association rules)
Basic Data Mining Tasks
• Classification maps data into predefined groups or
classes
• Supervised learning
• Prediction
• Regression
• Clustering groups similar data together into
clusters.
• Unsupervised learning
• Segmentation
• Partitioning
Basic Data Mining Tasks (cont’d)
• Link Analysis uncovers relationships among data.
• Affinity Analysis
• Association Rules
• Sequential Analysis determines sequential patterns.
CLASSIFICATION
• Assign data into predefined groups or classes.
But it isn’t Magic
• You must know what you are looking for
• You must know how to look for it
Suppose you knew that a specific cave had gold:
What would you look for?
How would you look for it?
Might need an expert miner
“If it looks like a duck,
walks like a duck, and
quacks like a duck, then
it’s a duck.”
“If it looks like a terrorist,
walks like a terrorist, and
quacks like a terrorist, then
it’s a terrorist.”
Description Behavior Associations
Classification Clustering Link Analysis
(Profiling) (Similarity)
Classification Ex: Grading
Decision tree on the score x:
• x >= 90: A
• 80 <= x < 90: B
• 70 <= x < 80: C
• 60 <= x < 70: D
• x < 60: F
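The grading tree can be sketched as a chain of threshold tests. This is a minimal illustration, not the course's own code; the function name `grade` is hypothetical, and the last split is assumed to fall at 60 for both branches.

```python
def grade(x: float) -> str:
    """Classify a numeric score into a letter grade by walking the tree top-down."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:  # assumption: the final split is at 60
        return "D"
    return "F"


if __name__ == "__main__":
    for score in (95, 85, 72, 60, 40):
        print(score, grade(score))
```

Each test corresponds to one internal node of the tree, and each `return` to a leaf, so the classes are predefined before any data is seen.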
Given a collection of annotated data (in this case five instances of
Katydids and five of Grasshoppers), decide what type of insect the
unlabeled example is.
[Figure: labeled examples of Katydids and Grasshoppers]
The classification problem can now be expressed as: given a training
database, predict the class label of a previously unseen instance.

Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid

Previously unseen instance:
11          5.1              7.0               ???????
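One simple way to label the unseen instance is a k-nearest-neighbor vote over the training database. The slides do not name a specific classifier, so this is only a sketch of one option; `knn_predict` and `k=3` are illustrative choices.

```python
import math
from collections import Counter

# Training database from the slide: (abdomen length, antennae length, class).
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def knn_predict(abdomen, antennae, k=3):
    """Return the majority class among the k training rows closest in Euclidean distance."""
    query = (abdomen, antennae)
    ranked = sorted(TRAINING, key=lambda row: math.dist(query, row[:2]))
    votes = Counter(label for _, _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict(5.1, 7.0))  # instance 11 -> Katydid
```

For instance 11, the three closest training rows are insects 7, 5, and 1, so the vote is two Katydids to one Grasshopper.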
[Figure: scatter plot of Antenna Length (1 to 10) against Abdomen Length (1 to 10) for Grasshoppers and Katydids]
Facial Recognition
Handwriting Recognition
[Figure: George Washington Manuscript]
Anomaly Detection
CLUSTERING
• Partition data into previously undefined groups.
What is Similarity?
Two Types of Clustering
• Hierarchical
• Partitional
Hierarchical Clustering Example
Iris Data Set
[Figure: dendrogram with clusters labeled Setosa, Versicolor, Virginica]
http://www.time.com/time/magazine/article/0,9171,1541283,00.html
Microarray Data Analysis
• Each probe location associated with gene
• Color indicates degree of gene expression
• Compare different samples (normal/disease)
• Track same sample over time
• Questions
• Which genes are related to this disease?
• Which genes behave in a similar manner?
• What is the function of a gene?
• Clustering
• Hierarchical
• K-means
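As a rough illustration of the K-means option, the sketch below groups genes whose expression levels behave similarly. The expression values and the starting centroids are hypothetical, and the function is a plain version of Lloyd's algorithm rather than any particular tool's implementation.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged: assignments can no longer change
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical expression levels of six genes in two samples (normal, disease).
GENES = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.3, 4.8), (4.9, 5.1)]

if __name__ == "__main__":
    centers, groups = kmeans(GENES, centroids=[GENES[0], GENES[3]])
    print(centers)
    print(groups)
```

Unlike hierarchical clustering, the number of clusters (here two, via the two starting centroids) has to be chosen up front.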
Microarray Data - Clustering
"Gene expression profiling identifies clinically relevant subtypes of
prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816,
January 20, 2004.
ASSOCIATION RULES/LINK ANALYSIS
• Find relationships between data
ASSOCIATION RULES EXAMPLES
• People who buy diapers also buy beer
• If gene A is highly expressed in this disease then
gene B is also expressed
• Relationships between people
• Book Stores
• Department Stores
• Advertising
• Product Placement
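A rule such as "diapers => beer" is usually scored by its support and confidence. The slides do not define these measures, so the sketch below uses the standard definitions; the transactions are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets, one set of items per purchase.
TRANSACTIONS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]

if __name__ == "__main__":
    print(support(TRANSACTIONS, {"diapers", "beer"}))       # 2 of 5 baskets
    print(confidence(TRANSACTIONS, {"diapers"}, {"beer"}))  # 2 of the 3 diaper baskets
```

Association rule algorithms search for all rules whose support and confidence exceed chosen thresholds, instead of checking one rule at a time as done here.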
Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,”
Dallas Morning News, June 4, 2007.
No/Little Cheating
Rampant Cheating
Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen,
“Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad
Network,” Lecture Notes in Computer Science, Publisher: Springer GmbH,
Volume 3495 / 2005, p. 287.
Ex: Stock Market Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior
Ex: Stock Market Analysis
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process
of finding useful information and patterns in data.
• Data Mining: Use of algorithms to extract the
information and patterns derived by the KDD
process.
KDD Process
Modified from [FPSS96C]
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format.
Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in
meaningful manner.
KDD Process Ex: Web Log
• Selection:
• Select log data (dates and locations) to use
• Preprocessing:
• Remove identifying URLs; Remove error logs
• Transformation:
• Sessionize (sort and group)
• Data Mining:
• Identify and count patterns; Construct data structure
• Interpretation/Evaluation:
• Identify and display frequently accessed sequences.
• Potential User Applications:
• Cache prediction
• Personalization
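The transformation and data mining steps of the web log example can be sketched as follows. The log records, the 30-minute session timeout, and the function names are all hypothetical; selection and preprocessing are assumed to have already produced the cleaned records.

```python
from collections import Counter
from itertools import groupby

# Hypothetical, already-cleaned log records: (visitor, seconds since start, page).
LOG = [
    ("u1", 0, "/home"), ("u1", 30, "/products"), ("u1", 70, "/cart"),
    ("u2", 10, "/home"), ("u2", 50, "/products"),
    ("u1", 5000, "/home"), ("u1", 5020, "/products"),
]


def sessionize(log, timeout=1800):
    """Sort by visitor and time, then split each visitor's hits on gaps longer than timeout."""
    sessions = []
    for _, hits in groupby(sorted(log), key=lambda rec: rec[0]):
        current, last_time = [], None
        for _, when, page in hits:
            if last_time is not None and when - last_time > timeout:
                sessions.append(current)
                current = []
            current.append(page)
            last_time = when
        sessions.append(current)
    return sessions


def count_sequences(sessions, length=2):
    """Count every run of `length` consecutive pages across all sessions."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts


if __name__ == "__main__":
    sessions = sessionize(LOG)
    for sequence, n in count_sequences(sessions).most_common():
        print(n, " -> ".join(sequence))
```

The most frequent sequences are what the interpretation step would display, and what a cache could use to prefetch the likely next page.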
Related Topics
• Databases
• OLTP
• OLAP
• Information Retrieval
DB & OLTP Systems
• Schema
• (ID,Name,Address,Salary,JobNo)
• Data Model
• ER
• Relational
• Transaction
• Query:
SELECT Name
FROM T
WHERE Salary > 100000
DM: Only imprecise queries
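To see how precise the database query is, it can be run against the schema above with Python's built-in `sqlite3` module. The three rows are hypothetical, and the `ORDER BY` clause is added only to make the output order predictable.

```python
import sqlite3

# In-memory database with the slide's schema (ID, Name, Address, Salary, JobNo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (ID INTEGER, Name TEXT, Address TEXT, Salary REAL, JobNo INTEGER)"
)
# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "12 Oak St", 120000, 10),
        (2, "Raj", "9 Elm Ave", 85000, 20),
        (3, "Lee", "4 Pine Rd", 101000, 10),
    ],
)

# The query from the slide: the result is an exact subset of the stored data.
names = [
    row[0]
    for row in conn.execute("SELECT Name FROM T WHERE Salary > 100000 ORDER BY Name")
]
print(names)  # ['Ann', 'Lee']
```

Every returned name satisfies the condition exactly, which is the contrast with a data mining request such as "find employees likely to leave".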
Classification/Prediction is Fuzzy
[Figure: two panels, Simple and Fuzzy, each showing Accept and Reject regions plotted against Loan Amnt]
Information Retrieval
• Information Retrieval (IR): retrieving desired information
from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query:
Find all documents about “data mining”.
DM: Similarity measures;
Mine text/Web data.
Information Retrieval (cont’d)
• Similarity: measure of how close a query is to a
document.
• Documents which are “close enough” are retrieved.
• Metrics:
• Precision = |Relevant and Retrieved| / |Retrieved|
• Recall = |Relevant and Retrieved| / |Relevant|
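Both metrics reduce to set arithmetic over document identifiers. The sketch below assumes documents are identified by strings; the function name and the sample sets are hypothetical.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for sets of relevant and retrieved document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |Relevant and Retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    relevant = {"d1", "d2", "d3", "d4"}
    retrieved = {"d2", "d3", "d5"}
    p, r = precision_recall(relevant, retrieved)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

Returning more documents tends to raise recall and lower precision, so the two are normally reported together.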
IR Query Result Measures and Classification
[Figure: two panels, IR and Classification]
OLAP
• Online Analytic Processing (OLAP): provides more complex
queries than OLTP.
• OnLine Transaction Processing (OLTP): traditional
database/transaction processing.
• Dimensional data; cube view
• Visualization of operations:
• Slice: examine a sub-cube by selecting on one dimension.
• Dice: select on two or more dimensions (e.g., slice, then rotate the
cube to look at another dimension).
• Roll Up/Drill Down
DM: May use OLAP queries.
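The cube operations can be imitated on a small dictionary keyed by dimension values. The sales figures, dimension names, and helper functions are hypothetical, and "dice" is taken here to mean selecting on two or more dimensions, which is an assumption about the intended definition.

```python
from collections import defaultdict

# Hypothetical sales cube: (product, region, quarter) -> units sold.
CUBE = {
    ("milk", "north", "Q1"): 10, ("milk", "north", "Q2"): 12,
    ("milk", "south", "Q1"): 7, ("milk", "south", "Q2"): 9,
    ("beer", "north", "Q1"): 5, ("beer", "north", "Q2"): 8,
    ("beer", "south", "Q1"): 6, ("beer", "south", "Q2"): 4,
}
DIMS = ("product", "region", "quarter")


def select(cube, **fixed):
    """Keep the cells whose coordinates match every fixed dimension value."""
    idx = {DIMS.index(name): value for name, value in fixed.items()}
    return {k: v for k, v in cube.items() if all(k[i] == val for i, val in idx.items())}


def roll_up(cube, keep):
    """Sum over every dimension not listed in `keep`."""
    positions = [DIMS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)


if __name__ == "__main__":
    print(select(CUBE, quarter="Q1"))                   # slice: one dimension fixed
    print(select(CUBE, product="milk", region="north")) # dice: two dimensions fixed
    print(roll_up(CUBE, ["product"]))                   # roll up to product totals
```

Drill down is the reverse of `roll_up`: returning from the product totals to the finer cells that produced them.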
DM vs. Related Topics
Area      Query      Data               Results   Output
DB/OLTP   Precise    Database           Precise   DB Objects or Aggregation
IR        Precise    Documents          Vague     Documents
OLAP      Analysis   Multidimensional   Precise   DB Objects or Aggregation
DM        Vague      Preprocessed       Vague     KDD Objects
Data Mining Development
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques

• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines

• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis

• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures

• Neural Networks
• Decision Tree Algorithms
KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
Overfitting
• Suppose we want to predict whether an individual is short,
medium, or tall. What is wrong with this data?
Name Gender Height Output
Mary F 1.6 Short
Maggie F 1.9 Medium
Martha F 1.88 Medium
Stephanie F 1.7 Short
Bob M 1.85 Medium
Kathy F 1.6 Short
George M 1.7 Short
Debbie F 1.8 Medium
Todd M 1.95 Medium
Kim F 1.9 Medium
Amy F 1.8 Medium
Wynette F 1.75 Medium
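A quick tally of the table is one way to start answering the question. The sketch below only counts what is in the table; reading the missing class as the source of the problem is an interpretation, not something the slide states.

```python
from collections import Counter

# The training table from the slide: (name, gender, height in meters, output).
DATA = [
    ("Mary", "F", 1.6, "Short"), ("Maggie", "F", 1.9, "Medium"),
    ("Martha", "F", 1.88, "Medium"), ("Stephanie", "F", 1.7, "Short"),
    ("Bob", "M", 1.85, "Medium"), ("Kathy", "F", 1.6, "Short"),
    ("George", "M", 1.7, "Short"), ("Debbie", "F", 1.8, "Medium"),
    ("Todd", "M", 1.95, "Medium"), ("Kim", "F", 1.9, "Medium"),
    ("Amy", "F", 1.8, "Medium"), ("Wynette", "F", 1.75, "Medium"),
]

label_counts = Counter(output for *_, output in DATA)
gender_counts = Counter(gender for _, gender, *_ in DATA)
max_height = max(height for _, _, height, _ in DATA)

if __name__ == "__main__":
    print(label_counts)   # no "Tall" rows at all
    print(gender_counts)  # far more F than M rows
    print(max_height)
```

A model fit closely to these twelve rows has no example of the Tall class, so it may describe this sample well and still generalize poorly to new individuals.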
KDD Issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
WARNING
• With data mining you don’t always know what you
are looking for.
• There is not one right answer.
• The data you are using is noisy
• Data Mining is a very applied discipline.
• A data mining course provides you tools to use to
analyze data.
• Experience provides you knowledge of how to use
these tools.
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
• Invalid results and claims
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
•…
• Space/Time
Visualization Techniques
• Graphical
• Geometric
• Icon-based
• Pixel-based
• Hierarchical
• Hybrid
Models Based on Summarization
• Visualization: Frequency distribution, mean,
variance, median, mode, etc.
• Box Plot:
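The summary statistics, and the five numbers a box plot is drawn from, can be computed with Python's standard `statistics` module. The sketch reuses the heights from the Overfitting table as sample data; the quartiles use the module's default method, and other tools may compute them slightly differently.

```python
import statistics

# Heights (in meters) taken from the Overfitting table.
heights = [1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75]

summary = {
    "mean": statistics.mean(heights),
    "variance": statistics.variance(heights),  # sample variance
    "median": statistics.median(heights),
    "modes": statistics.multimode(heights),    # every value tied for most frequent
}

# Five-number summary behind a box plot: min, Q1, median, Q3, max.
q1, q2, q3 = statistics.quantiles(heights, n=4)
box_plot = (min(heights), q1, q2, q3, max(heights))

if __name__ == "__main__":
    for name, value in summary.items():
        print(name, value)
    print("box plot:", box_plot)
```

The box spans Q1 to Q3 with a line at the median, and the whiskers extend toward the minimum and maximum.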
DM Tools
• XLMiner – Easy add-in to Excel
http://www.solver.com/xlminer/index.html
• Weka – Open Source; Visualization, Functionality,
Interface
http://www.cs.waikato.ac.nz/ml/weka/
• SAS (JMP) – Commercial Product
• SPSS – Commercial Product
• MATLAB – Statistical/Math Applications
• R – Programming
Thank you for your attention!