Unit 4: Data Mining & Knowledge Discovery
Definitions & Motivations

Why Data Mining?
Explosive growth of data: the Web, etc., from terabytes to petabytes
Data collection and data availability: crawlers, database systems

Sources
Business: Web, e-commerce, transactions, etc.
Science: remote sensing, bioinformatics, etc.
Society and everyone: news, YouTube, etc.

Problem: We are drowning in data, but starving for knowledge!

Solution: Use data mining tools for automated analysis of massive data sets.
Why Data Mining

➝Credit ratings/targeted marketing:


⇾Given a database of 100,000 names, which persons are the least
likely to default on their credit cards?
⇾Identify likely responders to sales promotions

➝Fraud detection
⇾Which types of transactions are likely to be fraudulent, given the
demographics and transactional history of a particular customer?

➝Customer relationship management:
⇾Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data Mining helps extract such information


What is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amount of
data
Data mining: a misnomer?
Mining gold from rock is called gold mining, not "stone mining"; by analogy, extracting knowledge from data should arguably be called knowledge mining rather than "data mining".
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Applications
➝Banking: loan/credit card approval
⇾predict good customers based on old customers

➝Customer relationship management:


⇾identify those who are likely to leave for a competitor.

➝Targeted marketing:
⇾identify likely responders to promotions

➝Fraud detection: telecommunications, financial


transactions
⇾from an online stream of event identify fraudulent events

➝Manufacturing and production:


⇾automatically adjust knobs when process parameter changes
Applications (continued)

➝Medicine: disease outcome, effectiveness of treatments


⇾analyze patient disease history: find relationship between diseases

➝Molecular/Pharmaceutical: identify new drugs

➝Scientific data analysis:


⇾identify new galaxies by searching for sub clusters

➝Web site/store design and promotion:


⇾find affinity of visitor to pages and modify layout
Examples: What is (not) Data Mining?

What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about "Amazon"

What is Data Mining?
– Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g., the Amazon rainforest vs. Amazon.com)
Query Examples

➝Database
– Find all credit applicants with last name of Smith.

– Identify customers who have purchased more than $10,000 in the last month.

– Find all customers who have purchased milk

➝Data Mining
– Find all credit applicants who are poor credit risks. (classification)

– Identify customers with similar buying habits. (clustering)

– Find all items which are frequently purchased with milk. (association rules)
Knowledge Discovery from DATA

Why do we need KDD?

Knowledge Discovery in Databases

Data Mining as a step in the knowledge discovery process
Data mining plays an essential role in the knowledge discovery process.
Knowledge Discovery Process

The steps are: identifying the goals; data selection, acquisition & integration; data cleaning; data reduction and projection; matching the goals; exploratory data analysis; data mining; interpretation and testing; consolidation & use.

STEP 1: IDENTIFYING THE GOAL
• The first step is developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer's viewpoint.
STEP 2: CREATING A TARGET DATA SET
• Selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
STEP 3: DATA CLEANING AND PREPROCESSING
• Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.
STEP 4: DATA REDUCTION AND PROJECTION
• Finding useful features to represent the data depending on the goal of the task.
• With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.
STEP 5: MATCHING THE GOALS
• Matching the goals of the KDD process to a particular data-mining method such as summarization, classification, regression, clustering, association, etc.
STEP 6: EXPLORATORY ANALYSIS AND MODEL & HYPOTHESIS SELECTION
• Choosing the data mining algorithm(s) and selecting the method(s) to be used for searching for data patterns.
• This process includes deciding which models and parameters might be appropriate and matching a particular data-mining method with the overall criteria of the KDD process.
Data Mining Techniques

Descriptive:
– Clustering
– Association
– Sequential Analysis

Predictive:
– Classification
– Decision Tree
– Rule Induction
– Neural Networks
– Nearest Neighbor Classification
– Regression
STEP 7: DATA MINING
▪ Searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering.
▪ The user can significantly aid the data-mining method by correctly performing the preceding steps.
STEP 8: INTERPRETATION & TESTING
• Interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration.
• This step can also involve visualization of the extracted patterns and models, or visualization of the data given the extracted models.
STEP 9: KNOWLEDGE PRESENTATION (CONSOLIDATION & USE)
• Using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties.
• This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.
OLAP - On-line Analytical Processing
◦ Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening.
 Data mining algorithms need large amounts of data,
more so at the detailed level. Most data warehouses
contain data at the lowest level of granularity.

 Data mining flourishes on integrated and cleansed data. If the ETL functions were carried out properly, the data warehouse will contain such data, which is very suitable for data mining.

 The infrastructure for data warehouses is already


robust, with parallel processing technology and
powerful relational database systems. Because such
scalable hardware is already in place, no new
investment is needed to support data mining.

 A Predictive model makes a prediction about
values of data using known results found
from different data
 Predictive modeling may be based on the use
of other historical data

 A Descriptive model identifies patterns or


relationships in data. It serves as a way to
explore the properties of the data examined,
not to predict new properties.

 Classification maps data into predefined
groups or classes
◦ Supervised learning
◦ Pattern recognition

 Clustering groups similar data together


into clusters.
◦ Unsupervised learning
◦ Segmentation or Partitioning of data into sets

 Link Analysis uncovers relationships among
data.
◦ Affinity Analysis
◦ Association Rules
◦ Sequential Analysis determines sequential patterns.

 Also called link analysis or affinity analysis.
 It is one of the data mining techniques used to
uncover the relationships among data.
 An association rule is a model that identifies
specific types of data associations.
 Association rule can be used to assist retail
store management in effective advertising,
marketing & inventory control.

Supervised learning
• Classification and regression

Unsupervised learning
• Clustering
• Associations

 Maps the data into predefined groups or classes.
 Also called supervised learning, as classes are
determined before examining the data.
 Classification algorithms require that classes be defined
based on data attribute values.
 Classes are described by looking at characteristics of
data that is already known.
 Pattern recognition is a type of classification where an
input pattern is classified into one of several classes
based on its similarity to predefined classes.

 An emergency room in a hospital measures 17
variables (e.g., blood pressure, age, etc) of
newly admitted patients.
 A decision is needed: whether to put a new
patient in an intensive-care unit (ICU) .
 Due to the high cost of ICU, those patients
who may survive less than a month are given
higher priority.
 Problem: to predict high-risk patients and
discriminate them from low-risk patients.

 A credit card company receives thousands of
applications for new cards. Each application
contains information about an applicant,
◦ age
◦ Marital status
◦ annual salary
◦ outstanding debts
◦ credit rating
◦ etc.
 Problem: to decide whether an application
should be approved, or to classify applications
into two categories, approved and not
approved.

 Like human learning from past experiences.
 A computer does not have “experiences”.
 A computer system learns from data, which
represent some “past experiences” of an
application domain.
 Our focus: learn a target function that can be
used to predict the values of a discrete class
attribute, e.g., approve or not-approved, and
high-risk or low risk.
 The task is commonly called: Supervised
learning, classification, or inductive learning.
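
To make the idea concrete, here is a minimal sketch of supervised learning for the credit-approval example, assuming scikit-learn is available; the feature names, training records, and outcomes are invented for illustration.

```python
# Minimal supervised-learning sketch: hypothetical applications, assumes scikit-learn.
from sklearn.tree import DecisionTreeClassifier

# Past "experiences": [age, annual_salary, outstanding_debt] with known outcomes.
X_train = [[25,  30000, 15000],
           [40,  90000,  5000],
           [35,  60000, 20000],
           [50, 120000,  1000],
           [23,  20000, 18000]]
y_train = ["not approved", "approved", "not approved", "approved", "not approved"]

# Learn a target function mapping attribute values to the discrete class attribute.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predict the class of a new, previously unseen application.
print(model.predict([[30, 70000, 2000]]))  # prints the predicted class label
```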

 Similar to classification except that the groups
are not predefined.
 Also referred to as unsupervised learning or
segmentation.
 Clustering is accomplished by determining the
similarity among the data on predefined
attributes.
 Most similar data are grouped into clusters
 As clusters are not predefined, a domain
expert is required to interpret the meaning of
created clusters.

 Supervised learning: classification is seen as
supervised learning from examples.
◦ Supervision: The data (observations, measurements,
etc.) are labeled with pre-defined classes. It is like
that a “teacher” gives the classes (supervision).
◦ Test data are classified into these classes too.
 Unsupervised learning (clustering)
◦ Class labels of the data are unknown
◦ Given a set of data, the task is to establish the
existence of classes or clusters in the data

DATA PREPROCESSING
Confluence of Multiple Disciplines

Data mining draws on multiple disciplines: database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines.
Multi-Dimensional Measure of Data Quality

▪ A well-accepted multidimensional view:


– Accuracy

– Completeness

– Consistency

– Timeliness

– Believability

– Interpretability
Data Preprocessing Steps

▪ Data Cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
▪ Data Integration
– Combines data from multiple sources (multiple databases,
data cubes, or flat files) and provides a unified view of the data.
▪ Data Transformation & discretization
– Transform data in appropriate forms suitable for mining
process. Ex. Normalization, aggregation, discretization
▪ Data Reduction
– Obtains reduced representation in volume but produces the
same or similar analytical results. Ex. Data Cube Aggregation,
Attribute Subset Selection
Forms of Data Preprocessing
Data Preprocessing
Data Cleaning

▪ Importance
– “Data cleaning is one of the three biggest problems in data
warehousing”—Ralph Kimball
– “Data cleaning is the number one problem in data
warehousing”—DCI survey

▪ Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
Data Cleaning

▪ Data in the real world is dirty


– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data.
e.g., occupation=“”
– noisy: containing errors or outliers. e.g., Salary=“-10”,
Age=“222”
– inconsistent: containing discrepancies in codes or names.

• e.g., Age=“42” Birthday=“03/07/1997”

• e.g., Was rating “1,2,3”, now rating “A, B, C”

• e.g., discrepancy between duplicate records


Missing Data

▪ Data is not always available. Many tuples have no recorded


value for several attributes, such as customer income in sales
data
▪ Missing data may be due to
– Equipment malfunction
– Inconsistent with other recorded data and thus deleted
– Data not entered due to misunderstanding
– “Not applicable” data value or when collected
– Certain data not considered important at time of entry
– Different considerations between the time when the data was
collected and when it is analyzed.
– Human/hardware/software problems
Noisy Data

▪ Noise: random error or variance in a measured variable

▪ Noisy data (incorrect values) may come from


– Faulty data collection instruments
– data entry problems

– data transmission problems

– technology limitation

– inconsistency in naming convention

▪ Inconsistent data may come from


– Different data sources
– Functional dependency violation (e.g., modify some linked
data)
▪ Duplicate records also need data cleaning
How to Handle Missing Data?

▪ Ignore the tuple: not effective when the percentage of missing


values per attribute varies considerably.
▪ Fill in the missing value manually: tedious + infeasible?
▪ Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?
– the attribute mean/ median/ mode

– the attribute mean for all samples belonging to the same class

– the most probable value: inference-based such as Bayesian


formula or decision tree
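
As a brief illustration, the pandas sketch below applies two of the automatic fill-in strategies listed above (a global constant and the attribute mean); the column names and values are hypothetical.

```python
# Illustrative missing-value handling with pandas (hypothetical data).
import pandas as pd

df = pd.DataFrame({"income": [30000, None, 52000, 61000, None],
                   "occupation": ["clerk", None, "engineer", "clerk", "teacher"]})

# Fill a categorical attribute with a global constant such as "unknown".
df["occupation"] = df["occupation"].fillna("unknown")

# Fill a numeric attribute with its mean (median or mode work the same way).
df["income"] = df["income"].fillna(df["income"].mean())

print(df)
```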
How to Handle Noisy Data?

▪ Binning
– first sort data and partition into (equal-frequency) bins

– then one can smooth by bin means, smooth by bin median,


smooth by bin boundaries, etc.
▪ Regression
– smooth by fitting the data into regression functions

▪ Clustering
– detect and remove outliers

▪ Combined computer and human inspection


– detect suspicious values and check by human (e.g., deal
with possible outliers)
Binning Methods

▪ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
▪ Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
▪ Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
▪ Smoothing by bin boundaries (each value is replaced by the closer of its bin's minimum and maximum; e.g., in Bin 1, 8 is closer to 4 since 8-4=4 < 15-8=7, and 9 is closer to 4 since 9-4=5 < 15-9=6):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
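
The same price example can be reproduced in a few lines of plain Python, partitioning into equal-frequency bins and then smoothing by bin means and by bin boundaries:

```python
# Equal-frequency binning and smoothing for the sorted price data above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4  # equal-frequency (equi-depth) bins of 4 values each

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value becomes the mean of its bin.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of its bin's min and max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```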
Data Integration

▪ Combines data from multiple sources into a coherent store


▪ Detecting and resolving data value conflicts
▪ For the same real world entity, attribute values from different
sources are different
▪ Possible reasons: different representations, different scales,
e.g., metric vs. British units

▪ Entity identification problem
– Schema integration: match equivalent attributes across sources (e.g., the same customer-id attribute named differently in each source)
– Object matching: identify the same real-world entities across multiple data sources, e.g., Bill Clinton = William Clinton
Handling Redundancy in Data Integration

▪ Redundant data occur often when integration of multiple


databases
– Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
– Inconsistencies in attribute naming can also cause
redundancies in resulting data set

▪ Redundant attributes may be able to be detected by


correlation analysis
Correlation Analysis (Numerical Data)

▪ Correlation coefficient (Pearson’s product moment


coefficient)
$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(AB)$ is the sum of the AB cross-product.

▪ If rA,B > 0, A and B are positively correlated (A’s values


increase as B’s). The higher, the stronger correlation.
▪ rA,B = 0: independent;
▪ rA,B < 0: negatively correlated
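
A short plain-Python sketch of the formula, computed on two small invented attribute value lists:

```python
# Pearson correlation coefficient r_{A,B} for two attributes (hypothetical values).
import math

A = [2, 4, 6, 8, 10]
B = [1, 3, 7, 9, 12]
n = len(A)

mean_A, mean_B = sum(A) / n, sum(B) / n
# Sample standard deviations, matching the (n - 1) in the formula's denominator.
std_A = math.sqrt(sum((a - mean_A) ** 2 for a in A) / (n - 1))
std_B = math.sqrt(sum((b - mean_B) ** 2 for b in B) / (n - 1))

r = sum((a - mean_A) * (b - mean_B) for a, b in zip(A, B)) / ((n - 1) * std_A * std_B)
print(round(r, 3))  # close to +1: A and B are strongly positively correlated
```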
Correlation Analysis (Categorical Data)

▪ Χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

▪ The larger the Χ2 value, the more likely the variables are
related
▪ Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

▪ Χ2 chi-square calculation (numbers in parenthesis are


expected counts calculated based on the data distribution in
the two categories)
$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
▪ It shows that like_science_fiction and play_chess are correlated in the group.
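
The same calculation in a short plain-Python sketch (scipy.stats.chi2_contingency could also be used, though by default it applies Yates' continuity correction to 2x2 tables):

```python
# Chi-square statistic for the 2x2 science-fiction / chess contingency table above.
observed = [[250, 200],    # like science fiction
            [50, 1000]]    # not like science fiction

row_sums = [sum(row) for row in observed]         # 450, 1050
col_sums = [sum(col) for col in zip(*observed)]   # 300, 1200
total = sum(row_sums)                             # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_sums[i] * col_sums[j] / total   # e.g. 450 * 300 / 1500 = 90
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))  # 507.94 (shown truncated to 507.93 above)
```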
Data Transformation

▪ Ex. birth date → age, rupees → dollars

▪ Smoothing
– Binning, regression, clustering

▪ Normalization
– scale the data values in a specified range (Ex. -1.0 to 1.0 or
0.0 to 1.0)

▪ Aggregation
– In Aggregation, summary or aggregation operations are
applied to the data. For example, daily sales data may be
aggregated so as to compute monthly and annual total
amounts.
▪ Discretization and concept hierarchy generation
Cluster Analysis
(figure: data points grouped into clusters; outliers fall outside the clusters)

Regression
(figure: data smoothed by fitting it to the line y = x + 1)
Data Transformation: Normalization

▪ Min-max normalization: to [new_minA, new_maxA]


$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

▪ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
Data Transformation: Normalization

▪ Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

▪ Ex. Let μ = 54,000, σ = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

▪ Normalization by decimal scaling:

$$v' = \frac{v}{10^{\,j}}$$

where j is the smallest integer such that max(|v'|) < 1
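
A small plain-Python sketch applying the three normalizations above to the $73,600 income example:

```python
# Min-max, z-score, and decimal-scaling normalization of the income value $73,600.
v = 73_600

# Min-max normalization to [0.0, 1.0] given min = 12,000 and max = 98,000.
min_a, max_a = 12_000, 98_000
v_minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0
print(round(v_minmax, 3))   # 0.716

# Z-score normalization given mean = 54,000 and standard deviation = 16,000.
mu, sigma = 54_000, 16_000
v_zscore = (v - mu) / sigma
print(round(v_zscore, 3))   # 1.225

# Decimal scaling: divide by 10^j, with j the smallest integer making |v'| < 1.
j = len(str(abs(v)))        # 5 digits -> j = 5 for this integer value
v_decimal = v / 10 ** j
print(v_decimal)            # 0.736
```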
Discretization

▪ Discretization
– Used to divide the range of continuous attribute into
intervals. Numerous continuous attribute values are
replaced by small interval labels.
– Ex. Create discrete categories such as below 18 yrs, 18-30 yrs, …

– Discretization can be done by binning, histogram analysis,


cluster, decision tree, correlation analyses
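
As a brief illustration, a pandas sketch that discretizes a continuous age attribute, both with fixed cut points and with equal-frequency binning; the ages are invented:

```python
# Discretize a continuous age attribute into interval labels (hypothetical ages, pandas).
import pandas as pd

ages = pd.Series([12, 17, 19, 25, 31, 42, 58, 70])

# Fixed cut points: below 18, 18-30, 31-60, above 60.
categories = pd.cut(ages, bins=[0, 17, 30, 60, 120],
                    labels=["<18", "18-30", "31-60", ">60"])
print(categories.tolist())

# Equal-frequency (equi-depth) discretization into 4 intervals instead.
print(pd.qcut(ages, q=4).tolist())
```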
Concept Hierarchy Generation

▪ Concept Hierarchy Generation


– Used to reduce the data by collecting and replacing low-
level concepts with higher-level concepts.
– Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be generalized to "country", and age to young, middle-aged, or senior.
country (15 distinct values)
  province_or_state (365 distinct values)
    city (3,567 distinct values)
      street (674,339 distinct values)
Data Reduction

▪ Data reduction
– Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the
same) analytical results
– Reduce some variables in table (Ex. Attribute Selection )
– Reduce some values in table (Ex. Discretization)
– Reduce some cases(rows) from the table (Ex. Sampling)

▪ Why data reduction?


– A database/data warehouse may store terabytes of data
– Complex data analysis/mining may take a very long time to
run on the complete data set
Data Reduction Strategies

▪ Data cube aggregation


– Aggregation operation is applied to data for the construction
of the data cube.
▪ Dimensionality reduction
– variable composition
– Variable Selection

▪ Data Compression
– Encoding mechanisms are used to reduce data set size

▪ Numerosity reduction
– Store a model of the data instead of the whole data, i.e., fit the data
to a model. Ex. Histogram, clustering, sampling, regression
▪ Discretization and concept hierarchy generation
Data Cube Aggregation

▪ Multiple levels of aggregation in data cubes further reduces


the size of data to deal with
Data Compression
(figure: with lossless compression the original data can be reconstructed exactly from the compressed data; with lossy compression only an approximation of the original data is recovered)
Major Techniques of Dimensionality
Reduction

▪ Feature reduction (Feature/ variable composition)


– Discarding the less important features and making new
synthetic features from linear combination of the original
ones.
– All original features used but in form of their linear
combinations
– Ex. Average miles driven per year = mileage/(current year
– year +1)

▪ Feature Selection (Attribute Subset /Variable Selection)


– Find a subset of original set of variables, or features, to get
a smaller subset which can be used to model the problem.
Greedy Method for Attribute Subset Selection
(figure: keep adding the best attributes to the tree; all attributes that do not appear in the tree are irrelevant)
Numerosity Reduction

▪ Reduce data volume by choosing alternative, smaller forms


of data representation

▪ Parametric methods
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
– Example: Log-linear models

▪ Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling
Regression and Log-Linear Models

▪ Linear regression
– Data are modeled to fit a straight line
– Often uses the least-square method to fit the line

▪ Multiple regression
– allows a response variable Y to be modeled as a linear
function of multidimensional feature vector

▪ Log-linear model
– Approximates discrete multidimensional probability
distributions
Linear Regressions
Histograms

▪ Divide data into buckets and store the average (or sum) for each bucket (binning).
Histograms

▪ Partitioning rules:
– Equal-width: equal bucket range
– Equal-frequency (or equal-depth): roughly equal number of values per bucket
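
A short numpy sketch contrasting the two partitioning rules on the price values used in the earlier binning example:

```python
# Equal-width vs. equal-frequency bucket boundaries for the earlier price data (numpy).
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 buckets spanning equal ranges between the minimum and maximum.
counts, edges = np.histogram(prices, bins=3)
print(edges)   # [ 4. 14. 24. 34.]  - evenly spaced boundaries
print(counts)  # [3 3 6]            - values falling in each bucket

# Equal-frequency: boundaries chosen so each bucket holds about the same number of values.
print(np.quantile(prices, [0.0, 1/3, 2/3, 1.0]))
```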
Clustering

▪ Partition data set into clusters based on similarity, and store


cluster representation (e.g., centroid and diameter) only
▪ Can be very effective if data is clustered but not if data is
“dirty”
▪ Elements within the cluster are similar and elements
across the cluster are dissimilar
Sampling

▪ A large data set is represented by a much smaller random sample or subset.
▪ Stratified sampling ensures each type (stratum) is included.
▪ Sampling identifies a representative subset of the data.
Types of Sampling

▪ Simple random sampling


– There is an equal probability of selecting any particular item
▪ Simple random sampling without replacement (SRSWOR)
– Once an object is selected, it is removed from the
population
▪ Simple random sampling with replacement (SRSWR)
– A selected object is not removed from the population
▪ Stratified sampling
– Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of
the data)
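
A small sketch of the three sampling schemes using only Python's standard library; the population and strata are invented:

```python
# SRSWOR, SRSWR, and stratified sampling sketches (hypothetical population).
import random

random.seed(0)
population = list(range(1, 21))          # 20 items

srswor = random.sample(population, 5)    # without replacement: an item can appear once
srswr = random.choices(population, k=5)  # with replacement: an item can repeat
print(srswor, srswr)

# Stratified sampling: draw proportionally from each partition (stratum).
strata = {"low": list(range(1, 13)), "high": list(range(13, 21))}   # 60% / 40% of the data
fraction = 0.25
stratified = {name: random.sample(items, max(1, round(len(items) * fraction)))
              for name, items in strata.items()}
print(stratified)
```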
Sampling: with or without Replacement
(figure)

Types of Sampling
(figure: raw data partitioned and sampled)
Discretization

▪ To transform a continuous attribute into a categorical attribute


– Divide the range of a continuous attribute into intervals
– Some classification algorithms only accept categorical
attributes. E.g. Apriori for ARM
– Reduce data size by discretization
– Prepare for further analysis
– Better results may be obtained with discretized attributes
▪ Three types of attributes:
– Nominal —unordered set, e.g., color, profession
– Ordinal — ordered set, e.g., military or academic rank
– Continuous — real numbers, e.g., integer or real numbers
Discretization
University Questions

▪ Issues in data integration
▪ KDD process
