Data Mining: Concepts and Techniques
Dr. Atif Ali Mohamed
Assistant Professor … University of Science and Technology
ICT Head Department
Mobile: 0123393000 … 0912534290
Website: www.dratifnimir.info
E-mail: [email protected]
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Data Mining: Data set
What is a data set?
Types of data sets.
Data Quality.
Data Preprocessing.
What is a data set?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
A collection of attributes describes an object
An object is also known as a record, point, case, sample, entity, tuple, row, or instance
In the table below, each column is an attribute and each row is an object:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
There are different types of attributes
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio
Examples: temperature in Kelvin, length, time, counts
Attribute Type: Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Attribute Type: Ordinal
Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: hardness of minerals, {good, better, best}, grades, street numbers
Attribute Type: Interval
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Attribute Type: Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
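The operations listed for each attribute type can be illustrated with a minimal Python sketch. All variable names and values below are made up for illustration; the point is only that a ratio computed on an interval attribute (Celsius) is not meaningful, while the same ratio on a ratio attribute (Kelvin) is.

```python
# Illustrative sketch: which operations are meaningful for each attribute type.
# All names and values are hypothetical examples, not data from the lecture.

# Nominal: only equality / inequality tests make sense.
eye_a, eye_b = "brown", "blue"
same_eye_color = (eye_a == eye_b)

# Ordinal: values can be ordered, but differences between them are undefined.
GRADE_ORDER = ["good", "better", "best"]

def grade_rank(label):
    """Position of an ordinal label in its defined order."""
    return GRADE_ORDER.index(label)

best_is_higher = grade_rank("best") > grade_rank("good")

# Interval: differences are meaningful, ratios are not.
c1, c2 = 10.0, 20.0
celsius_diff = c2 - c1    # a 10-degree difference is meaningful
celsius_ratio = c2 / c1   # 2.0, but 20 C is not "twice as hot" as 10 C

# Ratio: both differences and ratios are meaningful.
k1, k2 = c1 + 273.15, c2 + 273.15
kelvin_ratio = k2 / k1    # the physically meaningful ratio, close to 1
```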
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a
finite number of digits.
Continuous attributes are typically represented as floating-point
variables.
Types of data sets
Record
  Data Matrix
  Document Data
  Transaction Data
Graph
  World Wide Web
  Molecular Structures
Ordered
  Spatial Data
  Temporal Data
  Sequential Data
  Genetic Sequence Data
Important Characteristics of Structured Data:
– Dimensionality
  Curse of Dimensionality
– Sparsity
  Only presence counts
– Resolution
  Patterns depend on the scale
Record Data
Data that consists of a collection of records, each of
which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
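One way to hold record data in code is a list of objects that all share the same fixed set of fields. The sketch below uses the table's values; the class name `TaxRecord` and the field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class TaxRecord:
    """One data object (row); each field is one attribute (column).
    Class and field names are hypothetical."""
    tid: int
    refund: bool
    marital_status: str
    taxable_income_k: int   # taxable income in thousands
    cheat: bool

records = [
    TaxRecord(1, True, "Single", 125, False),
    TaxRecord(2, False, "Married", 100, False),
    TaxRecord(3, False, "Single", 70, False),
    TaxRecord(4, True, "Married", 120, False),
    TaxRecord(5, False, "Divorced", 95, True),
    TaxRecord(6, False, "Married", 60, False),
    TaxRecord(7, True, "Divorced", 220, False),
    TaxRecord(8, False, "Single", 85, True),
    TaxRecord(9, False, "Married", 75, False),
    TaxRecord(10, False, "Single", 90, True),
]

# Every record has the same attributes, so filtering on one is straightforward.
cheaters = [r.tid for r in records if r.cheat]
```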
Data Matrix
If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a distinct
attribute
Such a data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
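As a small sketch of the "points in a multi-dimensional space" view, the two rows above can be stored as an m by n matrix and compared by distance. Euclidean distance is used here as one common choice; the lecture does not prescribe a particular distance.

```python
import math

# Rows are objects, columns are attributes:
# projection of x load, projection of y load, distance, load, thickness
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(matrix)      # number of objects (rows)
n = len(matrix[0])   # number of attributes (columns)

# Each row is a point in n-dimensional space, so a distance between
# two objects is well defined (Euclidean distance assumed here).
distance = math.dist(matrix[0], matrix[1])
```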
Document Data
Each document becomes a `term' vector,
Each term is a component (attribute) of the vector,
The value of each component is the number of times the
corresponding term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
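A term vector of this kind can be built by counting words. The sketch below is a minimal illustration: the sentence and the short vocabulary are invented, and tokenization is a plain whitespace split, which a real system would replace with something more careful.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Return the count of each vocabulary term in the text, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball"]
doc = "The coach said the team will play and the team will win"

vector = term_vector(doc, vocabulary)
```

Terms that do not occur in the document get a count of zero, which is why document vectors are usually sparse.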
Transaction Data
A special type of record data, where
each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes
a transaction, while the individual products that were
purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
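Because each transaction is a set of items, a set type is a natural representation. The sketch below stores the five transactions above and counts how many contain a given item set; the function name `support_count` is an assumption made for this example.

```python
# Each transaction ID maps to the set of items bought on that trip.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

milk_and_diaper = support_count({"Milk", "Diaper"}, transactions)
beer_and_bread = support_count({"Beer", "Bread"}, transactions)
```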
Graph Data
Examples: Generic graph and HTML Links
[Figure: generic graph]
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers </a>
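HTML links become graph data once each link is treated as an edge from the page that contains it to the page it points to. The sketch below extracts the link targets with Python's standard `html.parser`; the source page name `index.html` is a hypothetical placeholder, since the lecture does not name the page that holds these links.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.targets.append(value)

page = """
<a href="papers/papers.html#bbbb">Data Mining</a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>
"""

collector = LinkCollector()
collector.feed(page)

# Edge list: (source page, link target). "index.html" is an assumed name.
edges = [("index.html", target) for target in collector.targets]
```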
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data
Noise
Noise refers to modification of original values
Examples: distortion of a person’s voice when talking on
a poor phone connection and “snow” on a television screen
[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]
Outliers
Outliers are data objects with characteristics that are
considerably different from most of the other data
objects in the data set
Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by their
probabilities)
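The first three strategies above can be sketched in a few lines of Python. The `ages` list is invented for illustration, `None` stands for a missing value, and filling with the mean is just one simple way to estimate a missing value, not the only one.

```python
from statistics import mean

# Hypothetical attribute values; None marks a missing value.
ages = [25, None, 31, 40, None, 28]

# 1. Eliminate data objects that have a missing value.
complete = [a for a in ages if a is not None]

# 2. Estimate missing values (here: replace with the mean of observed values).
fill = mean(complete)
imputed = [fill if a is None else a for a in ages]

# 3. Ignore the missing value during analysis (skip it when computing a statistic).
mean_ignoring_missing = mean(a for a in ages if a is not None)
```

Eliminating objects shrinks the data set from six objects to four, while the other two strategies keep all six.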
Duplicate Data
A data set may include data objects that are
duplicates, or almost duplicates, of one another
Major issue when merging data from heterogeneous
sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
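As a rough sketch of the multiple-email example, near-duplicate person records can be grouped under a normalized key. Matching on a lower-cased, whitespace-collapsed name is an assumption made here for simplicity; real data cleaning usually needs stronger matching rules, and all records below are invented.

```python
# Hypothetical records merged from two sources.
people = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann  lee", "email": "a.lee@example.org"},
    {"name": "Bob Roy", "email": "bob@example.com"},
]

def normalize(name):
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())

# Group email addresses under one entry per normalized name.
merged = {}
for person in people:
    merged.setdefault(normalize(person["name"]), set()).add(person["email"])
```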
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into
a single attribute (or object)
Purpose
Data reduction
Reduce the number of attributes or objects
Change of scale
Cities aggregated into regions, states, countries, etc.
More “stable” data
Aggregated data tends to have less variability
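A minimal sketch of aggregation as a change of scale: city-level objects are combined into region-level objects by summing one attribute. The regions, cities, and sales figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical city-level data: (region, city, sales).
city_sales = [
    ("North", "A", 120),
    ("North", "B", 80),
    ("South", "C", 200),
    ("South", "D", 50),
    ("South", "E", 150),
]

# Aggregate cities into regions by summing sales.
region_totals = defaultdict(int)
for region, _city, sales in city_sales:
    region_totals[region] += sales

n_before = len(city_sales)     # objects before aggregation
n_after = len(region_totals)   # objects after aggregation
```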
Sampling
Sampling is the main technique employed for data selection.
It is often used for both the preliminary investigation of the
data and the final data analysis.
Statisticians sample because obtaining the entire set of data of
interest is too expensive or time consuming.
Sampling is used in data mining because processing the entire
set of data of interest is too expensive or time consuming.
The key principle for effective sampling is the following:
using a sample will work almost as well as using the entire
data set, if the sample is representative
A sample is representative if it has approximately the same
property (of interest) as the original set of data
Types of Sampling
Simple Random Sampling
There is an equal probability of selecting any particular item
Sampling without replacement
As each item is selected, it is removed from the population
Sampling with replacement
Objects are not removed from the population as they are selected for
the sample.
In sampling with replacement, the same object can be picked more
than once
Stratified sampling
Split the data into several partitions; then draw random samples
from each partition
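The sampling schemes above map directly onto Python's standard `random` module. In this sketch the population, the even/odd partitions, and the helper `stratified_sample` are assumptions for illustration; drawing an equal number from each partition is one variant of stratified sampling, and drawing in proportion to partition size is another.

```python
import random

rng = random.Random(0)          # seeded so the sketch is repeatable
population = list(range(100))   # hypothetical population of 100 objects

# Sampling without replacement: each selected item leaves the population.
without_replacement = rng.sample(population, 10)

# Sampling with replacement: the same item can be picked more than once.
with_replacement = rng.choices(population, k=10)

def stratified_sample(groups, per_group, rng):
    """Draw a simple random sample of per_group items from each partition."""
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

groups = {
    "even": [x for x in population if x % 2 == 0],
    "odd": [x for x in population if x % 2 == 1],
}
stratified = stratified_sample(groups, 5, rng)
```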
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the
space that it occupies
Definitions of density and distance between points, which are critical
for clustering and outlier detection, become less meaningful
• Randomly generate 500 points
• Compute difference between max and min distance between any pair
of points
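The experiment in the two bullets can be reproduced with a short Python sketch. Dividing the max-min difference by the minimum distance is an assumption added here so that results in different dimensions are comparable; the uniform distribution on the unit cube and the chosen dimensions (2, 10, 50) are also assumptions.

```python
import math
import random
from itertools import combinations

def relative_contrast(n_points, n_dims, rng):
    """(max - min) / min over all pairwise Euclidean distances
    between n_points random points in the n_dims-dimensional unit cube."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min

rng = random.Random(42)   # seeded so the sketch is repeatable
contrast_by_dim = {d: relative_contrast(500, d, rng) for d in (2, 10, 50)}
```

As the number of dimensions grows, the contrast shrinks: the nearest and farthest pairs of points become almost equally far apart, which is why distance-based notions lose meaning.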
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data mining
algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce noise
Techniques
Principal Component Analysis (PCA)
Singular Value Decomposition
Others: supervised and non-linear techniques
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
duplicate much or all of the information contained in one
or more other attributes
Example: purchase price of a product and the amount of
sales tax paid
Irrelevant features
contain no information that is useful for the data mining
task at hand
Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Subset Selection
Techniques:
Brute-force approach:
Try all possible feature subsets as input to data mining
algorithm
Embedded approaches:
Feature selection occurs naturally as part of the data mining
algorithm
Filter approaches:
Features are selected before data mining algorithm is run
Wrapper approaches:
Use the data mining algorithm as a black box to find best
subset of attributes
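The brute-force approach can be sketched as an exhaustive search over all non-empty feature subsets. The scoring function `toy_score` below is a hypothetical stand-in for "run the data mining algorithm on this subset and evaluate the result"; the feature names are borrowed from the record data table.

```python
from itertools import combinations

def all_feature_subsets(features):
    """Yield every non-empty subset of features (2**n - 1 subsets)."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset

def best_subset(features, score):
    """Brute force: evaluate every subset and keep the highest-scoring one."""
    return max(all_feature_subsets(features), key=score)

# Hypothetical scoring: reward features assumed to be useful,
# and penalize subset size slightly. Not a real evaluation.
USEFUL = {"refund", "marital_status"}

def toy_score(subset):
    return len(USEFUL & set(subset)) - 0.1 * len(subset)

features = ["tid", "refund", "marital_status", "taxable_income"]
chosen = best_subset(features, toy_score)
n_subsets = sum(1 for _ in all_feature_subsets(features))
```

The number of subsets doubles with every added feature, which is why brute force is only practical for a handful of features.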
Feature Creation
Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes
Three general methodologies:
Feature Extraction
domain-specific
Mapping Data to New Space
Feature Construction
combining features
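Feature construction by combining features can be as simple as deriving one attribute from two others. The example below, density computed from mass and volume, is an illustration chosen for this sketch, with invented values and field names.

```python
# Hypothetical objects described by mass and volume.
objects = [
    {"mass_g": 193.0, "volume_cm3": 10.0},
    {"mass_g": 27.0, "volume_cm3": 10.0},
]

# Construct a new feature by combining two existing ones.
for obj in objects:
    obj["density_g_per_cm3"] = obj["mass_g"] / obj["volume_cm3"]

densities = [obj["density_g_per_cm3"] for obj in objects]
```

The two objects have the same volume and differ in mass, but the single constructed feature separates them directly.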
End of the Lecture
Thanks