CS822 Data Mining
Instructor: Dr. Muhammad Tahir
Data Objects and Attribute Types
• Data sets are made up of data objects.
• A data object represents an entity—
• in a sales database, the objects may be customers, store
items, and sales;
• in a medical database, the objects may be patients;
• in a university database, the objects may be students,
professors, and courses.
• Data objects are typically described or represented by
attributes.
• Data objects can also be referred to as samples, examples,
instances, data points, or objects.
Attributes
• An attribute is a data field, representing a characteristic or
feature of a data object.
• The nouns attribute, dimension, feature, and variable are
often used interchangeably in the literature.
• The term dimension is commonly used in data
warehousing.
• Machine learning literature tends to use the term
feature.
• Statisticians prefer the term variable.
• Data mining and database professionals commonly use
the term attribute.
Attribute Types
• The type of an attribute is determined by the set of
possible values the attribute can have.
• These are the four types:
1) Nominal Attributes
2) Binary Attributes
3) Ordinal Attributes
4) Numeric Attributes
• Interval-Scaled Attributes
• Ratio-Scaled Attributes
Attribute Types
• The type of an attribute is determined by the set of possible values
the attribute can have. These are the four types:
1) Nominal Attributes: Each value represents some kind of
category, code, or state, and so nominal attributes are also
referred to as categorical. The values do not have any
meaningful order. E.g. Hair color, marital status, occupation, ID
numbers, zip codes
2) Binary Attributes: A binary attribute is a nominal attribute
with only two categories or states: 0 or 1, where 0 typically
means that the attribute is absent, and 1 means that it is
present. Binary attributes are referred to as Boolean if the two
states correspond to true and false. E.g. Medical test result or
gender
Attribute Types
3) Ordinal Attributes: An ordinal attribute has possible
values with a meaningful order or ranking among them,
but the magnitude between successive values is not
known. E.g.
• Size = {small, medium, large}
• Grades = {A, B, C, D, F}
• Army rankings, etc.
Attribute Types
4) Numeric Attributes: A numeric attribute is quantitative;
that is, it is a measurable quantity, represented in integer
or real values. Numeric attributes can be interval-scaled or
ratio-scaled.
• Interval-Scaled Attributes are measured on a
scale of equal-size units and have no true zero-point.
E.g., temperature in °C or °F, calendar dates.
• Ratio-Scaled Attributes are numeric attributes with
an inherent zero-point, so the ratio between the
attribute values of two data objects can be calculated.
E.g., area, weight, height, length, counts, monetary
quantities.
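The difference between the two numeric scales can be checked with a few lines of Python. This is a small illustration that is not part of the original slides; the helper name `celsius_to_kelvin` and the two temperature readings are assumptions chosen for the example. It shows that differences survive a change of zero point, while ratios do not.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return c + 273.15

# Two hypothetical temperature readings in degrees Celsius.
t1_c, t2_c = 10.0, 20.0
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

# Differences are meaningful on an interval scale:
# the gap is the same whichever zero point is used.
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k

# Ratios are not: 20 °C is not "twice as hot" as 10 °C.
ratio_c = t2_c / t1_c   # 2.0, an artifact of where 0 °C was placed
ratio_k = t2_k / t1_k   # about 1.035 on the ratio-scaled Kelvin scale
```

Because the Kelvin scale has an inherent zero-point, only `ratio_k` can be read as "how many times larger".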
Discrete vs. Continuous Attributes
• Discrete Attribute (Nominal, Binary and Ordinal)
• Has only a finite or countably infinite set of values, e.g., zip
codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Continuous or Numeric Attribute (Ratio and Interval)
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented
using a finite number of digits
• Continuous attributes are typically represented as floating-
point variables
Basic Statistical Descriptions of Data
• Motivation
• To better understand the data
• How?
• by measuring data’s central tendency and distribution
(variation and spread).
• Measuring the Central Tendency characteristics
• Mean, Median, Mode and Midrange.
• Measuring the Data dispersion (or distribution) characteristics
• Range, max, min, quantiles, outliers, variance and standard
deviation.
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Mean is the average of all the data values and marks the
center of the data. It is calculated by dividing the sum of
all values by the sample size:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
• Trimmed mean
• The mean can also be calculated on trimmed data,
i.e., after removing the extreme values.
• Weighted average or weighted arithmetic mean
• Differs from the regular mean by giving each value a
weight that reflects its significance or importance:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
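The three means above can be sketched in Python as follows. This is a minimal illustration rather than part of the original slides: the function names, the symmetric trimming rule (drop the `k` smallest and `k` largest values), and the sample numbers are all assumptions.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if 2 * k >= len(values):
        raise ValueError("trimming would remove every value")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Made-up data: one extreme value (100) pulls the plain mean upward.
scores = [2, 4, 6, 8, 100]
plain = mean(scores)                         # 24.0
trimmed = trimmed_mean(scores, 1)            # 6.0, computed from [4, 6, 8]
weighted = weighted_mean([80, 90], [1, 3])   # 87.5
```

The gap between `plain` and `trimmed` shows how strongly a single extreme value can affect the mean.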
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Median
• After sorting the data, the median is the middle value
if the number of values is odd; otherwise it is the
sum of the two middle values divided by 2.
• Sorting can be computationally expensive for large data
sets. However, the median can be approximated
without sorting.
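The odd/even rule can be written out directly. The sketch below is an illustration with an assumed function name and made-up inputs; it computes the exact median by sorting and does not attempt the approximation mentioned above.

```python
import statistics

def median(values):
    """Exact median: middle value, or average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average of two

odd_case = median([7, 1, 5])       # 5
even_case = median([7, 1, 5, 3])   # 4.0, the average of 3 and 5

# The standard library gives the same answer.
library_even = statistics.median([7, 1, 5, 3])
```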
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Mode is the value that occurs most frequently in the data
• Sometimes multiple values share the same highest
frequency, so a data set can be unimodal or
multimodal (e.g., bimodal, trimodal)
• Only one value with the highest frequency =
unimodal
• Two values tied for the highest frequency =
bimodal
• Three values tied for the highest frequency =
trimodal
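A short Python sketch of this classification, using `statistics.multimode` from the standard library (available from Python 3.8). The function name `describe_modes` and the sample lists are assumptions made for illustration.

```python
from statistics import multimode

def describe_modes(values):
    """Return the list of modes and a label for how many there are."""
    modes = multimode(values)  # every value tied for the highest frequency
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, labels.get(len(modes), "multimodal")

one_mode = describe_modes([1, 2, 2, 3])         # ([2], 'unimodal')
two_modes = describe_modes([1, 1, 2, 2, 3])     # ([1, 2], 'bimodal')
three_modes = describe_modes([1, 1, 2, 2, 3, 3])
```

Note that when every value occurs exactly once, `multimode` returns all of them, so the label is only informative when some value repeats.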
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Midrange is another measure of central tendency. It is
simply the average of the min and max values of the
data.
• This is easy to compute using the SQL aggregate
functions, max() and min().
• When data have a symmetric, unimodal distribution, all
central tendency measures return the same center value.
• But real data usually are not symmetric!
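The effect of symmetry can be seen on two small made-up lists. The sketch below is illustrative only; `midrange` and `summary` are assumed helper names, and the second list differs from the first in a single extreme value.

```python
import statistics

def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2

def summary(values):
    """Mean, median, mode and midrange as a tuple."""
    return (
        sum(values) / len(values),
        statistics.median(values),
        statistics.mode(values),
        midrange(values),
    )

symmetric = [1, 2, 3, 3, 3, 4, 5]
skewed = [1, 2, 3, 3, 3, 4, 40]    # one large value breaks the symmetry

symmetric_summary = summary(symmetric)   # (3.0, 3, 3, 3.0): all agree
skewed_summary = summary(skewed)         # (8.0, 3, 3, 20.5): they diverge
```

The median and mode stay at 3 for the skewed list, while the mean and especially the midrange are pulled toward the extreme value.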
Measuring the Central Tendency
characteristics – Example
• Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110.
• Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Trimmed mean
• In this example, remove 30, 36 and 110. Then, recalculate.
• Median
• Mode
• 52 and 70 are the modes (bimodal)
• Midrange
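The measures for the salary example can be checked with a short Python sketch. It uses only the standard library; the variable names are illustrative, and the trimmed mean follows the slide's choice of dropping 30, 36 and 110.

```python
from statistics import median, multimode

# Salaries in thousands of dollars, as listed in the example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

salary_mean = sum(salaries) / len(salaries)          # 696 / 12 = 58.0

# Trimmed mean: drop the three values named in the example, then recalculate.
trimmed = [s for s in salaries if s not in (30, 36, 110)]
salary_trimmed_mean = sum(trimmed) / len(trimmed)    # 520 / 9, about 57.78

salary_median = median(salaries)                     # (52 + 56) / 2 = 54.0
salary_modes = multimode(salaries)                   # [52, 70]
salary_midrange = (min(salaries) + max(salaries)) / 2  # (30 + 110) / 2 = 70.0
```

The midrange (70) sits well above the mean (58) and median (54) because the single value 110 determines the maximum.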
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
[Figure: three distributions, labeled symmetric, positively skewed, and negatively skewed]
Basic Statistical Descriptions of Data
Measuring the Dispersion (distribution) of Data
• Quartiles
• Quartiles divide a dataset into four equal parts.
• They help understand the spread and distribution of data.
• There are 3 quartiles:
• Q1 (First Quartile): 25% of data lies below Q1.
• Q2 (Second Quartile): This is the Median — 50% of data lies below Q2.
• Q3 (Third Quartile): 75% of data lies below Q3.
• Why do we use quartiles?
• To measure spread and central tendency.
• To identify where a data point falls in the dataset.
• Useful in creating boxplots and detecting outliers.
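Quartiles can be computed with `statistics.quantiles` from the Python standard library (Python 3.8+). The sketch below reuses the salary values from the earlier example as sample data. Several conventions exist for interpolating quartiles; `method="inclusive"` is one assumption among them, and other tools or textbooks may report slightly different values for the same data.

```python
from statistics import quantiles

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 asks for the three cut points that split the data into four parts.
# method="inclusive" interpolates linearly between neighbouring sorted values.
q1, q2, q3 = quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1   # interquartile range

def share_below(values, cut):
    """Fraction of values strictly below a cut point."""
    return sum(v < cut for v in values) / len(values)

below_q1 = share_below(salaries, q1)   # 0.25
below_q2 = share_below(salaries, q2)   # 0.5
below_q3 = share_below(salaries, q3)   # 0.75
```

With this convention the quartiles are 49.25, 54.0 and 64.75, and Q2 matches the median.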
Basic Statistical Descriptions of Data
Measuring the Dispersion (distribution) of Data
• Outliers:
• Outliers are data points that lie far away from most of the data.
• They are typically identified using the Interquartile Range (IQR).
• Rule for detecting outliers: a value is flagged as an outlier if it
lies below the lower bound or above the upper bound.
• Lower Bound = Q1 − 1.5 × IQR
• Upper Bound = Q3 + 1.5 × IQR
• (where IQR = Q3 − Q1)
• Why do we use outlier detection?
• Outliers can skew data analysis and affect measures like mean
and standard deviation.
• Detecting and handling outliers is crucial for accurate analysis
and reliable models.
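The 1.5 × IQR rule translates directly into code. The following is a minimal sketch, not a definitive implementation: the function names are assumptions, the quartiles come from `statistics.quantiles` with the `"inclusive"` convention (other conventions shift the bounds slightly), and the salary values from the earlier example serve as sample data.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Lower and upper outlier bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Values that fall outside the IQR-based bounds."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
lower, upper = iqr_bounds(salaries)   # 26.0 and 88.0
outliers = find_outliers(salaries)    # [110]
```

Only 110 lies beyond the upper bound, which matches the intuition that it sits far from the rest of the salaries.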
Basic Statistical Descriptions of Data
Measuring the Dispersion (distribution) of Data
• Boxplots:
• A boxplot (or box-and-whisker plot) is a visual summary of data
distribution.
• It displays:
• Minimum value (excluding outliers)
• Q1 (First Quartile)
• Median (Q2)
• Q3 (Third Quartile)
• Maximum value (excluding outliers)
• Outliers (marked as points outside the whiskers)
• Why do we use boxplots?
• Provides a clear visual summary of data spread and central tendency.
• Helps compare distributions between datasets.
• Makes it easy to spot outliers.
Basic Statistical Descriptions of Data
Measuring the Dispersion (distribution) of Data
• Variance and standard deviation (sample: s,
population: σ)
• Variance (algebraic, scalable computation), where μ is the
mean of the N values:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
• Standard deviation s (or σ) is the square root of the
variance s² (or σ²)
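Both forms of the population variance can be compared in a short Python sketch. The function names and the sample data are assumptions for illustration. The sample variance with an n − 1 denominator is the usual textbook definition and is included only for contrast; its formula does not appear on the slide.

```python
import math

def population_variance(values):
    """Definition: mean squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def population_variance_shortcut(values):
    """Algebraic form: mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2

def sample_variance(values):
    """Sample variance s^2, dividing by n - 1 instead of n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]                 # made-up values with mean 5
var_definition = population_variance(data)      # 32 / 8 = 4.0
var_shortcut = population_variance_shortcut(data)
std_population = math.sqrt(var_definition)      # 2.0
var_sample = sample_variance(data)              # 32 / 7, about 4.571
```

The shortcut form needs only running totals of x and x², which is why it scales well, but with floating-point data of large magnitude it can lose precision compared with the definition.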
Dispersion (distribution) of Data
• Popular plots for visualizing data distribution
• Boxplot: graphic display of five-number summary
(min (excluding outliers), Q1, median, Q3, max
(excluding outliers))
• Histogram: the x-axis shows values, the y-axis shows
frequencies
• Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
Boxplot
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and
third quartiles, i.e., the height of the box
is IQR
• The median is marked by a line within the
box
• Whiskers: two lines outside the box extend to the
minimum and maximum (excluding outliers)
• Outliers: points beyond a specified outlier
threshold, plotted individually
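The quantities a boxplot draws can be computed without any plotting library. The sketch below is an illustration under stated assumptions: `boxplot_stats` is a hypothetical helper, the outlier threshold is the 1.5 × IQR rule, and quartiles use the `"inclusive"` convention of `statistics.quantiles`, so a plotting package with a different convention may draw a slightly different box.

```python
from statistics import quantiles

def boxplot_stats(values, k=1.5):
    """Box edges, median, whisker ends, and outliers for one data set."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Whiskers stop at the most extreme values that are not outliers.
    inside = [v for v in ordered if lower <= v <= upper]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": [v for v in ordered if v < lower or v > upper],
    }

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
stats = boxplot_stats(salaries)
```

For the salary data the upper whisker ends at 70 rather than 110, because 110 is plotted individually as an outlier.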
Histogram
• Histograms (or frequency histograms) are at least a
century old and are widely used.
• The height of each bar indicates the frequency (i.e.,
count) of the values that fall within the range of the bar.
When the attribute is nominal, with one bar per distinct
value, the resulting graph is more commonly known as a
bar chart.
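The counts behind a histogram can be computed by hand. The sketch below assumes equal-width bins, which is only one of several possible binning schemes; the function name `histogram_counts`, the choice of four bins, and the use of the salary values as sample data are illustrative.

```python
def histogram_counts(values, num_bins):
    """Equal-width bin edges and the number of values in each bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin, not to a bin past the end.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
edges, counts = histogram_counts(salaries, 4)
# edges  -> [30.0, 50.0, 70.0, 90.0, 110.0]
# counts -> [3, 6, 2, 1]
```

Each entry of `counts` is the height of one bar, and the counts always sum to the number of data values.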
Scatter Plot
• Provides a first look at bivariate data to see clusters of
points, outliers, etc
• Each pair of values is treated as a pair of coordinates
and plotted as points in the plane
Scatter Plot – Correlation
[Figure: three scatter plots, labeled Positively Correlated, Negatively Correlated, and No Correlation]
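One common way to put a number on the three patterns is Pearson's correlation coefficient r, which ranges from −1 to +1. The slides do not name a specific coefficient, so this choice, the function name `pearson_r`, and the small made-up data sets are all assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])   # y rises with x: r near +1
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])   # y falls as x rises: r near -1
r_none = pearson_r(xs, [4, 1, 0, 1, 4])        # symmetric U shape: r near 0
```

The last case is a reminder that r measures only linear association: the U-shaped data have a clear pattern yet a coefficient of about zero.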
Data Visualization
• Data visualization
• aims to communicate data clearly and effectively
through graphical representation.
• Data visualization has been
• used extensively in many applications—for example, at
work for reporting, managing business operations, and
tracking progress of tasks.
• used to discover data relationships that are otherwise not
easily observable by looking at the raw data.
• Provide a visual proof of computer representations derived
You are welcome