CS822 DataMining Week2

The document provides an overview of data mining concepts, focusing on data objects, attributes, and their types. It categorizes attributes into nominal, binary, ordinal, and numeric types, and discusses statistical descriptions of data, including measures of central tendency and dispersion. Additionally, it covers visualization techniques such as boxplots, histograms, and scatter plots to effectively communicate data insights.

Uploaded by

zainab zahid

1

CS822
Data
Mining
Instructor: Dr. Muhammad Tahir

2
Data Objects and Attribute Types
• Data sets are made up of data objects.
• A data object represents an entity—
• in a sales database, the objects may be customers, store
items, and sales;
• in a medical database, the objects may be patients;
• in a university database, the objects may be students,
professors, and courses.
• Data objects are typically described or represented by
attributes.
• Data objects can also be referred to as samples, examples,
instances, data points, or objects.
3
Attributes
• An attribute is a data field, representing a characteristic or
feature of a data object.
• The nouns attribute, dimension, feature, and variable are
often used interchangeably in the literature.
• The term dimension is commonly used in data
warehousing.
• Machine learning literature tends to use the term
feature.
• Statisticians prefer the term variable.
• Data mining and database professionals commonly use
the term attribute.
4
Attribute Types
• The type of an attribute is determined by the set of
possible values the attribute can have.
• These are the four types:
1) Nominal Attributes
2) Binary Attributes
3) Ordinal Attributes
4) Numeric Attributes
• Interval-Scaled Attributes
• Ratio-Scaled Attributes

5
Attribute Types
• The type of an attribute is determined by the set of possible values
the attribute can have. These are the four types:
1) Nominal Attributes: Each value represents some kind of
category, code, or state, and so nominal attributes are also
referred to as categorical. The values do not have any
meaningful order. E.g. Hair color, marital status, occupation, ID
numbers, zip codes
2) Binary Attributes: A binary attribute is a nominal attribute
with only two categories or states: 0 or 1, where 0 typically
means that the attribute is absent, and 1 means that it is
present. Binary attributes are referred to as Boolean if the two
states correspond to true and false. E.g. Medical test result or
gender
6
Attribute Types
3) Ordinal Attributes: an ordinal attribute has possible values
that have a meaningful order or ranking among them,
but the magnitude between successive values is not
known. E.g.
• Size = {small, medium, large}
• Grades = {A, B, C, D, F}
• Army rankings … Etc
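The point that ordinal values can be ordered but not meaningfully subtracted can be sketched in a few lines of Python. The rank table `SIZE_RANK` and both helper functions are hypothetical names introduced for this illustration only; the ranks encode order, not distance.

```python
# Hypothetical ordinal scale: the integers record order only, not magnitude.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}


def sort_by_size(values):
    """Sort ordinal labels by their position on the scale."""
    return sorted(values, key=SIZE_RANK.__getitem__)


def ordinal_median(values):
    """Return the lower-middle label after ordering.

    Ordering and picking a middle element are valid for ordinal data;
    averaging the ranks would assume equal spacing, which is not known.
    """
    ordered = sort_by_size(values)
    return ordered[(len(ordered) - 1) // 2]


orders = ["large", "small", "medium", "small", "large"]
print(sort_by_size(orders))    # ['small', 'small', 'medium', 'large', 'large']
print(ordinal_median(orders))  # medium
```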

7
Attribute Types
4) Numeric Attributes: a numeric attribute is quantitative; that is, it is a
measurable quantity, represented in integer or real
values. Numeric attributes can be interval-scaled or
ratio-scaled.
• Interval-Scaled Attributes are measured on a
scale of equal-size units and have no true zero-point. E.g.
temperature in °C or °F, calendar dates.
• Ratio-Scaled Attributes are numeric attributes with
an inherent zero-point. E.g., area, weight, height,
length, counts, monetary quantities. The ratio between
two data objects' attribute values can be calculated.
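A small Python sketch may help show why ratios are only meaningful on a ratio scale. The conversion helpers `c_to_f` and `m_to_cm` are illustrative names, not part of the lecture. The ratio of two temperatures changes when the unit changes (no true zero), while the ratio of two lengths does not.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def m_to_cm(metres):
    """Convert metres to centimetres."""
    return metres * 100


# Interval-scaled: the "ratio" depends on the unit, so it carries no meaning.
ratio_c = 20 / 10                  # 2.0
ratio_f = c_to_f(20) / c_to_f(10)  # 68 / 50 = 1.36

# Ratio-scaled: the ratio is the same in every unit.
ratio_m = 4 / 2                    # 2.0
ratio_cm = m_to_cm(4) / m_to_cm(2)  # 2.0

print(ratio_c, ratio_f)   # differ
print(ratio_m, ratio_cm)  # agree
```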
8
Discrete vs. Continuous Attributes
• Discrete Attribute (Nominal, Binary and Ordinal)
• Has only a finite or countably infinite set of values E.g., zip
codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Continuous or Numeric Attribute (Ratio and Interval)
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented
using a finite number of digits
• Continuous attributes are typically represented as floating-
point variables
9
Basic Statistical Descriptions of
Data

10
Basic Statistical Descriptions of Data
• Motivation
• To better understand the data
• How?
• by measuring data’s central tendency and distribution
(variation and spread).
• Measuring the Central Tendency characteristics
• Mean, Median, Mode and Midrange.
• Measuring the Data dispersion (or distribution) characteristics
• Range, max, min, quantiles, outliers, variance and standard
deviation.

11
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• The mean is the average of all the data values and marks the
center of the data. It is calculated by dividing the sum of all
values by the sample size:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Trimmed mean
• The mean can also be calculated on trimmed data, that is,
after removing the extreme values.
• Weighted average or weighted arithmetic mean
• Differs from the regular mean by giving each value $x_i$ a
weight $w_i$ that reflects its significance or importance:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
12
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Median
• After sorting the data, the median is the middle value
if the number of values is odd; otherwise it is the
average of the two middle values.
• Sorting can be computationally expensive. However,
the median can be approximated without sorting.
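The odd/even rule above can be sketched as a short function. This is the exact, sort-based version only; the approximation without sorting mentioned on the slide is not shown. The name `median` is chosen for the example (Python's standard library offers `statistics.median` with the same behaviour).

```python
def median(values):
    """Exact median: middle value, or average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average of two


print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```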

13
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Mode is the value that occurs most frequently in the data.
• Sometimes multiple values share the same
highest frequency (unimodal or multimodal, e.g.
bimodal, trimodal).
• Only one value with the highest frequency =
unimodal
• Two values tied for the highest
frequency = bimodal
• Three values tied for the highest
frequency = trimodal
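One way to find every mode and name the modality is sketched below. The helpers `modes` and `modality` are illustrative names. The sketch assumes non-empty data and does not treat the special case where every value occurs exactly once.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def modality(values):
    """Name the distribution by how many modes it has."""
    names = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return names.get(len(modes(values)), "multimodal")


print(modes([1, 2, 2, 3]), modality([1, 2, 2, 3]))  # [2] unimodal
print(modes([30, 52, 52, 70, 70]))                  # [52, 70]
print(modality([30, 52, 52, 70, 70]))               # bimodal
```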

14
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics
• Midrange is another measure of central tendency. It is
simply the average of the min and max values of the
data.
• This is easy to compute using the SQL aggregate
functions max() and min().
• When data have a symmetric, unimodal distribution, all central
tendency measures return the same center value.
• But data usually do not!
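The mean, trimmed mean, weighted mean, and midrange described on the preceding slides can be sketched as follows. The function names are illustrative. Trimming the same number `k` of values from each end is an assumption of this sketch, and the weights in the usage lines are made up for the illustration. The `salaries` list reuses the salary values (in thousands of dollars) from the lecture's example.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values (assumed scheme)."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k removes all the data")
    return mean(ordered[k:len(ordered) - k])


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


print(mean(salaries))                    # 58.0
print(trimmed_mean(salaries, 1))         # 55.6 (30 and 110 dropped)
print(weighted_mean([80, 90], [1, 3]))   # 87.5
print(midrange(salaries))                # 70.0

# Symmetric data: mean and midrange coincide.
print(mean([1, 2, 3, 4, 5]), midrange([1, 2, 3, 4, 5]))  # 3.0 3.0
```

The single large salary (110) pulls the midrange to 70, well above the mean of 58, which illustrates how sensitive the midrange is to extreme values.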

15
Measuring the Central Tendency
characteristics – Example
• Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60,
63, 70, 70, 110.
• Mean
• $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{696}{12} = 58$
• Trimmed mean
• In this example, remove 30, 36 and 110. Then, recalculate.
• Median
• The two middle values are 52 and 56, so the median is (52 + 56) / 2 = 54.
• Mode
• 52 and 70 are the modes (bimodal)
• Midrange
• (30 + 110) / 2 = 70
16
Basic Statistical Descriptions of Data
Measuring the Central Tendency characteristics

[Figure: symmetric, positively skewed, and negatively skewed distributions]

17
Basic Statistical Descriptions of Data
Measuring the Dispersion (distribution) of Data
• Quartiles
• Quartiles divide a dataset into four equal parts.
• They help understand the spread and distribution of data.
• There are 3 quartiles:
• Q1 (First Quartile): 25% of data lies below Q1.
• Q2 (Second Quartile): This is the Median — 50% of data lies below Q2.
• Q3 (Third Quartile): 75% of data lies below Q3.
• Why do we use quartiles?
• To measure spread and central tendency.
• To identify where a data point falls in the dataset.
• Useful in creating boxplots and detecting outliers.
18
Basic Statistical Descriptions of Data
Measuring the Dispersion (distribution) of Data
• Outliers:
• Outliers are data points that lie far away from most of the data.
• They are typically identified using the Interquartile Range (IQR).
• Formula to detect outliers:
• Lower Bound = Q1 − 1.5 × IQR
• Upper Bound = Q3 + 1.5 × IQR
• (where IQR = Q3 − Q1)
• Values below the lower bound or above the upper bound are flagged as outliers.
• Why do we use outlier detection?
• Outliers can skew data analysis and affect measures like the mean
and standard deviation.
• Detecting and handling outliers is crucial for accurate analysis
and reliable models.
19
Basic Statistical Descriptions of Data
Measuring the Dispersion (distribution) of Data
• Boxplots:
• A boxplot (or box-and-whisker plot) is a visual summary of data
distribution.
• It displays:
• Minimum value (excluding outliers)
• Q1 (First Quartile)
• Median (Q2)
• Q3 (Third Quartile)
• Maximum value (excluding outliers)
• Outliers (marked as points outside the whiskers)
• Why do we use boxplots?
• Provides a clear visual summary of data spread and central tendency.
• Helps compare distributions between datasets.
• Makes it easy to spot outliers.
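The quartiles, the 1.5 × IQR bounds, and the values a boxplot displays can be computed together, as in the sketch below. The function name `iqr_summary` and its dictionary keys are illustrative. Quartile values depend on the interpolation convention; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, so other tools or textbooks may report slightly different Q1 and Q3 for the same data.

```python
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def iqr_summary(values):
    """Quartiles, IQR fences, whisker ends, and outliers for a boxplot."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    inside = [v for v in values if lower <= v <= upper]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_bound": lower, "upper_bound": upper,
        "whisker_min": min(inside),   # minimum excluding outliers
        "whisker_max": max(inside),   # maximum excluding outliers
        "outliers": sorted(v for v in values if v < lower or v > upper),
    }


summary = iqr_summary(salaries)
for name, value in summary.items():
    print(f"{name}: {value}")
# With the inclusive method: q1 = 49.25, median = 54.0, q3 = 64.75,
# bounds = 26.0 and 88.0, so 110 is the only outlier.
```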
20
Basic Statistical Descriptions of Data
Measuring the Dispersion (distribution) of Data
• Variance and standard deviation (sample: s,
population: σ)
• Variance (algebraic, scalable computation):
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
• Standard deviation s (or σ) is the square root of
variance $s^2$ (or $\sigma^2$)
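Both forms of the population variance can be checked against each other in a short sketch. The function names are illustrative, and the data are the salary values from the lecture's example. The second form needs only running sums, which is why it scales to a single pass over the data, although with floating-point inputs it can lose precision when the mean is large relative to the spread.

```python
import math

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def population_variance(values):
    """Definition: mean squared deviation from the mean (two passes)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def population_variance_one_pass(values):
    """Algebraic form: mean of the squares minus the square of the mean."""
    n = 0
    total = 0
    total_sq = 0
    for x in values:          # a single scan keeps only three running values
        n += 1
        total += x
        total_sq += x * x
    mu = total / n
    return total_sq / n - mu * mu


var = population_variance(salaries)
print(round(var, 2))                                      # 379.17
print(round(population_variance_one_pass(salaries), 2))   # 379.17
print(round(math.sqrt(var), 2))                           # 19.47
```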

21
Dispersion (distribution) of Data

• Popular plots for visualizing data distribution
• Boxplot: graphic display of the five-number summary
(min (excluding outliers), Q1, median, Q3, max
(excluding outliers))
• Histogram: the x-axis shows values, the y-axis shows
frequencies
• Scatter plot: each pair of values is treated as a pair of
coordinates and plotted as a point in the plane
22
Boxplot
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and
third quartiles, i.e., the height of the box
is IQR
• The median is marked by a line within the
box
• Whiskers: two lines outside the box are
extended to Minimum and Maximum
• Outliers: points beyond a specified outlier
threshold, plotted individually
23
Histogram
• Histograms (or frequency histograms) are at least a
century old and are widely used.
• The height of each bar indicates the frequency (i.e.,
count) of the values that fall within the range of the bar. When the
attribute is nominal, the resulting graph is more commonly known as a bar chart.
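The counting behind a histogram can be sketched without any plotting library. The function `histogram` and the bin width of 20 are assumptions made for this illustration, and the sketch handles integer values with equal-width bins only. The data are the salary values from the lecture's example.

```python
from collections import Counter

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def histogram(values, bin_width):
    """Map each bin's lower edge to the count of values in [edge, edge + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    first, last = min(counts), max(counts)
    # Include empty bins so gaps in the data stay visible.
    return {edge: counts.get(edge, 0)
            for edge in range(first, last + bin_width, bin_width)}


bins = histogram(salaries, 20)
print(bins)  # {20: 2, 40: 5, 60: 4, 80: 0, 100: 1}

# Text rendering: one '#' per value, so bar length equals frequency.
for edge, count in bins.items():
    print(f"{edge:>3}-{edge + 19:<3} {'#' * count}")
```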

24
Scatter Plot
• Provides a first look at bivariate data to see clusters of
points, outliers, etc
• Each pair of values is treated as a pair of coordinates
and plotted as points in the plane
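What a scatter plot shows visually can also be summarized numerically. The sketch below computes Pearson's correlation coefficient for a set of coordinate pairs; the coefficient is not defined in these slides and is introduced here only as one common way to quantify a linear relationship. The function name `pearson_r` and the sample data are made up for the illustration, and the sketch assumes neither attribute is constant.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: +1 rising line, -1 falling line, near 0 no linear trend."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # points climb from left to right
falling = [10, 8, 6, 4, 2]   # points descend from left to right

print(pearson_r(xs, rising))   # close to 1
print(pearson_r(xs, falling))  # close to -1
```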

25
Scatter Plot – Correlation

[Figure: scatter plots showing positively correlated, negatively correlated, and uncorrelated data]
26
Data Visualization
• Data visualization
• aims to communicate data clearly and effectively
through graphical representation.
• Data visualization has been
• used extensively in many applications—for example, at
work for reporting, managing business operations, and
tracking progress of tasks.
• used to discover data relationships that are otherwise not
easily observable by looking at the raw data.
• Provide a visual proof of computer representations derived

27
You are welcome
