02data Part1

Chapter 2 of CS 412 discusses various types of data sets, including record, graph, ordered, and spatial data, and their characteristics. It covers data objects and attributes, basic statistical descriptions, hypothesis testing, and data visualization techniques. The chapter emphasizes the importance of understanding data similarity and correlation in data mining.

Uploaded by

Hansen Y

CS 412 Intro. to Data Mining
Chapter 2. Data and Measurements
Arindam Banerjee, Computer Science, UIUC, Fall 2024

Chapter 2. Getting to Know Your Data
❑ Data Objects and Attribute Types
❑ Basic Statistical Descriptions of Data
❑ Hypothesis Testing
❑ Data Visualization
❑ Measuring Data Similarity and Correlation
❑ Summary

Types of Data Sets: (1) Record Data
❑ Relational records
❑ Relational tables, highly structured
❑ Data matrix, e.g., numerical matrix, crosstabs

❑ Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

❑ Document data: Term-frequency vector (matrix) of text documents

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

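As a rough illustration of how one row of a term-frequency matrix could be produced, the sketch below counts vocabulary terms in a whitespace-tokenized text. The vocabulary is the set of ten terms labelling the slide's matrix; the function name `term_frequency_vector` and the sample document are hypothetical, and real pipelines usually add tokenization and normalization steps that are not shown here.

```python
from collections import Counter

# Vocabulary: the ten terms that label the columns of the slide's matrix.
VOCAB = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]


def term_frequency_vector(text, vocab=VOCAB):
    """Count how often each vocabulary term occurs in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocab]


# Hypothetical document, invented for illustration only.
doc = "Team play ball play score game lost"
tf = term_frequency_vector(doc)
print(tf)
```

Each document becomes one fixed-length vector, so a collection of documents stacks into a (mostly sparse) data matrix.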
Types of Data Sets: (2) Graphs and Networks
❑ Transportation network

❑ World Wide Web

❑ Molecular Structures

❑ Social or information networks


Types of Data Sets: (3) Ordered Data
❑ Video data: sequence of images

❑ Temporal data: time-series

❑ Sequential Data: transaction sequences

❑ Genetic sequence data


Types of Data Sets: (4) Spatial, image, and multimedia Data

❑ Spatial data

❑ Image data
(“TorchGeo”)

❑ Video data

Important Characteristics of Structured Data
❑ Dimensionality
❑ Curse of dimensionality
❑ Sparsity
❑ Only presence counts
❑ Resolution
❑ Patterns depend on the scale
❑ Distribution
❑ Centrality and dispersion

Data Objects
❑ Data sets are made up of data objects
❑ A data object represents an entity
❑ Examples:
❑ sales database: customers, store items, sales
❑ medical database: patients, treatments
❑ university database: students, professors, courses
❑ Also called samples, examples, instances, data points, objects, tuples
❑ Data objects are described by attributes
❑ Database rows → data objects; columns → attributes

Attributes
❑ Attribute (or dimensions, features, variables)
❑ A data field, representing a characteristic or feature of a data object.
❑ E.g., customer_ID, name, address
❑ Types:
❑ Nominal (e.g., red, blue)
❑ Binary (e.g., {true, false})
❑ Ordinal (e.g., {freshman, sophomore, junior, senior})
❑ Numeric: quantitative
❑ Interval-scaled: No true zero, can compute differences, means, etc.
❑ Examples: Temp in °C or °F, calendar year
❑ Ratio-scaled: True zero, ratio scaled, e.g., 10 is twice as much as 5
❑ Examples: Temp in K, years of experience, number of words
Attribute Types
❑ Nominal: categories, states, or “names of things”
❑ Hair_color = {auburn, black, blond, brown, grey, red, white}
❑ marital status, occupation, ID numbers, zip codes (though …)
❑ Binary
❑ Nominal attribute with only 2 states (0 and 1)
❑ Symmetric binary: both outcomes equally important
❑ e.g., tree is evergreen? loses leaves in winter or not
❑ Asymmetric binary: outcomes not equally important
❑ e.g., medical test (positive vs. negative)
❑ Convention: assign 1 to most important outcome (e.g., Covid positive)
❑ Ordinal
❑ Values have a meaningful order (ranking) but magnitude between successive
values is not known
❑ Size = {small, medium, large}, grades, rankings
Numeric Attribute Types
❑ Quantity (integer or real-valued)

❑ Interval

❑ Measured on a scale of equal-sized units


❑ Values have order
❑ E.g., temperature in °C or °F, calendar dates
❑ No true zero-point
❑ Ratio
❑ Inherent zero-point
❑ We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
❑ e.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
❑ Discrete Attribute
❑ Has only a finite or countably infinite set of values
❑ E.g., zip codes, profession, or the set of words in a collection of documents
❑ Sometimes, represented as integer variables
❑ Note: Binary attributes are a special case of discrete attributes
❑ Continuous Attribute
❑ Has real numbers as attribute values
❑ E.g., temperature, height, or weight
❑ Practically, real values can only be measured and represented using a finite
number of digits
❑ Continuous attributes are typically represented as floating-point variables
Basic Statistical Description of Data
❑ Central Tendency Measures
❑ Mean
❑ Median
❑ Estimating Median by Interpolation
❑ Mode

Measuring the Central Tendency: (1) Mean
❑ Mean (sample vs. population):
Note: n is sample size and N is population size.
❑ Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
❑ Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
❑ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
❑ Trimmed mean:
❑ Chopping extreme values (e.g., Olympics gymnastics score computation)
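
The three variants of the mean on this slide can be sketched as follows. This is a minimal illustration, not the course's reference code: the function names, the example values, and the choice to trim a fixed count `k` from each end (rather than a percentage) are all assumptions.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k trims away every value")
    kept = sorted(values)[k:len(values) - k]
    return mean(kept)


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]
print(mean(data))                            # pulled up by the outlier
print(trimmed_mean(data, 1))                 # outlier (and minimum) chopped
print(weighted_mean(data, [1, 1, 1, 1, 0]))  # outlier given zero weight
```

The plain mean is sensitive to the single extreme value, which is the motivation for the trimmed variant.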

Measuring the Central Tendency: (2) Median
❑ Middle value of a set of ordered values.
❑ For an even number of values: average of the middle two values
❑ Separates the higher half of a data set from the lower half
❑ Extends to ordinal data
❑ Computation is “somewhat expensive” for large data sets

Figure source: https://mammothmemory.net/maths/statistics-and-probability/averages/median.html
Estimating Median by Interpolation
❑ Median can be estimated by interpolation (for grouped data)
❑ Bins: 1, 2, …, M with frequencies $f_1, f_2, \dots, f_M$
❑ Compute the cumulative frequencies at the end of each bin: $F_m = \sum_{l=1}^{m} f_l$
❑ Example (n = 403), cumulative frequencies: 36, 90, 159, 241, 296, 339, 364, 386, 403
❑ Median falls in the m-th bin with interval $[L_m, L_{m+1}]$, where $F_{m-1} \le \frac{n}{2} \le F_m$
❑ Approximate median:
$\text{median} \approx L_m + \left(\frac{\frac{n}{2} - F_{m-1}}{f_m}\right) \times (L_{m+1} - L_m)$
❑ $L_m$: low interval limit
❑ $F_{m-1}$: sum before the median interval
❑ $L_{m+1} - L_m$: interval width
❑ Example: $25 + \frac{202 - 159}{82} \times (30 - 25) \approx 27.62$
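
The interpolation formula can be checked numerically with a short sketch. The bin frequencies below are the differences of the slide's cumulative frequencies; the bin edges are an assumption (equal-width bins of 5 starting at 10), since the slide only shows that the median bin is [25, 30]. The slide's worked example plugs in 202 for n/2 (the exact value is 201.5), so the hypothetical `half` parameter lets both be reproduced.

```python
def grouped_median(edges, freqs, half=None):
    """Estimate the median of binned data by linear interpolation.

    edges: bin boundaries L_1..L_{M+1}; freqs: bin frequencies f_1..f_M.
    half:  value to use for n/2 (defaults to the exact n/2).
    """
    n = sum(freqs)
    target = n / 2 if half is None else half
    cumulative = 0  # F_{m-1}: sum of frequencies before the current bin
    for m, f in enumerate(freqs):
        if f > 0 and cumulative + f >= target:
            low, high = edges[m], edges[m + 1]
            return low + (target - cumulative) / f * (high - low)
        cumulative += f
    raise ValueError("no data")


# Frequencies recovered from the cumulative counts 36, 90, 159, ..., 403.
freqs = [36, 54, 69, 82, 55, 43, 25, 22, 17]
# ASSUMED edges: width-5 bins starting at 10, so that bin 4 is [25, 30].
edges = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]

slide_value = grouped_median(edges, freqs, half=202)  # as on the slide
exact_value = grouped_median(edges, freqs)            # uses n/2 = 201.5
print(round(slide_value, 2), round(exact_value, 2))
```

Only the median bin's own edges affect the result, so the assumed edges of the other bins do not change the estimate.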
Measuring the Central Tendency: (3) Mode
❑ Mode: Value that occurs most frequently in the data

❑ Unimodal
❑ Empirical formula: mean − mode = 3 × (mean − median)

❑ Multi-modal
❑ Bimodal

❑ Trimodal

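A small sketch of finding the mode(s) of a data set and naming its modality, using Python's standard `statistics.multimode`. The helper name `modality` and the sample values are hypothetical.

```python
from statistics import multimode


def modality(values):
    """Return the list of modes and a label for how many there are."""
    modes = multimode(values)  # all values tied for the highest count
    names = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, names.get(len(modes), "multi-modal")


# Hypothetical data: 2 and 3 both occur twice.
print(modality([1, 2, 2, 3, 3, 4]))
print(modality([5, 1, 5]))
```

Note that for continuous data the mode is usually read off a histogram or density estimate rather than from exact repeated values.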
Symmetric, Positively and Negatively Skewed Data

[Figure: three density curves]
❑ Positively skewed (RIGHT tail is extended): mode < median < mean
❑ Symmetric: mean = median = mode
❑ Negatively skewed (LEFT tail is extended): mean < median < mode
Properties of Normal Distribution Curve
[Figure: normal distribution curve. The center of the curve represents central tendency; the horizontal extent represents data dispersion (spread).]

Variance and Standard Deviation
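❑ Variance and standard deviation (sample, population)
❑ Variance: (algebraic, scalable computation)
❑ Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
❑ Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
❑ Q: Can you compute it incrementally and efficiently?
❑ Note: The subtle difference of formulae for sample vs. population
• n : the size of the sample
• N : the size of the population
❑ Standard deviation is the square root of variance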
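
One possible answer to the slide's question about incremental computation is Welford's online algorithm, sketched below. It is offered as an illustration, not necessarily the approach the lecture has in mind; the class name `RunningVariance` and the example values are hypothetical.

```python
class RunningVariance:
    """Welford's online algorithm: one pass, O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def population_variance(self):
        return self.m2 / self.count        # divide by N

    def sample_variance(self):
        return self.m2 / (self.count - 1)  # divide by n - 1


rv = RunningVariance()
for value in [1, 6, 10, 15, 18]:  # example values
    rv.update(value)

print(rv.mean, rv.sample_variance(), rv.population_variance())
```

Each new observation updates the state in constant time, so the variance of a stream can be tracked without storing the data. The only difference between the sample and population versions is the final divisor.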


Covariance of Numeric Data
❑ Two numeric attributes A, B, and n observations: $\{(a_1, b_1), \dots, (a_n, b_n)\}$
❑ Expected values:

❑ Covariance

❑ Correlation coefficient

Covariance vs. Correlation
❑ Covariance is sensitive to data scale
❑ May not provide insights about the strength of the relationship between attributes

Data set 1             Data set 2 (every value doubled)
 A    B                 A    B
 1    3                 2    6
 6    5                12   10
10   14                20   28
15   19                30   38
18   23                36   46

Cov(A, B) = 58         Cov(A, B) = 232
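
The two covariance values can be reproduced with the sketch below. Working the numbers shows that 58 and 232 correspond to the n − 1 (sample) normalization; dividing by n instead would give 46.4 and 185.6. The function names are hypothetical.

```python
from math import sqrt


def sample_covariance(a, b):
    """Covariance with the n - 1 normalization."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def correlation(a, b):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return sample_covariance(a, b) / sqrt(
        sample_covariance(a, a) * sample_covariance(b, b)
    )


A = [1, 6, 10, 15, 18]
B = [3, 5, 14, 19, 23]
A2 = [2 * x for x in A]  # second table: every value doubled
B2 = [2 * y for y in B]

cov1, cov2 = sample_covariance(A, B), sample_covariance(A2, B2)
r1, r2 = correlation(A, B), correlation(A2, B2)
print(cov1, cov2)  # covariance quadruples when both attributes are doubled
print(r1, r2)      # correlation is unchanged by the rescaling
```

Doubling both attributes multiplies the covariance by four while the correlation stays at about 0.98, which is why correlation is the better indicator of the strength of the relationship.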


Correlation Coefficient of Numeric Data
❑ Correlation Coefficient: $r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$

❑ $r_{A,B}$ lies in [-1,+1]

❑ Positive correlation: $r_{A,B} > 0$, typically increase/decrease together
❑ Negative correlation: $r_{A,B} < 0$, typically one increases when the other decreases
❑ Uncorrelated: $r_{A,B} = 0$, no linear correlation
❑ Correlation does not mean causation

Figure source: https://www.simplypsychology.org/correlation.html
Example: Is My Coin Biased?
❑ Tossed coin 100 times: 54 Heads, 46 Tails
❑Is my coin biased?

❑ Measure deviation from expected behavior

❑ Can we say “no” (not biased) with probability at least 95% ?


❖… with probability at least 99% ?

What is Statistical Hypothesis Testing?
❑ Process of deciding whether observed outcomes are
❖ due to chance, or
❖ represent an actual effect

❑Example: Is a new drug effective in treating a certain disease?

❑ The goal of a hypothesis test is to make a decision about the population based on a
sample of data using two complementary hypotheses:
❑ Null Hypothesis H0 (e.g., coin is unbiased, drug is not effective, etc.)
❑ Alternative Hypothesis Ha or H1 (e.g., coin is biased, drug is effective, etc.)

Null Hypothesis
❑ Assumes no relationship between two variables (no effect; the default assumption)

Alternative Hypothesis
❑ Contradicts the null hypothesis

❑ States something significant, not the default assumption


❖ Example: coin is biased, drug is effective, etc

Hypothesis Testing Procedure
❑ Hypothesis testing, using a sample of n observations

❑Consider a null hypothesis and an alternative hypothesis

❑Consider a test statistic and set a significance level

❑Calculate the test statistic from observations

❑Calculate the p-value, reject or fail to reject the null hypothesis

Type I and Type II Errors
❑ Type I error: Null hypothesis was correct, but was rejected
❑ Coin was unbiased, but was marked as biased
❑ Drug was ineffective, but was marked as effective
❑ Bound Type I error probability, denoted by ⍺
❑ ⍺ = 0.05 means that, when the null hypothesis is true, it is wrongly rejected with probability at most 0.05

❑ Type II error: Null hypothesis was incorrect, but was not rejected


❑ Coin was biased, but was marked as unbiased
❑ Drug was effective, but was marked as ineffective

Example: Is My Coin Biased?
❑ Tossed coin 100 times: 54 Heads, 46 Tails
❑Is my coin biased?

❑ Measure deviation from expected behavior


❑ Formulate the problem (deviation) in terms of a test statistic
❑ Test statistic follows a known probability distribution (chi-squared, $\chi^2$), under suitable assumptions

Example: Is My Coin Biased?
❑ Tossed a coin 100 times: 54 Heads, 46 Tails
❑ Is my coin biased?
❑ Hypothesis testing, using a sample of n=100 observations
❑ Null hypothesis: Coin is unbiased, expected behavior 50 Heads, 50 Tails
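❑ Test statistic: measures deviation from expected behavior
$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ and $E_i$ are the observed and expected counts in category $i$
❑ Test statistic follows the chi-square distribution
❑ Degrees of freedom k = 1
❑ Significance level ⍺: bound on the probability of rejecting the null hypothesis when the deviation is only due to random chance
❑ ⍺=0.05 or ⍺=0.01 are typical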
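
The test statistic for the coin example can be computed by hand as a sanity check. The sketch below assumes the usual Pearson form, the sum over categories of (observed − expected)² / expected, which reproduces the value 0.64 used later in the slides. The function name is hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [54, 46]  # heads, tails in 100 tosses
expected = [50, 50]  # what an unbiased coin would give on average

stat = chi_square_statistic(observed, expected)  # 16/50 + 16/50
dof = len(observed) - 1                          # two categories, fixed total
print(stat, dof)
```

Larger deviations from the expected counts produce a larger statistic, which in turn corresponds to a smaller p-value.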

$\chi^2$ Distribution with $k$ Degrees of Freedom
$Q \sim \chi^2_k \;\equiv\; Q = \sum_{i=1}^{k} g_i^2, \quad g_i \sim N(0,1)$

[Figure: $\chi^2$ density function]
Degrees of Freedom (DF)
❑ Number of independent values or categories in the data that are free to vary

❑ Tossing a coin
❑ Two categories, with one degree of freedom, since total number of samples is known/fixed
❑ In general, for k categories, DF = k − 1

❑ Contingency table
❑ DF = (#rows − 1) × (#cols − 1), since row sums and column sums are known

Preferred Commute Method based on Location (sample size = 280)

          Urban  Suburban  Rural  Total
Car          50        40     30    120
Bus          40        60     20    120
Bicycle      20        15      5     40
Total       110       115     55    280
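
To see why a 3 × 3 contingency table with known row and column sums has only (3 − 1) × (3 − 1) = 4 degrees of freedom, the sketch below rebuilds the full commute table from just four "free" cells plus the margins. The function name `complete_table` is hypothetical.

```python
def complete_table(free_cells, row_totals, col_totals):
    """Fill in a contingency table from its top-left block and its margins."""
    # The last entry of each free row is forced by that row's total.
    rows = [r + [row_totals[i] - sum(r)] for i, r in enumerate(free_cells)]
    # The entire last row is forced by the column totals.
    last_row = [col_totals[j] - sum(r[j] for r in rows)
                for j in range(len(col_totals))]
    return rows + [last_row]


row_totals = [120, 120, 40]   # Car, Bus, Bicycle
col_totals = [110, 115, 55]   # Urban, Suburban, Rural
free_cells = [[50, 40],       # Car:  Urban, Suburban
              [40, 60]]       # Bus:  Urban, Suburban

table = complete_table(free_cells, row_totals, col_totals)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)
print(table, dof)
```

Once the four free cells are chosen, the remaining five cells are determined, so only four values are free to vary.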

$\chi^2$ Distribution with $k$ Degrees of Freedom
$Q \sim \chi^2_k \;\equiv\; Q = \sum_{i=1}^{k} g_i^2, \quad g_i \sim N(0,1)$
❑ Independently sample k values from the standard normal distribution N(0,1), square them, and add them together
❑ Small k: fast decay
❑ As k increases, peak of distribution shifts towards the right

[Figure: $\chi^2$ density function]
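
The construction of a chi-square variable from squared standard normals can be simulated directly. The sketch below uses only the standard library; the function names, the seed, and the number of draws are arbitrary choices for illustration. It relies on the known fact that a chi-square distribution with k degrees of freedom has mean k.

```python
import random


def sample_chi_square(k, rng):
    """One draw of Q = g_1^2 + ... + g_k^2 with each g_i ~ N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


def empirical_mean(k, draws, seed):
    """Average of many simulated chi-square draws with k degrees of freedom."""
    rng = random.Random(seed)
    return sum(sample_chi_square(k, rng) for _ in range(draws)) / draws


for k in (1, 3, 6):
    # The empirical mean should land close to k.
    print(k, round(empirical_mean(k, 20000, seed=0), 2))
```

The growth of the mean with k matches the slide's observation that the bulk of the distribution moves to the right as k increases.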
Example: Is My Coin Biased?
❑ Tossed a coin 100 times: 54 Heads, 46 Tails
❑ Is my coin biased
❑ Hypothesis testing, using a sample of n observations
❑ Null hypothesis: Coin is unbiased, expected behavior 50 Heads, 50 Tails
❑ Test statistic: measures deviation from expected behavior

❑ Test statistic follows the chi-square distribution


❑ Degrees of freedom k = 1
❑ Significance level: ⍺=0.05

How to Find p-value?
❑ Statistical test quantifies the difference between the observed frequencies and the expected frequencies

❑ p-value represents the probability of obtaining a chi-square statistic at least as extreme as the one computed from the sample data, assuming that the null hypothesis is true.

How to Find p-value ($\chi^2$ Density Function)
❑ Compute the probability from the $\chi^2$ distribution that the sample test statistic value is 0.64 or more by random chance (under the null hypothesis)
• Degrees of freedom = 1 and ($\alpha$ = 0.05)
❑ The p-value is the area under the chi-square curve (k = 1) to the right of the calculated chi-square statistic
❑ The density function is not the easiest choice!

[Figure: $\chi^2$ density function, with the test statistic 0.64 marked]
How to Find p-value? ($\chi^2$ Cumulative Distribution Function)
❑ Compute the probability from the $\chi^2$ distribution that the sample test statistic value is 0.64 or more by random chance (under the null hypothesis)
• Degrees of freedom = 1 and ($\alpha$ = 0.05)

$p = 1 - \mathbb{P}(X \le 0.64) = 1 - 0.5763 = 0.4237$

[Figure: $\chi^2$ cumulative distribution function, evaluated at 0.64]
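
For one degree of freedom the CDF has a closed form, because a chi-square variable with k = 1 is the square of a standard normal: P(Z² ≤ x) = P(|Z| ≤ √x) = erf(√(x/2)). The sketch below uses that identity with the standard library only. The function name is hypothetical, and the shortcut does not apply to other values of k, where a library routine such as `scipy.stats.chi2.sf` would be the usual choice.

```python
from math import erf, sqrt


def chi2_cdf_1dof(x):
    """CDF of the chi-square distribution with exactly 1 degree of freedom."""
    return erf(sqrt(x / 2.0)) if x > 0 else 0.0


stat = 0.64                 # test statistic from the coin example
cdf = chi2_cdf_1dof(stat)   # P(X <= 0.64)
p_value = 1.0 - cdf         # area to the right of the statistic
print(round(cdf, 4), round(p_value, 4))
```

Reading the p-value as one minus the CDF avoids integrating the density by hand.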
How to Find p-value? ($\chi^2$ p-value plot)
❑ Compute the probability from the $\chi^2$ distribution that the sample test statistic value is 0.64 or more by random chance (under the null hypothesis)
• Degrees of freedom = 1 and ($\alpha$ = 0.05)
❑ For k = 1 and a test statistic of 0.64, the p-value is 0.4237

[Figure: p-value as a function of the $\chi^2$ test statistic, for k = 1]

How to Find p-value? (Python)
❑ Compute the probability, under the $\chi^2$ distribution, that the sample test statistic value is 0.64 or more by random chance under the null hypothesis
• Degrees of freedom = 1 and ($\alpha$ = 0.05)

from scipy.stats import chisquare

# Observed counts (54 heads, 46 tails) vs. expected counts for an unbiased coin
chisquare([54, 46], f_exp=[50, 50])

# Output:
# Power_divergenceResult(statistic=0.64, pvalue=0.4237107971667936)

Example: Is My Coin Biased?
❑ Tossed coin 100 times: 54 Heads, 46 Tails
❑Is my coin biased?

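p-value (0.4237) > ⍺ (0.05)

Fail to reject the null hypothesis

❑ No evidence that the coin is biased (this does not prove that the coin is unbiased)

❖ ⍺ = 0.05 bounds the probability of a Type I error, i.e., of rejecting the null hypothesis when it is true

The observed data is not sufficiently surprising under the null hypothesis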