CMPE 442 Introduction to Machine Learning
Features
Features
Data set – a set of data objects
Data object – an entity described by a set of attributes
Feature – a data field representing a characteristic of a data object
Workhorses of ML
Mapping from instance space to the feature space
Distinguish features by:
domain types
range of permissible operations
Features
Workhorses of ML
Mapping from instance space to the feature space
Distinguish features by:
domain types
range of permissible operations
Ex: Consider two features: a person's age and a house number. While both are
numbers, house numbers are ordinal, so taking their average is meaningless.
What matters is not just the domain of a feature but also the range of
permissible operations.
Getting to Know Your Data
What are the types of features?
What kind of values does each feature have?
Which features are discrete and which are continuous valued?
How are the values distributed?
Can we spot any outliers?
…
Kinds of Feature
Numerical
Features with numerical scale
Often involve mapping to reals
Continuous
Ex: age, price, etc.
Ordinal
Features with an ordering but without scale
Some totally ordered set
Ex: set of characters, strings, house numbers, etc.
Allow mode and median as central statistics, and quantiles as dispersion statistics
Categorical
Features without ordering or scale
Allow no statistical summary except for the mode
A Boolean feature is a special case of a categorical feature
Kinds of Feature
Categorical and Ordinal features are qualitative:
Describe a property of an object without giving an actual size or quantity
Numerical features are quantitative
Calculations of Features
Aggregates or Statistics
Main categories:
Statistics of Central Tendency
Statistics of Dispersion
Shape Statistics
Statistics of Central Tendency
Mean or average value
Median – the middle value if we order the instances from lowest to highest feature value
Mode – the majority value or values
Statistics of Central Tendency: Mean
The most common measure of the center of a set of data points
$\bar{x} = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$ -- arithmetic mean
$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + \dots + w_N x_N}{w_1 + \dots + w_N}$ -- weighted arithmetic mean
Sensitive to extreme values (outliers)
Trimmed mean– mean value computed after discarding values at the high
and low extremes
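The three statistics above can be computed with Python's standard `statistics` module; the trimmed mean is a small helper, and the data values below are illustrative:

```python
import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def trimmed_mean(xs, trim=1):
    """Arithmetic mean after discarding `trim` values at each extreme."""
    xs = sorted(xs)
    return statistics.mean(xs[trim:len(xs) - trim])

print(statistics.mean(values))    # 58: pulled upward by the outlier 110
print(statistics.median(values))  # 54.0: middle of the ordered values
print(trimmed_mean(values))       # 55.6: extremes 30 and 110 discarded
```

Note how discarding the extremes pulls the mean toward the center, illustrating the mean's sensitivity to outliers.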
Statistics of Central Tendency: Median
The middle value in a set of ordered data values
Is a better measure of the center of the data for skewed (asymmetric) data
Separates the higher half of a data set from the lower half
Expensive to compute when we have a large number of observations
Applicable to numeric and ordinal features
Statistics of Central Tendency: Mode
The value that occurs most frequently in the set
Can be determined for qualitative and quantitative attributes
The greatest frequency might correspond to several different values – such data are
multimodal
For unimodal numeric data that are moderately skewed:
𝑚𝑒𝑎𝑛 − 𝑚𝑜𝑑𝑒 ≈ 3 × (𝑚𝑒𝑎𝑛 − 𝑚𝑒𝑑𝑖𝑎𝑛)
Statistics of Central Tendency
Mean or average value
Median – the middle value if we order the instances from lowest to highest feature value
Mode – the majority value or values
The mode can be calculated whatever the domain of the feature. Ex: blood type.
To calculate the median we need an ordering on the feature values.
To calculate the mean we need the feature to be expressed on some scale.
Statistics of Central Tendency
Statistics of Dispersion
Range
Quantiles
Variance
Standard deviation
Statistics of Dispersion: Range
Let 𝑥 , 𝑥 , … , 𝑥 be a set of observations for some numeric attribute 𝑋
The range of the set is the difference between the largest and smallest
values
Statistics of Dispersion: Quantiles
Suppose that the values of attribute $X$ are sorted in increasing order
Pick certain data points so as to split the data distribution into equal-size
consecutive sets – these data points are called quantiles
Quantiles- data points taken at regular intervals of data distribution, dividing it
into equal-size consecutive sets
The 2-quantile is the data point dividing the lower and upper halves of the data
distribution; it corresponds to the median
The 4-quantiles are the three data points that split the data distribution into four
equal parts, referred to as quartiles
100-quantiles divide the data distribution into 100 equal-sized consecutive sets,
referred to as percentiles
The median, quartiles and percentiles are the most widely used forms of
quantiles
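Python's standard `statistics.quantiles` computes the cut points directly. Note that libraries interpolate between data points, so the exact values depend on the chosen method; the data below are illustrative:

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 gives three cut points, i.e. the quartiles.
# 'inclusive' treats the data as the whole population.
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
print(q1, q2, q3)   # 49.25 54.0 64.75 (interpolated between data points)

# n=100 gives 99 cut points, i.e. the percentiles.
percentiles = statistics.quantiles(data, n=100, method='inclusive')
print(len(percentiles))   # 99
```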
Example: Percentile of GDP
[Figure: percentile plot of GDP per capita, with the first quartile, median, mean, and third quartile marked]
Mean > Median: the mean is more sensitive to outliers.
The median is preferred for skewed distributions like this.
Statistics of Dispersion: Quantiles
Interquartile Range (IQR) – the distance between the first and third quartiles
𝐼𝑄𝑅 = 𝑄3 − 𝑄1
Simple measure of spread that gives the range covered by the middle half of
the data
Ex: Suppose we have the following values for salary (in thousands of dollars): 30,
36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Q1 = $47,000
Q2 = $54,000 (median of the twelve values: the average of 52 and 56)
Q3 = $63,000
IQR = 63 − 47 = $16,000
A common approach for identifying suspected outliers is to single out values
falling at least 1.5 × IQR above the third quartile or below the first quartile
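The 1.5 × IQR rule applied to the salary example above (using the slide's quartile values; quartile conventions vary):

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

# Quartiles as given on the slide:
q1, q3 = 47, 63
iqr = q3 - q1                                  # 16

low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # 23.0 and 87.0
outliers = [x for x in salaries if x < low or x > high]
print(outliers)   # [110]
```

Only the $110,000 salary falls outside the fences, matching the intuition that it is the one extreme value in the data.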
Statistics of Dispersion: Standard Deviation and Variance
Indicates how spread out a data distribution is
Low std means that the data observations tend to be very close to the
mean
High std indicates that the data are spread out over a wide range of
values
The variance of $N$ observations $x_1, x_2, \dots, x_N$ for a numeric attribute $X$ is
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$
The std is equal to the square root of variance
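The variance formula translates directly into code (population variance, dividing by N); the data set below is illustrative:

```python
import math

def variance(xs):
    """Population variance: average squared deviation from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]
var = variance(data)     # 4.0
std = math.sqrt(var)     # 2.0: std is the square root of the variance
print(var, std)
```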
Statistics of Dispersion: Standard Deviation and Variance
The basic properties of std:
std measures spread about the mean and should be used only when the
mean is chosen as the measure of center
std=0 only when there is no spread, i.e. when all observations have the same
value
An observation is unlikely to be more than several stds away from the mean
Histogram: GDP
GDP per capita is a real-valued feature
We can get its mode by means of histogram
The leftmost bin is the mode: a third of the countries have a GDP per capita of not more than $2000.
This distribution is extremely right-skewed, resulting in a mean that is considerably higher than the
median.
Scatter Plots and Data Correlation
Scatter Plot- one of the most effective graphical methods for determining if
there is a relationship between two numeric features
Provides first look for the clusters and outliers, or to explore the possibility of
correlation relationships
Two features X and Y are correlated if the values of one carry information about the values of the other
Correlations can be positive, negative or null (uncorrelated)
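One common way to quantify the relationship a scatter plot suggests is the Pearson correlation coefficient: +1 for a perfect positive linear relationship, −1 for a perfect negative one, near 0 when uncorrelated. A minimal sketch with illustrative data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2 * v for v in x]))   # ≈ 1.0  (perfect positive)
print(pearson(x, [-v for v in x]))      # ≈ -1.0 (perfect negative)
```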
Scatter Plots and Data Correlation
Shape Statistics
Shape Statistics
Skewness: $\text{skewness} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^3$, where $\sigma$ is the standard deviation
Positive skewness indicates that the distribution is right-skewed (the right tail is longer
than the left tail)
Negative skewness indicates that the distribution is left-skewed
Kurtosis: $\text{kurtosis} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^4$
The normal distribution has a kurtosis of 3
Excess kurtosis: kurtosis − 3
Positive excess kurtosis means that the distribution is more sharply peaked than
the normal distribution.
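Both shape statistics are averages of standardized deviations raised to a power. A minimal sketch using the population formulas (dividing by N), with illustrative data:

```python
def moments(xs):
    """Skewness and excess kurtosis from standardized central moments."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = var ** 0.5
    skew = sum(((x - mean) / std) ** 3 for x in xs) / n
    excess_kurtosis = sum(((x - mean) / std) ** 4 for x in xs) / n - 3
    return skew, excess_kurtosis

# A symmetric sample has skewness 0; a long right tail gives positive skewness.
print(moments([1, 2, 3, 4, 5])[0])    # ~0 (symmetric)
print(moments([1, 2, 3, 4, 100])[0])  # positive (right-skewed)
```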
Example: GDP
Kinds of Feature
Feature types and Models
Models treat different kinds of feature in distinct ways
Decision trees
A split on a categorical feature will have as many children as there are feature values
Ordinal and quantitative features lead to binary splits
Tree models ignore the scale of quantitative features, treating them as ordinal
Naïve Bayes
Works well with categorical features
Treats ordinal features as categorical, ignoring the order
Cannot deal with quantitative features unless discretized
Linear Models
Can only handle quantitative features
Linearity assumes Euclidean instance space where features act as Cartesian coordinates
Distance-based methods
Can accommodate all feature types by using an appropriate distance metric
Data Pre-processing: Cleaning
Real data tend to be incomplete and noisy.
1. Missing Values
2. Noisy Data
Data Cleaning: Missing Values
1. Ignore the sample
Usually done when the class label is missing
2. Fill in the missing value manually
Time consuming for large data sets with many missing values
3. Use a global constant to fill in the missing value
4. Use measure of central tendency for the attribute to fill in the missing value
Use mean for normal data distribution, median for skewed data distribution
5. Use the attribute mean or median for all samples belonging to the same class as
the given sample
6. Use the most probable value to fill in the missing value
Can be determined with regression, decision tree induction, etc.
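Strategy 4 (central-tendency imputation) can be sketched as follows, using the median since it is robust to skew; the `ages` values are illustrative:

```python
import statistics

ages = [25, 30, None, 41, 29, None, 35]   # None marks a missing value

# Fill missing values with the median of the observed values.
observed = [a for a in ages if a is not None]
fill = statistics.median(observed)          # 30
filled = [fill if a is None else a for a in ages]
print(filled)   # [25, 30, 30, 41, 29, 30, 35]
```

For normally distributed data the mean would be used instead, as the slide notes.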
Data Cleaning: Noisy Data
Noise is a random error or variance in a measured variable
We need to smooth out the data to remove the noise
1) Binning Methods: Smooth sorted data values by consulting their neighbourhood
2) Regression
3) Outlier analysis – can be detected by clustering
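Smoothing by bin means can be sketched as: sort the values, partition them into equal-frequency bins, and replace each value by its bin's mean. The price values below are a common textbook example:

```python
def smooth_by_bin_means(xs, bin_size):
    """Sort the data, partition into equal-frequency bins,
    and replace each value by its bin's mean."""
    xs = sorted(xs)
    out = []
    for i in range(0, len(xs), bin_size):
        bin_ = xs[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Variants replace each value by the bin median or by the nearest bin boundary instead of the mean.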
Feature Transformations
Aim to improve the utility of a feature by changing, removing or adding
information.
Feature types ordered by the amount of detail they convey:
1. Quantitative
2. Ordinal
3. Categorical
4. Boolean
Feature Transformations
Binarization:
Transforms a categorical feature into a set of Boolean features, one for each
value of the categorical feature.
Loses information.
Needed if model cannot handle more than two feature values.
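Binarization can be sketched as follows: each category becomes one Boolean feature that is true exactly for objects with that value (one-hot encoding); the blood-type values are illustrative:

```python
def binarize(values):
    """One Boolean feature per category of the original feature."""
    categories = sorted(set(values))
    return [[v == c for c in categories] for v in values]

blood_types = ["A", "B", "A", "O"]
print(binarize(blood_types))
# categories: ['A', 'B', 'O']
# [[True, False, False], [False, True, False],
#  [True, False, False], [False, False, True]]
```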
Unordering:
Turns an ordinal feature into a categorical one by discarding the ordering of the
feature values.
Often required since most learning models cannot handle ordinal features
directly.
Feature Transformations
Thresholding:
Transforms a quantitative or ordinal feature into a Boolean feature by finding a
feature value to split on
Let $f: X \to \mathbb{R}$ be a quantitative feature and let $t \in \mathbb{R}$ be a threshold; then $f_t: X \to \{\text{true}, \text{false}\}$ is a Boolean feature defined by $f_t(x) = \text{true}$ if $f(x) \geq t$ and $f_t(x) = \text{false}$ if $f(x) < t$
Threshold can be selected in unsupervised or supervised way
Unsupervised – involves computing some statistic over the data, typically a statistic of
central tendency (mean, median).
Supervised – requires sorting the data on the feature value and traversing down this
ordering to optimize a particular objective function.
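The definition above translates directly: given a feature f and a threshold t, the Boolean feature f_t tests f(x) ≥ t. The `age` feature and the threshold 65 below are illustrative:

```python
def threshold_feature(f, t):
    """Boolean feature f_t: true exactly when f(x) >= t."""
    return lambda x: f(x) >= t

# Hypothetical quantitative feature over person records:
def age(person):
    return person["age"]

is_senior = threshold_feature(age, 65)
print(is_senior({"age": 70}))   # True
print(is_senior({"age": 40}))   # False
```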
Feature Transformations
Discretization:
Multiple threshold case.
Transforms quantitative feature into an ordinal feature.
Unsupervised discretization:
divides the feature values into a predetermined number of bins, e.g. equal-width or equal-frequency bins.
Supervised discretization:
chooses the bin boundaries using class labels, as in supervised thresholding.
Feature Transformations
Normalization:
Unsupervised feature transformation.
Often required to neutralize the effect of different quantitative features being
measured on different scales.
Often understood as expressing the feature on a [0,1] scale (min-max scaling).
Alternatively done by subtracting the mean and dividing by the standard deviation (standardization), which gives zero mean and unit variance rather than a [0,1] range.
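Two common normalizations are min-max scaling, which maps values onto [0, 1], and standardization (z-scoring), which subtracts the mean and divides by the standard deviation. A sketch of both with illustrative data:

```python
def min_max(xs):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Z-scores: subtract the mean, divide by the (population) std."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

data = [10, 20, 30, 40, 50]
print(min_max(data))       # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(data))   # zero mean, unit variance after transform
```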
Feature Transformations
Calibration:
Supervised feature transformation that adds a meaningful scale carrying class
information to arbitrary features.
Allows models that require scale, such as linear classifiers, to handle ordinal and
categorical features.
Calibration Example
Feature Transformations