CMPE 442 Introduction to Machine Learning
Features
Features
Data set – a set of data objects
Data object – an entity described by a set of attributes
Feature – a data field representing a characteristic of a data object
Workhorses of ML
Mapping from instance space to the feature space
Distinguish features by:
domain types
range of permissible operations
Features
Workhorses of ML
Mapping from instance space to the feature space
Distinguish features by:
domain types
range of permissible operations
Ex: Consider two features: a person's age and a house number. While both are
numbers, house numbers are ordinal, so taking their average is meaningless.
What matters is not just the domain of a feature but also the range of
permissible operations.
Getting to Know Your Data
What are the types of features?
What kind of values does each feature have?
Which features are discrete and which are continuous valued?
How are the values distributed?
Can we spot any outliers?
…
Kinds of Feature
Numerical
Features with numerical scale
Often involve mapping to reals
Continuous
Ex: age, price, etc.
Ordinal
Features with an ordering but without scale
Some totally ordered set
Ex: set of characters, strings, house numbers, etc.
Allow mode and median as central statistics, and quantiles as dispersion statistics
Categorical
Features without ordering or scale
Allow no statistical summary except for the mode
A Boolean feature is a special case of a categorical feature
Kinds of Feature
Categorical and Ordinal features are qualitative:
Describe a property of an object without giving an actual size or quantity
Numerical features are quantitative
Calculations of Features
Aggregates or Statistics
Main categories:
Statistics of Central Tendency
Statistics of Dispersion
Shape Statistics
Statistics of Central Tendency
Mean or average value
Median – the middle value if we order the instances from lowest to highest feature value
Mode – the majority value or values
Statistics of Central Tendency: Mean
The most common measure of the center of a set of data points
$\bar{x} = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$ -- arithmetic mean
$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + \dots + w_N x_N}{w_1 + \dots + w_N}$ -- weighted arithmetic mean
Sensitive to extreme values (outliers)
Trimmed mean– mean value computed after discarding values at the high
and low extremes
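The three statistics above can be computed with Python's standard `statistics` module; the trimmed mean is a small helper, and the data values below are illustrative:

```python
import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def trimmed_mean(xs, trim=1):
    """Arithmetic mean after discarding `trim` values at each extreme."""
    xs = sorted(xs)
    return statistics.mean(xs[trim:len(xs) - trim])

print(statistics.mean(values))    # 58: pulled upward by the outlier 110
print(statistics.median(values))  # 54.0: middle of the ordered values
print(trimmed_mean(values))       # 55.6: extremes 30 and 110 discarded
```

Note how discarding the extremes pulls the mean toward the center, illustrating the mean's sensitivity to outliers.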
Statistics of Central Tendency: Median
The middle value in a set of ordered data values
Is a better measure of the center of the data for skewed (asymmetric) data
Separates the higher half of a data set from the lower half
Expensive to compute when we have a large number of observations
Applicable to numeric and ordinal features
Statistics of Central Tendency: Mode
The value that occurs most frequently in the set
Can be determined for qualitative and quantitative attributes
The greatest frequency might correspond to several different values – such data are
multimodal
For unimodal numeric data that are moderately skewed:
𝑚𝑒𝑎𝑛 − 𝑚𝑜𝑑𝑒 ≈ 3 × (𝑚𝑒𝑎𝑛 − 𝑚𝑒𝑑𝑖𝑎𝑛)
Statistics of Central Tendency
Mean or average value
Median – the middle value if we order the instances from lowest to highest feature value
Mode – the majority value or values
The mode can be calculated whatever the domain of the feature. Ex: blood type.
To calculate the median we need an ordering on the feature values.
To calculate the mean we need the feature to be expressed on some scale.
Statistics of Central Tendency
Statistics of Dispersion
Range
Quantiles
Variance
Standard deviation
Statistics of Dispersion: Range
Let 𝑥 , 𝑥 , … , 𝑥 be a set of observations for some numeric attribute 𝑋
The range of the set is the difference between the largest and smallest
values
Statistics of Dispersion: Quantiles
Suppose that the values of attribute $X$ are sorted in increasing order
Pick certain data points so as to split the data distribution into equal-size
consecutive sets – these data points are called quantiles
Quantiles- data points taken at regular intervals of data distribution, dividing it
into equal-size consecutive sets
The 2-quantile is the data point dividing the lower and upper halves of the data
distribution; it corresponds to the median
The 4-quantiles are the three data points that split the data distribution into four
equal parts, referred to as quartiles
100-quantiles divide the data distribution into 100 equal-sized consecutive sets,
referred to as percentiles
The median, quartiles and percentiles are the most widely used forms of
quantiles
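Python's standard `statistics.quantiles` computes the cut points directly. Note that libraries interpolate between data points, so the exact values depend on the chosen method; the data below are illustrative:

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# n=4 gives three cut points, i.e. the quartiles.
# 'inclusive' treats the data as the whole population.
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
print(q1, q2, q3)   # 49.25 54.0 64.75 (interpolated between data points)

# n=100 gives 99 cut points, i.e. the percentiles.
percentiles = statistics.quantiles(data, n=100, method='inclusive')
print(len(percentiles))   # 99
```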
Example: Percentile of GDP
[Figure: percentile plot of GDP per capita, with the first quartile, median, mean, and third quartile marked]
Mean > Median: the mean is more sensitive to outliers.
The median is preferred for skewed distributions like this.
Statistics of Dispersion: Quantiles
Interquartile Range (IQR) – the distance between the first and third quartiles
𝐼𝑄𝑅 = 𝑄3 − 𝑄1
Simple measure of spread that gives the range covered by the middle half of
the data
Ex: Suppose we have the following values for salary (in thousands of dollars): 30,
36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Q1 = $47,000
Q2 = $54,000 (median of the twelve values: the average of 52 and 56)
Q3 = $63,000
IQR = 63 − 47 = $16,000
A common approach for identifying suspected outliers is to single out values
falling at least 1.5 × IQR above the third quartile or below the first quartile
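The 1.5 × IQR rule applied to the salary example above (using the slide's quartile values; quartile conventions vary):

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

# Quartiles as given on the slide:
q1, q3 = 47, 63
iqr = q3 - q1                                  # 16

low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # 23.0 and 87.0
outliers = [x for x in salaries if x < low or x > high]
print(outliers)   # [110]
```

Only the $110,000 salary falls outside the fences, matching the intuition that it is the one extreme value in the data.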
Statistics of Dispersion: Standard Deviation and Variance
Indicates how spread out a data distribution is
Low std means that the data observations tend to be very close to the
mean
High std indicates that the data are spread out over a wide range of
values
The variance of $N$ observations $x_1, x_2, \dots, x_N$ for a numeric attribute $X$ is
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$
The std is equal to the square root of variance
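The variance formula translates directly into code (population variance, dividing by N); the data set below is illustrative:

```python
import math

def variance(xs):
    """Population variance: average squared deviation from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]
var = variance(data)     # 4.0
std = math.sqrt(var)     # 2.0: std is the square root of the variance
print(var, std)
```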
Statistics of Dispersion: Standard Deviation and Variance
The basic properties of std:
std measures spread about the mean and should be used only when the
mean is chosen as the measure of center
std=0 only when there is no spread, i.e. when all observations have the same
value
An observation is unlikely to be more than several stds away from the mean
Histogram: GDP
GDP per capita is a real-valued feature
We can get its mode by means of histogram
The leftmost bin is the mode: a third of the countries have a GDP per capita of not more than $2000.
This distribution is extremely right-skewed, resulting in a mean that is considerably higher than the
median.
Scatter Plots and Data Correlation
Scatter Plot- one of the most effective graphical methods for determining if
there is a relationship between two numeric features
Provides first look for the clusters and outliers, or to explore the possibility of
correlation relationships
Two features X and Y are correlated if the values of one carry information about the values of the other
Correlations can be positive, negative or null (uncorrelated)
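One common way to quantify the relationship a scatter plot suggests is the Pearson correlation coefficient: +1 for a perfect positive linear relationship, −1 for a perfect negative one, near 0 when uncorrelated. A minimal sketch with illustrative data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2 * v for v in x]))   # ≈ 1.0  (perfect positive)
print(pearson(x, [-v for v in x]))      # ≈ -1.0 (perfect negative)
```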
Scatter Plots and Data Correlation
Shape Statistics
Shape Statistics
Skewness: $\text{skewness} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^3$, where $\sigma$ is the standard deviation
Positive skewness indicates that the distribution is right-skewed (the right tail is longer
than the left tail)
Negative skewness indicates that the distribution is left-skewed
Kurtosis: $\text{kurtosis} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^4$
The normal distribution has a kurtosis of 3
Excess kurtosis: kurtosis − 3
Positive excess kurtosis means that the distribution is more sharply peaked than
the normal distribution.
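Both shape statistics are averages of standardized deviations raised to a power. A minimal sketch using the population formulas (dividing by N), with illustrative data:

```python
def moments(xs):
    """Skewness and excess kurtosis from standardized central moments."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = var ** 0.5
    skew = sum(((x - mean) / std) ** 3 for x in xs) / n
    excess_kurtosis = sum(((x - mean) / std) ** 4 for x in xs) / n - 3
    return skew, excess_kurtosis

# A symmetric sample has skewness 0; a long right tail gives positive skewness.
print(moments([1, 2, 3, 4, 5])[0])    # ~0 (symmetric)
print(moments([1, 2, 3, 4, 100])[0])  # positive (right-skewed)
```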
Example: GDP
Kinds of Feature
Feature types and Models
Models treat different kinds of feature in distinct ways
Decision trees
A split on a categorical feature will have as many children as there are feature values
Ordinal and quantitative features lead to binary splits
Tree models ignore the scale of quantitative features, treating them as ordinal
Naïve Bayes
Works well with categorical features
Treats ordinal features as categorical, ignoring the order
Cannot deal with quantitative features unless discretized
Linear Models
Can only handle quantitative features
Linearity assumes Euclidean instance space where features act as Cartesian coordinates
Distance-based methods
Can accommodate all feature types by using an appropriate distance metric
Data Pre-processing: Cleaning
Real data tend to be incomplete and noisy.
1. Missing Values
2. Noisy Data
Data Cleaning: Missing Values
1. Ignore the sample
Usually done when the class label is missing
2. Fill in the missing value manually
Time consuming for large data sets with many missing values
3. Use a global constant to fill in the missing value
4. Use measure of central tendency for the attribute to fill in the missing value
Use mean for normal data distribution, median for skewed data distribution
5. Use the attribute mean or median for all samples belonging to the same class as
the given sample
6. Use the most probable value to fill in the missing value
Can be determined with regression, decision tree induction, etc.
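Strategy 4 (central-tendency imputation) can be sketched as follows, using the median since it is robust to skew; the `ages` values are illustrative:

```python
import statistics

ages = [25, 30, None, 41, 29, None, 35]   # None marks a missing value

# Fill missing values with the median of the observed values.
observed = [a for a in ages if a is not None]
fill = statistics.median(observed)          # 30
filled = [fill if a is None else a for a in ages]
print(filled)   # [25, 30, 30, 41, 29, 30, 35]
```

For normally distributed data the mean would be used instead, as the slide notes.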
Data Cleaning: Noisy Data
Noise is a random error or variance in a measured variable
We need to smooth out the data to remove the noise
1) Binning Methods: Smooth sorted data values by consulting their neighbourhood
2) Regression
3) Outlier analysis – can be detected by clustering
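Smoothing by bin means can be sketched as: sort the values, partition them into equal-frequency bins, and replace each value by its bin's mean. The price values below are a common textbook example:

```python
def smooth_by_bin_means(xs, bin_size):
    """Sort the data, partition into equal-frequency bins,
    and replace each value by its bin's mean."""
    xs = sorted(xs)
    out = []
    for i in range(0, len(xs), bin_size):
        bin_ = xs[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Variants replace each value by the bin median or by the nearest bin boundary instead of the mean.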
Feature Transformations
Aim to improve the utility of a feature by changing, removing or adding
information.
Feature types ordered by the amount of detail they convey:
1. Quantitative
2. Ordinal
3. Categorical
4. Boolean
Feature Transformations
Binarization:
Transforms a categorical feature into a set of Boolean features, one for each
value of the categorical feature.
Loses information.
Needed if model cannot handle more than two feature values.
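Binarization can be sketched as follows: each category becomes one Boolean feature that is true exactly for objects with that value (one-hot encoding); the blood-type values are illustrative:

```python
def binarize(values):
    """One Boolean feature per category of the original feature."""
    categories = sorted(set(values))
    return [[v == c for c in categories] for v in values]

blood_types = ["A", "B", "A", "O"]
print(binarize(blood_types))
# categories: ['A', 'B', 'O']
# [[True, False, False], [False, True, False],
#  [True, False, False], [False, False, True]]
```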
Unordering:
Turns an ordinal feature into a categorical one by discarding the ordering of the
feature values.
Often required since most learning models cannot handle ordinal features
directly.
Feature Transformations
Thresholding:
Transforms a quantitative or ordinal feature into a Boolean feature by finding a
feature value to split on
Let $f: X \to \mathbb{R}$ be a quantitative feature and let $t \in \mathbb{R}$ be a threshold; then $f_t: X \to \{\text{true}, \text{false}\}$ is a Boolean feature defined by $f_t(x) = \text{true}$ if $f(x) \geq t$ and $f_t(x) = \text{false}$ if $f(x) < t$
Threshold can be selected in unsupervised or supervised way
Unsupervised – involves computing some statistic over the data, typically a statistic of
central tendency (mean, median).
Supervised – requires sorting the data on the feature value and traversing down this
ordering to optimize a particular objective function.
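The definition above translates directly: given a feature f and a threshold t, the Boolean feature f_t tests f(x) ≥ t. The `age` feature and the threshold 65 below are illustrative:

```python
def threshold_feature(f, t):
    """Boolean feature f_t: true exactly when f(x) >= t."""
    return lambda x: f(x) >= t

# Hypothetical quantitative feature over person records:
def age(person):
    return person["age"]

is_senior = threshold_feature(age, 65)
print(is_senior({"age": 70}))   # True
print(is_senior({"age": 40}))   # False
```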
Feature Transformations
Discretization:
Multiple threshold case.
Transforms quantitative feature into an ordinal feature.
Unsupervised discretization:
divides the feature values into a predetermined number of bins, e.g. equal-width or equal-frequency bins.
Supervised discretization:
chooses the bin boundaries using class labels, as in supervised thresholding.
Feature Transformations
Normalization:
Unsupervised feature transformation.
Often required to neutralize the effect of different quantitative features being
measured on different scales.
Often understood as expressing the feature on a [0,1] scale (min-max scaling).
Alternatively done by subtracting the mean and dividing by the standard deviation (standardization), which gives zero mean and unit variance rather than a [0,1] range.
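Two common normalizations are min-max scaling, which maps values onto [0, 1], and standardization (z-scoring), which subtracts the mean and divides by the standard deviation. A sketch of both with illustrative data:

```python
def min_max(xs):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Z-scores: subtract the mean, divide by the (population) std."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

data = [10, 20, 30, 40, 50]
print(min_max(data))       # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(data))   # zero mean, unit variance after transform
```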
Feature Transformations
Calibration:
Supervised feature transformation that adds a meaningful scale carrying class
information to arbitrary features.
Allows models that require scale, such as linear classifiers, to handle ordinal and
categorical features.
Calibration Example
Feature Transformations