Machine Learning
Unit - 3
Dr. R.Seeta Sireesha
Associate Professor,
Department of Computer Science and
Engineering
GVPCE (A), Madhurawada.
Feature Engineering
⚫ Feature Extraction and Engineering
⚫ Feature Engineering on Numeric Data
⚫ Feature Engineering on Categorical Data
⚫ Feature Engineering on Text Data
⚫ Feature Engineering on Temporal Data
⚫ Feature Engineering on Image Data
⚫ Feature Scaling
⚫ Feature Selection
Feature Extraction
⚫ The process of feature extraction and engineering is the most
crucial step in the entire Machine Learning pipeline.
⚫ It determines the effectiveness of the model.
⚫ It converts data into features
⚫ It uses domain knowledge, hand-crafted techniques and
mathematical transformations.
Feature Engineering
⚫ Feature Extraction and Feature Engineering are almost
synonyms.
⚫ Data will be in raw format.
⚫ Data has to be pre-processed.
⚫ This includes dealing with bad data, imputing missing values,
transforming specific values, and so on.
⚫ Features are the final end result from the process of feature
engineering, which depicts various representations of the
underlying data.
Feature Engineering Examples
Examples of engineering features:
• Deriving a person’s age from birth date and the current date
• Getting the average and median view count of specific songs and
music videos
• Extracting word and phrase occurrence counts from text
documents
• Extracting pixel information from raw images
• Tabulating occurrences of various grades obtained by students
Why Feature Engineering?
⚫ Better representation of data
⚫ Better performing models
⚫ Essential for model building and evaluation
⚫ More flexibility on data types
⚫ Emphasis on the business and domain
Feature engineering techniques
Feature engineering techniques for the following major data types
are learned:
• Numeric data
• Categorical data
• Text data
• Temporal data
• Image data
Feature Engineering on Numeric Data
⚫ Numeric features can be directly fed to Machine Learning
models.
⚫ We can use numeric variables directly as features without any
form of transformation or engineering.
⚫ These features can indicate values or counts.
Feature Engineering on Numeric Data Cont..
Rather than using raw counts and values directly, we may need to
apply certain transformations to the data for further model building.
⚫ Binarization
⚫ Rounding
⚫ Interactions
Examples
⚫ Binarization: binarizing song listen counts into listened/not listened
⚫ Rounding: expressing values on a scale of 1-10 and on a scale of 1-100
⚫ Interactions: 'Attack', 'Defense', 'Attack^2', 'Attack x Defense', 'Defense^2'
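A minimal sketch of these three operations with pandas and scikit-learn, using small hypothetical data frames (the column names and values below are illustrative, not from a real dataset):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Binarization: we only care whether a song was listened to, not how often
songs = pd.DataFrame({'listen_count': [0, 4, 0, 12, 1]})
songs['listened'] = (songs['listen_count'] > 0).astype(int)

# Rounding: re-express proportions on a 1-10 and a 1-100 scale
ratings = pd.DataFrame({'popularity': [0.231, 0.918, 0.574]})
ratings['scale_10'] = (ratings['popularity'] * 10).round()
ratings['scale_100'] = (ratings['popularity'] * 100).round()

# Interactions: degree-2 terms yield Attack^2, Attack x Defense, Defense^2
stats = pd.DataFrame({'Attack': [49, 62, 82], 'Defense': [49, 63, 83]})
poly = PolynomialFeatures(degree=2, include_bias=False)
interaction_features = poly.fit_transform(stats)
```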
Feature Engineering on Numeric Data Cont..
⚫ Binning: Transforming continuous numeric values into
discrete ones.
⚫ Fixed-Width Binning
⚫ Adaptive Binning
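A brief sketch of both binning strategies with pandas, assuming a small hypothetical series of ages:

```python
import numpy as np
import pandas as pd

ages = pd.Series([5, 17, 23, 31, 46, 52, 67, 71])

# Fixed-width binning: every bin spans the same range (here, 10 years)
fixed_bins = pd.cut(ages, bins=np.arange(0, 81, 10))

# Adaptive binning: quantile-based bins, each holding roughly equal counts
adaptive_bins = pd.qcut(ages, q=4)
```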
Feature Engineering on Numeric Data Cont..
⚫ Statistical Transformations
⚫ Log Transform
After applying the log transform, the original skewed distribution can be
represented by a more Gaussian or normal-like graph.
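A small illustration with NumPy on a hypothetical right-skewed series of view counts; log1p is used so that zero values are handled safely:

```python
import numpy as np
import pandas as pd

views = pd.Series([0, 3, 15, 120, 4500, 982000])   # heavily right-skewed counts
log_views = np.log1p(views)                         # log(1 + x) compresses the long tail
```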
Feature Engineering on Categorical Data
⚫ Categorical variables—nominal and ordinal
⚫ Nominal categorical features are such that there is no concept
of ordering among the values. Use “enumerate” function.
⚫ Examples: Movie or video game genres, weather seasons, and
country names
⚫ Ordinal categorical variables can be ordered and sorted on the
basis of their values. Use “map” function.
⚫ Examples: clothing size, education level
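A minimal sketch of both ideas in pandas, using the enumerate and map functions mentioned above on hypothetical genre and clothing-size columns:

```python
import pandas as pd

df = pd.DataFrame({'Genre': ['RPG', 'Sports', 'RPG', 'Racing'],
                   'Size': ['S', 'XL', 'M', 'L']})

# Nominal: assign arbitrary integer labels via enumerate (no order implied)
genre_map = {genre: idx for idx, genre in enumerate(df['Genre'].unique())}
df['GenreLabel'] = df['Genre'].map(genre_map)

# Ordinal: map labels to integers that respect their natural order
size_map = {'S': 1, 'M': 2, 'L': 3, 'XL': 4}
df['SizeLabel'] = df['Size'].map(size_map)
```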
Encoding Categorical Features
⚫ Purpose of Encoding
If we directly feed these transformed numeric representations
of categorical features into any algorithm, the model will interpret
them as raw numeric features, and so the notion of magnitude will
be wrongly introduced into the system.
Encoding Categorical Features Cont..
⚫ One Hot Encoding Scheme
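A short sketch of one hot encoding, assuming a hypothetical Genre column; a feature with m distinct labels becomes m binary features:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Genre': ['RPG', 'Sports', 'RPG', 'Racing']})

# pandas one-liner: one binary column per distinct label
onehot = pd.get_dummies(df['Genre'], prefix='Genre')

# scikit-learn equivalent, convenient inside ML pipelines
encoder = OneHotEncoder()
onehot_sk = encoder.fit_transform(df[['Genre']]).toarray()
```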
Encoding Categorical Features Cont..
⚫ Dummy Coding Scheme
The dummy coding scheme is similar to the one hot
encoding scheme, except in the case of dummy coding scheme,
when applied on a categorical feature with m distinct labels, we
get m-1 binary features.
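The same hypothetical Genre column, dummy coded by dropping one level so that m labels yield m-1 binary features:

```python
import pandas as pd

df = pd.DataFrame({'Genre': ['RPG', 'Sports', 'RPG', 'Racing']})

# drop_first=True removes one level; that level is represented implicitly
# by a row of all zeros in the remaining m-1 columns
dummies = pd.get_dummies(df['Genre'], prefix='Genre', drop_first=True)
```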
Encoding Categorical Features Cont..
⚫ Effect Coding Scheme
⚫ It is very similar to the dummy coding scheme.
⚫ The row of all 0s in the dummy coding scheme is replaced by a row of -1s in
the effect coding scheme.
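A sketch of effect coding built on top of the dummy coding above: rows that are all 0s under dummy coding are replaced with -1s:

```python
import pandas as pd

df = pd.DataFrame({'Genre': ['RPG', 'Sports', 'RPG', 'Racing']})

effect = pd.get_dummies(df['Genre'], prefix='Genre', drop_first=True).astype(int)
# The level dropped by dummy coding (all-zero rows) becomes a row of -1s
effect.loc[effect.sum(axis=1) == 0] = -1
```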
Feature Engineering on Text Data
⚫ Unstructured attributes - text and images
⚫ Dealing with unstructured attributes is a challenging task.
⚫ Feature engineering on text data involves two major steps:
• Pre-processing and normalizing text
• Feature extraction and engineering
Text Pre-Processing
• Text tokenization and lower casing
• Removing special characters
• Contraction expansion
• Removing stopwords
• Correcting spellings
• Stemming
• Lemmatization
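A minimal, self-contained sketch of a few of these steps in plain Python; the stopword set and contraction map below are tiny illustrative stand-ins, and stemming/lemmatization would typically use a library such as NLTK or spaCy:

```python
import re

STOPWORDS = {'the', 'is', 'a', 'an', 'and', 'of', 'to', 'in', 'it'}   # illustrative only
CONTRACTIONS = {"it's": "it is", "don't": "do not"}                   # illustrative only

def preprocess(text):
    text = text.lower()                                   # lower casing
    for short, full in CONTRACTIONS.items():              # contraction expansion
        text = text.replace(short, full)
    text = re.sub(r'[^a-z\s]', ' ', text)                 # remove special characters
    tokens = text.split()                                 # tokenization
    return [t for t in tokens if t not in STOPWORDS]      # stopword removal

print(preprocess("It's a GREAT day, and the sun is out!"))
```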
Bag of Words Model
TF-IDF Model
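A compact sketch of both the Bag of Words and TF-IDF models with scikit-learn vectorizers, using a tiny hypothetical corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['the sky is blue', 'the sun is bright', 'the sun in the sky is bright']

# Bag of Words: raw term counts per document
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)

# TF-IDF: term counts re-weighted by how rare each term is across documents
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)

print(sorted(bow.vocabulary_))          # the learned vocabulary
print(tfidf_matrix.toarray().round(2))  # one weighted vector per document
```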
Word2Vec Model
A recent advancement in representing text as vectors (word embeddings).
- CBOW (Continuous Bag of Words) Model
- Continuous Skip-Gram Model
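A brief sketch with gensim (assuming gensim 4.x, where the keyword is vector_size; older versions use size) on a toy tokenized corpus; the sg flag switches between the CBOW and skip-gram variants:

```python
from gensim.models import Word2Vec

sentences = [['sky', 'is', 'blue'], ['sun', 'is', 'bright'], ['sun', 'in', 'sky']]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # Skip-gram

vector = cbow.wv['sun']   # the learned 50-dimensional embedding for 'sun'
```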
Feature Engineering on Temporal Data
⚫ Temporal data involves datasets that change over a period of
time.
⚫ Time-series based data is extensively used in multiple domains
like stock, commodity, and weather forecasting.
⚫ A timestamp object consists of a date, a time, and possibly a time-based
offset, which can be used to identify the time zone.
⚫ Date based features
⚫ Each of these features can be used as categorical features and
further feature engineering can be done like one hot encoding,
aggregations, binning, and more.
⚫ Time based features
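A short illustration of deriving date based and time based features with the pandas dt accessor, on a few hypothetical timestamps:

```python
import pandas as pd

ts = pd.DataFrame({'timestamp': pd.to_datetime(
        ['2021-03-08 09:30:00', '2021-07-14 22:05:00', '2022-01-01 00:15:00'])})

# Date based features
ts['year'] = ts['timestamp'].dt.year
ts['month'] = ts['timestamp'].dt.month
ts['day_of_week'] = ts['timestamp'].dt.dayofweek
ts['quarter'] = ts['timestamp'].dt.quarter

# Time based features
ts['hour'] = ts['timestamp'].dt.hour
ts['minute'] = ts['timestamp'].dt.minute
```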
Feature Engineering on Image Data
⚫ Image Metadata Features
Most of the useful feature information can be found in the EXIF data,
which is usually recorded for each image by the device when the picture
is taken.
Features obtainable from the image EXIF data include:
⚫ Image create date and time
⚫ Image dimensions
⚫ Image compression format
⚫ Device make and model
⚫ Image resolution and aspect ratio
⚫ Image artist
⚫ Flash, aperture, focal length, and exposure
Feature Engineering on Image Data Cont..
⚫ Raw Image and Channel Pixels
⚫ An image can be represented by the value of each of its pixels as a two
dimensional array.
⚫ We can use numpy arrays.
⚫ Color images usually have three components also known as channels.
⚫ The R, G, and B channels stand for the red, green, and blue channels,
respectively.
⚫ This can be represented as a three dimensional array (m, n, c) where m
indicates the number of rows in the image, n indicates the number of
columns. These are determined by the image dimensions. The c indicates
which channel it represents (R, G or B).
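A minimal sketch with scikit-image and NumPy; 'sample.jpg' is a hypothetical file name standing in for any RGB image:

```python
from skimage import io

image = io.imread('sample.jpg')      # hypothetical image file
print(image.shape)                   # (m, n, c): rows, columns, channels

# Separate the three channel planes
red, green, blue = image[:, :, 0], image[:, :, 1], image[:, :, 2]
```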
Feature Engineering on Image Data Cont..
⚫ Gray scale Image Pixels
⚫ Converting images to grayscale is necessary to represent a color
image as a two-dimensional array.
⚫ Each pixel value can be computed using the equation
⚫ Y = 0.2125 x R + 0.7154 x G + 0.0721 x B
⚫ Where R, G & B are the pixel values of the three channels and Y
captures the final pixel intensity, which usually ranges from
0 (complete absence of intensity - black) to 1 (complete presence of
intensity - white).
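A sketch of the conversion, both via scikit-image's rgb2gray (which uses these same luma weights) and as a manual NumPy computation of the formula above; 'sample.jpg' is again a hypothetical file:

```python
from skimage import io
from skimage.color import rgb2gray

image = io.imread('sample.jpg')          # hypothetical RGB image
gray = rgb2gray(image)                   # 2-D array of intensities in [0, 1]

# Equivalent manual computation with the weights from the slide
R = image[:, :, 0] / 255.0
G = image[:, :, 1] / 255.0
B = image[:, :, 2] / 255.0
gray_manual = 0.2125 * R + 0.7154 * G + 0.0721 * B
```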
Feature Engineering on Image Data Cont..
⚫ Binning Image Intensity Distribution
⚫ Image Aggregation Statistics
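A brief sketch of both ideas on a grayscale image; a random array stands in for a real image here:

```python
import numpy as np

gray = np.random.rand(64, 64)   # stand-in for a real grayscale image in [0, 1]

# Binning the intensity distribution into 10 fixed-width bins
counts, bin_edges = np.histogram(gray, bins=10, range=(0, 1))

# Aggregation statistics over the pixel intensities
stats = [gray.mean(), gray.std(), gray.min(), gray.max(), np.median(gray)]
```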
Feature Engineering on Image Data Cont..
⚫ Edge Detection
⚫ The Canny edge detector is a widely used edge detection algorithm. It
typically involves using a Gaussian filter with a specific standard
deviation σ (sigma) to smooth and denoise the image.
⚫ A Sobel filter is then applied to extract the image intensity gradients.
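A sketch with scikit-image's canny function on a grayscale image ('sample.jpg' is hypothetical); the sigma parameter controls the Gaussian smoothing applied before the gradient step:

```python
from skimage import io
from skimage.color import rgb2gray
from skimage.feature import canny

image = rgb2gray(io.imread('sample.jpg'))   # hypothetical image, converted to gray
edges = canny(image, sigma=3)               # boolean edge map, same shape as image
```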
Feature Engineering on Image Data Cont..
⚫ HOG algorithm
⚫ The image is normalized and denoised to remove excess illumination
effects.
⚫ First order image gradients are computed to capture image attributes
like contour, texture, and so on.
⚫ Gradient histograms are built on top of these gradients based on specific
windows called cells.
⚫ Finally these cells are normalized and a flattened feature descriptor is
obtained, which can be used as a feature vector for models.
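A sketch using scikit-image's hog function (the parameter values are illustrative); the returned fd array is the flattened feature descriptor mentioned above:

```python
from skimage import io
from skimage.color import rgb2gray
from skimage.feature import hog

image = rgb2gray(io.imread('sample.jpg'))    # hypothetical image
fd, hog_image = hog(image,
                    orientations=8,           # histogram bins per cell
                    pixels_per_cell=(16, 16),
                    cells_per_block=(1, 1),
                    visualize=True)           # also return a HOG visualization
```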
Feature Scaling
⚫ When dealing with numeric features, certain attributes may be
completely unbounded in nature, like view counts of a video or
web page hits.
⚫ Models like linear or logistic regression are sensitive to the
magnitude or scale of the features.
Feature Scaling Cont..
⚫ Standardized Scaling (Standardization technique) - This is also
popularly known as Z-score scaling.
⚫ Min-Max Scaling (Normalization technique) - We can transform
and scale our feature values such that each value is within the
range of [0, 1].
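Both scalers are available in scikit-learn; a quick sketch on a hypothetical column of view counts:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

views = np.array([[1295.0], [25.0], [19000.0], [5.0], [1.0], [300.0]])

standardized = StandardScaler().fit_transform(views)   # (x - mean) / std deviation
normalized = MinMaxScaler().fit_transform(views)       # (x - min) / (max - min), in [0, 1]
```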
Feature Scaling Cont..
Normalization vs. Standardization
⚫ Normalization rescales values to a range between 0 and 1; Standardization
centers data around the mean and scales to a standard deviation of 1.
⚫ Normalization is useful when the distribution of the data is unknown or not
Gaussian; Standardization is useful when the distribution of the data is
Gaussian or unknown.
⚫ Normalization is sensitive to outliers; Standardization is less sensitive to
outliers.
⚫ Normalization retains the shape of the original distribution; Standardization
changes the shape of the original distribution.
⚫ Normalization may not preserve the relationships between the data points;
Standardization preserves the relationships between the data points.
⚫ Normalization equation: (x – min)/(max – min); Standardization equation:
(x – mean)/standard deviation
Feature Selection
We have to select an optimal number of features to train and build
models that generalize very well on the data and prevent
overfitting.
They are classified as:
⚫ Filter methods
⚫ Wrapper methods
⚫ Embedded methods
Filter methods
⚫ These techniques select features purely based on metrics like
correlation, mutual information etc.
⚫ These methods do not depend on results obtained from any model
and usually check the relationship of each feature with the response
variable to be predicted.
⚫ Popular methods include threshold based methods and
statistical tests.
Threshold based methods
This is a filter based feature selection strategy, where you can use
some form of cut-off or thresholding for limiting the total number
of features during feature selection.
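One common threshold based approach is a variance cut-off, available in scikit-learn as VarianceThreshold; a small illustrative sketch:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 1, 3.2],
              [0, 0, 1.1],
              [0, 1, 4.7]])

# Drop features whose variance falls below the cut-off
# (the first column is constant and gets removed)
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)
```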
Statistical Methods
⚫ To select features based on univariate statistical tests.
⚫ Techniques available are: Mutual information, ANOVA
(analysis of variance) and chi-square tests.
⚫ Based on scores obtained from these statistical tests, you can
select the best features on the basis of their score.
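A sketch with scikit-learn's SelectKBest using the chi-square test on the Iris dataset (chosen here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the two features with the highest chi-square scores w.r.t. the target
selector = SelectKBest(score_func=chi2, k=2)
X_best = selector.fit_transform(X, y)
print(selector.scores_)
```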
Wrapper methods
⚫ These techniques use a recursive approach to build multiple
models using feature subsets and select the best subset of
features giving us the best performing model.
⚫ Methods like forward selection and backward elimination are
popular wrapper based methods.
Recursive Feature Elimination (RFE)
⚫ Recursive Feature Elimination, also known as RFE, is a popular wrapper
method.
⚫ This strategy is also popularly known as backward elimination.
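A sketch of RFE in scikit-learn, wrapped around a logistic regression estimator on the Iris dataset (both choices are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursively drop the weakest feature until only two remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 marks a selected feature
```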
Embedded methods
⚫ These techniques try to combine the benefits of the other two
methods by leveraging Machine Learning models themselves to
rank and score feature variables based on their importance.
⚫ Tree based methods like decision trees and ensemble methods
like random forests are popular examples of embedded methods.
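A sketch of an embedded approach using a random forest's built-in importance scores (the dataset is again Iris, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
# Importance scores learned by the model itself rank the features
print(model.feature_importances_)
```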
Feature Extraction
⚫ The basic objective of feature extraction is to extract new
features from the existing set of features such that the higher-
dimensional dataset with many features can be reduced into a
lower-dimensional dataset of these newly created features.
⚫ Popular technique of linear data transformation from higher to
lower dimensions is Principal Component Analysis, also known
as PCA.
Dimensionality Reduction
A very popular technique of linear data transformation from higher
to lower dimensions is Principal Component Analysis, also known as
PCA.
Principal component analysis, is a statistical method that uses the
process of linear, orthogonal transformation to transform a higher-
dimensional set of features that could be possibly correlated into a
lower-dimensional set of linearly uncorrelated features.
PCA
⚫ In any PCA transformation, the total number of PCs is always
less than or equal to the initial number of features.
⚫ The first principal component tries to capture the maximum
variance of the original set of features.
⚫ Each of the succeeding components tries to capture as much of the
remaining variance as possible while being orthogonal to the preceding
components.
PCA cont..
⚫ The set of initial features of dimension D has to be reduced to a subset of
extracted principal components of a lower dimension LD.
⚫ The matrix decomposition process of Singular Value
Decomposition is extremely useful in obtaining the principal
components.
PCA cont..
STEP 1: STANDARDIZATION
Calculate the Mean and Standard Deviation for each feature, and use them to
standardize each value as (x – mean)/standard deviation.
STEP 2: COVARIANCE MATRIX COMPUTATION
Note that Covariance(X, X) is the Variance of X.
PCA cont..
The covariance matrix is symmetric (COV(X, Y) = COV(Y, X)).
If an entry of the covariance matrix is positive, the corresponding variables are positively
correlated (if X increases, Y also increases, and vice versa).
If an entry of the covariance matrix is negative, the corresponding variables are inversely
correlated (if X increases, Y decreases, and vice versa).
STEP 4: FEATURE VECTOR
To determine the principal components, you have to compute the eigenvalues and
eigenvectors of the covariance matrix.
Let A be any square matrix. A non-zero vector v is an eigenvector of A if
Av = λv
for some number λ, called the corresponding eigenvalue.
PCA cont..
Then, substitute each eigenvalue in the equation (A - λI)v = 0 and solve it to obtain
the corresponding eigenvectors.
Now, arrange the eigenvalues in descending order and pick the topmost ones; the
eigenvectors corresponding to these eigenvalues are the principal components.
STEP 5: RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS AXES
Final Data Set= Standardized Original Data Set * FeatureVector
PCA (simplified)
The steps involved in PCA Algorithm are as follows-
Step-01: Get data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigen vectors and eigen values of the covariance matrix.
Step-06: Choosing components and forming a feature vector.
Step-07: Deriving the new data set.
Finally, select the eigenvectors with the largest eigenvalues (this set is now known as
the feature vector) and multiply the mean-subtracted data with it to derive the
new data set.
Final Data Set= Standardized Original Data Set * FeatureVector
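A compact NumPy sketch of these steps on a small hypothetical data matrix; it mirrors the algorithm above rather than any particular library implementation:

```python
import numpy as np

# Step-01: toy data, 5 samples x 3 features (hypothetical values)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.2],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1]])

# Step-02/03: compute the mean vector and subtract it from the data
X_centered = X - X.mean(axis=0)

# Step-04: covariance matrix of the centered data
cov = np.cov(X_centered, rowvar=False)

# Step-05: eigenvalues and eigenvectors of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Step-06: sort by eigenvalue (descending) and keep the top 2 components
order = np.argsort(eig_vals)[::-1]
feature_vector = eig_vecs[:, order[:2]]

# Step-07: Final Data Set = centered (standardized) data set * FeatureVector
X_new = X_centered @ feature_vector
print(X_new.shape)    # (5, 2)
```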