What is a feature?
Generally, all machine learning algorithms take input data to generate an output. The
input data is usually in tabular form, consisting of rows (instances or observations)
and columns (variables or attributes), and these attributes are often known as
features. For example, in computer vision an image is an instance, and an edge or
line within the image could be a feature. Similarly, in NLP a document can be an
observation, and the word count could be a feature. So, we can say a feature is an
attribute of the data that influences the problem or is useful for solving it.
What is Feature Engineering?
Feature engineering is the pre-processing step of machine learning in which
features are extracted from raw data. It helps represent the underlying problem to
predictive models in a better way, which in turn improves the accuracy of the
model on unseen data. A predictive model consists of predictor variables and an
outcome variable, and the feature engineering process selects the most useful
predictor variables for the model.
Since 2016, automated feature engineering has also been used in various machine learning
software packages to extract features from raw data automatically. Feature
engineering in ML mainly involves four processes: Feature Creation,
Transformations, Feature Extraction, and Feature Selection.
These processes are described as below:
1. Feature Creation: Feature creation is finding the most useful variables to be
used in a predictive model. The process is subjective and requires human
creativity and intervention. New features are created by combining existing
features using operations such as addition, subtraction, and ratios, and these
new features offer great flexibility.
2. Transformations: The transformation step of feature engineering involves
adjusting the predictor variables to improve the accuracy and performance of
the model. For example, it ensures that the model is flexible enough to accept a
variety of input data and that all the variables are on the same scale,
making the model easier to understand. It improves the model's accuracy and
ensures that all the features stay within an acceptable range to avoid
computational errors.
3. Feature Extraction: Feature extraction is an automated feature engineering
process that generates new variables by extracting them from the raw data.
The main aim of this step is to reduce the volume of data so that it can be
easily used and managed for data modelling. Feature extraction methods
include cluster analysis, text analytics, edge detection algorithms, and
principal components analysis (PCA).
4. Feature Selection: While developing a machine learning model, only a few of
the variables in the dataset are useful for building the model; the remaining
features are either redundant or irrelevant. Feeding the dataset with all of these
redundant and irrelevant features into the model may negatively impact its
overall performance and accuracy. Hence it is very important to
identify and select the most appropriate features from the data and remove
the irrelevant or less important ones, which is done with the help of feature
selection in machine learning. "Feature selection is a way of selecting a
subset of the most relevant features from the original feature set by
removing the redundant, irrelevant, or noisy features."
Below are some benefits of using feature selection in machine learning:
○ It helps in avoiding the curse of dimensionality.
○ It helps in the simplification of the model so that the researchers can easily
interpret it.
○ It reduces the training time.
○ It reduces overfitting and hence enhances generalization.
Need for Feature Engineering in Machine Learning
In machine learning, the performance of a model depends on data pre-processing
and data handling. If we create a model without pre-processing or proper data handling,
it may not give good accuracy, whereas applying feature engineering to the
same model enhances its accuracy. Hence, feature
engineering in machine learning improves the model's performance. Below are some
points that explain the need for feature engineering:
○ Better features mean flexibility. In machine learning, we always try to choose
the optimal model to get good results. However, sometimes we can still get good
predictions even after choosing a less suitable model, and this is because of better
features. Flexible features also let you select less complex
models, which are faster to run and easier to understand and maintain,
and that is always desirable.
○ Better features mean simpler models. If we feed well-engineered
features to our model, then even with sub-optimal parameters
we can obtain good outcomes. After feature engineering, it is
not necessary to work as hard at picking the right model with the most optimized
parameters. If we have good features, we can better represent the complete
data and use it to best characterize the given problem.
○ Better features mean better results. As already discussed, in machine
learning the quality of the output depends on the data we provide. So, to obtain better
results, we need to use better features.
Steps in Feature Engineering
The steps of feature engineering may vary as per different data scientists and ML
engineers. However, there are some common steps that are involved in most
machine learning algorithms, and these steps are as follows:
○ Data Preparation: The first step is data preparation. In this step, raw data
acquired from different sources is put into a suitable format
so that it can be used by the ML model. Data preparation may involve
data cleaning, delivery, augmentation, fusion, ingestion, or loading.
○ Exploratory Analysis: Exploratory analysis, or exploratory data analysis (EDA),
is an important step of feature engineering that is mainly used by data
scientists. This step involves analyzing and investigating the dataset and
summarizing its main characteristics. Different data visualization techniques are
used to better understand the data sources and how to manipulate them, to find the most
appropriate statistical technique for data analysis, and to select the best
features for the data (a minimal sketch follows this list).
○ Benchmark: Benchmarking is the process of setting a standard baseline for
accuracy against which all the variables are compared. The benchmarking
process is used to improve the predictability of the model and reduce the error
rate.
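As a minimal illustration of the exploratory analysis step, the short sketch below uses pandas and matplotlib on a hypothetical file named data.csv; the file name and its columns are assumptions made only for the example.

```python
# A minimal EDA sketch; "data.csv" is a hypothetical file used only for illustration.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

print(df.shape)          # number of observations (rows) and features (columns)
print(df.describe())     # summary statistics of the numerical features
print(df.isna().sum())   # count of missing values per column

df.hist(figsize=(10, 8)) # distribution of each numerical feature
plt.tight_layout()
plt.show()
```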
Feature Engineering Techniques
Some of the popular feature engineering techniques include:
1. Imputation
Feature engineering deals with inappropriate data, missing values, human
intervention, general errors, insufficient data sources, and so on. Missing values within the
dataset strongly affect the performance of the algorithm, and the "imputation"
technique is used to deal with them. Imputation is responsible for handling
irregularities within the dataset.
For example, an entire row or column can be removed when it has a huge percentage
of missing values. But at the same time, to maintain
the data size, it is often necessary to impute the missing data instead, which can be done as follows:
○ For numerical data imputation, a default value can be imputed in a column, or
missing values can be filled with the mean or median of the column.
○ For categorical data imputation, missing values can be replaced with the
most frequently occurring value in the column.
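A minimal imputation sketch with pandas is shown below; the columns "age" and "city" are hypothetical examples of a numerical and a categorical feature.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, np.nan],               # numerical feature with missing values
    "city": ["Delhi", None, "Mumbai", "Delhi", None],  # categorical feature with missing values
})

# Numerical imputation: fill missing values with the column mean (the median works too).
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical imputation: fill missing values with the most frequently occurring value.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```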
2. Handling Outliers
Outliers are deviated values or data points that lie so far away from the other
data points that they badly affect the performance of the model.
This feature engineering technique first identifies the outliers and then removes them.
Standard deviation can be used to identify outliers. For example, each value
lies at some distance from the average, and if a value lies farther away
than a chosen threshold, it can be considered an outlier. The z-score can also be used to
detect outliers.
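The z-score idea can be sketched as follows; the synthetic data and the threshold of 3 standard deviations are illustrative assumptions, not fixed rules.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=200), 120.0)  # 120 is an injected outlier

z_scores = (values - values.mean()) / values.std()  # distance from the mean in std deviations
outlier_mask = np.abs(z_scores) > 3                  # flag points more than 3 std devs away
cleaned = values[~outlier_mask]                      # keep only the non-outlier points

print("Outliers found:", values[outlier_mask])
```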
3. Log transform
Logarithm transformation, or log transform, is one of the most commonly used
mathematical techniques in machine learning. Log transform helps in handling
skewed data and makes the distribution closer to normal after
transformation. It also reduces the effect of outliers, because normalizing the
magnitude differences makes the model much more robust.
Note: Log transformation is only applicable to positive values; otherwise, it will give an
error. To avoid this, we can add 1 to the data before transformation, which ensures
the values passed to the transform stay positive.
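A short sketch of the log transform is given below; np.log1p computes log(1 + x), which implements the "add 1" trick mentioned in the note, and the sample values are arbitrary.

```python
import numpy as np

skewed = np.array([0, 1, 2, 5, 10, 50, 300, 2000])  # right-skewed, non-negative values

log_transformed = np.log1p(skewed)  # log(1 + x): safe for zeros and compresses large values
print(log_transformed)
```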
4. Binning
In machine learning, overfitting is one of the main issues that degrades the
performance of a model, and it occurs due to a greater number of parameters
and noisy data. One of the popular feature engineering techniques,
"binning", can be used to normalize the noisy data. This process involves segmenting
the values of a feature into bins.
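A small binning sketch with pandas follows; the age values, bin edges, and labels are arbitrary assumptions for the example.

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 34, 48, 62, 71, 85])

# Segment the continuous "age" feature into four labelled bins.
age_bins = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                  labels=["child", "young adult", "adult", "senior"])
print(age_bins)
```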
5. Feature Split
As the name suggests, feature split is the process of splitting a feature into
two or more parts in order to make new features. This technique helps the
algorithms better understand and learn the patterns in the dataset.
The feature splitting process enables the new features to be clustered and binned,
which results in extracting useful information and improving the performance of the
data models.
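The sketch below splits a hypothetical name column and a timestamp column into new features with pandas; the column names and values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "timestamp": pd.to_datetime(["2021-05-01 10:30", "2021-12-25 18:45"]),
})

# Split a text feature into two new features.
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Split a datetime feature into several informative parts.
df["month"] = df["timestamp"].dt.month
df["hour"] = df["timestamp"].dt.hour

print(df)
```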
6. One hot encoding
One-hot encoding is a popular encoding technique in machine learning. It converts
categorical data into a form that can be easily
understood by machine learning algorithms and hence used to make good predictions.
It enables the grouping of categorical data without losing any information.
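A minimal one-hot encoding sketch with pandas is shown below; the "colour" column is a hypothetical categorical feature.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 column, so no information is lost.
encoded = pd.get_dummies(df, columns=["colour"], prefix="colour")
print(encoded)
```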
What is Principal Component Analysis(PCA)?
The Principal Component Analysis (PCA) technique was introduced by the
mathematician Karl Pearson in 1901. It works on the condition that when
data in a higher-dimensional space is mapped to a lower-dimensional
space, the variance of the data in the lower-dimensional space should be
maximum.
● Principal Component Analysis (PCA) is a statistical procedure that
uses an orthogonal transformation to convert a set of correlated
variables into a set of uncorrelated variables. PCA is one of the most widely
used tools in exploratory data analysis and in machine learning for
predictive models.
● Principal Component Analysis (PCA) is an unsupervised learning
technique used to examine the interrelations among a set of
variables. It is also known as general factor analysis, where
regression determines a line of best fit.
● The main goal of Principal Component Analysis (PCA) is to reduce the
dimensionality of a dataset while preserving the most important
patterns or relationships between the variables without any prior
knowledge of the target variables.
Principal Component Analysis (PCA) is used to reduce the dimensionality of a
data set by finding a new set of variables, smaller than the original set of
variables, retaining most of the sample’s information, and useful for the
regression and classification of data.
1. Principal Component Analysis (PCA) is a technique for dimensionality
reduction that identifies a set of orthogonal axes, called principal
components, that capture the maximum variance in the data. The
principal components are linear combinations of the original
variables in the dataset and are ordered in decreasing order of
importance. The total variance captured by all the principal
components is equal to the total variance in the original dataset.
2. The first principal component captures the most variation in the data,
the second principal component captures the maximum remaining variance
that is orthogonal to the first principal component, and so on.
3. Principal Component Analysis can be used for a variety of purposes,
including data visualization, feature selection, and data compression.
In data visualization, PCA can be used to plot high-dimensional data
in two or three dimensions, making it easier to interpret. In feature
selection, PCA can be used to identify the most important variables in
a dataset. In data compression, PCA can be used to reduce the size of
a dataset without losing important information.
4. In Principal Component Analysis, it is assumed that the information is
carried in the variance of the features, that is, the higher the variation
in a feature, the more information that features carries.
Overall, PCA is a powerful tool for data analysis and can help to simplify
complex datasets, making them easier to understand and work with.
Step-By-Step Explanation of PCA (Principal Component Analysis)
Step 1: Standardization
First, we need to standardize our dataset to ensure that each variable has a
mean of 0 and a standard deviation of 1.
Step 2: Covariance Matrix Computation
Covariance measures the strength of the joint variability between two variables,
indicating how much they change in relation to each other. To find
the covariance between two variables x1 and x2 over n observations, we can use the formula:
cov(x1, x2) = Σ (x1_i - mean(x1)) (x2_i - mean(x2)) / (n - 1)
The value of covariance can be positive, negative, or zero.
● Positive: as x1 increases, x2 also increases.
● Negative: as x1 increases, x2 decreases.
● Zero: no direct relation.
Step 3: Compute Eigenvalues and Eigenvectors of Covariance Matrix to
Identify Principal Components
How does Principal Component Analysis (PCA) work?
PCA employs a linear transformation that is based on preserving the
most variance in the data using the least number of dimensions.
Dimensionality reduction is then obtained by retaining only those axes
(dimensions) that account for most of the variance and discarding all the
others.
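The steps above can be tied together in a short NumPy sketch; the random dataset and the choice of keeping two components are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # hypothetical dataset: 100 samples, 5 features

# Step 1: standardize each feature to mean 0 and standard deviation 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: compute the covariance matrix of the standardized features.
cov_matrix = np.cov(X_std, rowvar=False)

# Step 3: compute eigenvalues and eigenvectors and sort by decreasing variance.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Keep the top 2 principal components and project the data onto them.
X_reduced = X_std @ eigenvectors[:, :2]
explained_ratio = eigenvalues[:2] / eigenvalues.sum()

print(X_reduced.shape, explained_ratio)
```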
Advantages of Principal Component Analysis
1. Dimensionality Reduction: Principal Component Analysis is a popular
technique used for dimensionality reduction, which is the process of
reducing the number of variables in a dataset. By reducing the number of
variables, PCA simplifies data analysis, improves performance, and makes
it easier to visualize data.
2. Feature Selection: Principal Component Analysis can be used for feature
selection, which is the process of selecting the most important variables in
a dataset. This is useful in machine learning, where the number of
variables can be very large, and it is difficult to identify the most important
variables.
3. Data Visualization: Principal Component Analysis can be used for data
visualization. By reducing the number of variables, PCA can plot high
dimensional data in two or three dimensions, making it easier to interpret.
4. Multicollinearity: Principal Component Analysis can be used to deal with
multicollinearity, which is a common problem in a regression analysis
where two or more independent variables are highly correlated. PCA can
help identify the underlying structure in the data and create new,
uncorrelated variables that can be used in the regression model.
5. Noise Reduction: Principal Component Analysis can be used to reduce
the noise in data. By removing the principal components with low variance,
which are assumed to represent noise, Principal Component Analysis can
improve the signal-to-noise ratio and make it easier to identify the
underlying structure in the data.
6. Data Compression: Principal Component Analysis can be used for data
compression. By representing the data using a smaller number of principal
components, which capture most of the variation in the data, PCA can
reduce the storage requirements and speed up processing.
7. Outlier Detection: Principal Component Analysis can be used for outlier
detection. Outliers are data points that are significantly different from the
other data points in the dataset. Principal Component Analysis can identify
these outliers by looking for data points that are far from the other points in
the principal component space.
Disadvantages of Principal Component Analysis
1. Interpretation of Principal Components: The principal components
created by Principal Component Analysis are linear combinations of the
original variables, and it is often difficult to interpret them in terms of the
original variables. This can make it difficult to explain the results of PCA to
others.
2. Data Scaling: Principal Component Analysis is sensitive to the scale of the
data. If the data is not properly scaled, then PCA may not work well.
Therefore, it is important to scale the data before applying Principal
Component Analysis.
3. Information Loss: Principal Component Analysis can result in information
loss. While Principal Component Analysis reduces the number of
variables, it can also lead to loss of information. The degree of information
loss depends on the number of principal components selected. Therefore,
it is important to carefully select the number of principal components to
retain.
4. Non-linear Relationships: Principal Component Analysis assumes that
the relationships between variables are linear. However, if there are
non-linear relationships between variables, Principal Component Analysis may
not work well.
5. Computational Complexity: Computing Principal Component Analysis
can be computationally expensive for large datasets. This is especially
true if the number of variables in the dataset is large.
6. Overfitting: Principal Component Analysis can sometimes result in
overfitting, which is when the model fits the training data too well and
performs poorly on new data. This can happen if too many principal
components are used or if the model is trained on a small dataset.
Feature Selection Techniques
There are mainly two types of Feature Selection techniques, which are:
○ Supervised Feature Selection techniques consider the target variable and can
be used for labelled datasets.
○ Unsupervised Feature Selection techniques ignore the target variable and can
be used for unlabelled datasets.
There are mainly three techniques under supervised feature Selection:
1. Wrapper Methods
In the wrapper methodology, feature selection is treated as a search
problem in which different combinations of features are made, evaluated, and compared with
other combinations. The algorithm is trained iteratively using subsets of features.
On the basis of the model's output, features are added or removed, and the model
is trained again with the new feature set.
Some techniques of wrapper methods are:
○ Forward selection - Forward selection is an iterative process, which begins
with an empty set of features. After each iteration, it keeps adding on a
feature and evaluates the performance to check whether it is improving the
performance or not. The process continues until the addition of a new
variable/feature does not improve the performance of the model.
○ Backward elimination - Backward elimination is also an iterative approach,
but it is the opposite of forward selection. This technique begins the process
by considering all the features and removes the least significant feature. This
elimination process continues until removing the features does not improve
the performance of the model.
○ Exhaustive Feature Selection - Exhaustive feature selection is one of the most
thorough feature selection methods, evaluating every feature set by brute force. This
means the method tries every possible combination of features and
returns the best-performing feature set.
○ Recursive Feature Elimination - Recursive feature elimination is a recursive
greedy optimization approach, where features are selected by recursively
considering smaller and smaller subsets of features. An estimator is trained
on each set of features, and the importance of each feature is determined
using the coef_ attribute or the feature_importances_ attribute; a small sketch of this
technique follows the list.
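Below is a minimal recursive feature elimination sketch with scikit-learn, assuming it is installed; the synthetic dataset, the logistic regression estimator, and the choice of keeping four features are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy classification dataset with 10 features, of which 4 are informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

estimator = LogisticRegression(max_iter=1000)
selector = RFE(estimator, n_features_to_select=4)  # recursively drop the weakest features
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 marks the features that were kept
```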
2. Filter Methods
In the filter method, features are selected on the basis of statistical measures. This
method does not depend on the learning algorithm and chooses the features as a
pre-processing step.
The filter method filters out irrelevant features and redundant columns from the
model by ranking them with different metrics.
The advantage of filter methods is that they need little computational time and
do not overfit the data.
Some common techniques of Filter methods are as follows:
○ Information Gain
○ Chi-square Test
○ Fisher's Score
○ Missing Value Ratio
Information Gain: Information gain determines the reduction in entropy while
transforming the dataset. It can be used as a feature selection technique by
calculating the information gain of each variable with respect to the target variable.
Chi-square Test: The chi-square test is a technique for determining the relationship
between categorical variables. The chi-square value is calculated between each
feature and the target variable, and the desired number of features with the best
chi-square values is selected. A small sketch is shown below.
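A quick chi-square filter sketch with scikit-learn follows, assuming it is installed; the Iris dataset is used simply because its features are non-negative, which the chi-square test requires, and keeping two features is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # all features are non-negative, as chi2 requires

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 features with the best chi-square values
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # higher chi-square score means a stronger relationship with the target
print(X_selected.shape)
```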
Fisher's Score:
Fisher's score is one of the popular supervised techniques of feature selection. It
ranks the variables by Fisher's criterion in descending order, and we can then
select the variables with a large Fisher's score.
Missing Value Ratio:
The missing value ratio can be used to evaluate features against a threshold value.
The missing value ratio is obtained by dividing the number of missing values in each
column by the total number of observations. Variables whose ratio exceeds the
threshold value can be dropped.
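The missing value ratio can be computed directly with pandas, as in the small sketch below; the DataFrame and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan, 4],
    "b": [np.nan, np.nan, np.nan, 1],   # 75% missing -> should be dropped
    "c": [5, 6, 7, 8],
})

missing_ratio = df.isna().sum() / len(df)           # missing values per column / total observations
threshold = 0.5
df_reduced = df.loc[:, missing_ratio <= threshold]  # drop columns above the threshold

print(missing_ratio)
print(df_reduced.columns.tolist())
```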
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by
considering the interaction of features while keeping the computational cost low. They
are fast, like the filter method, but more accurate than it.
These methods are also iterative: each training iteration is evaluated, and the most
important features, those that contribute most to that iteration of training, are found.
Some techniques of embedded methods are:
○ Regularization- Regularization adds a penalty term to different parameters of
the machine learning model for avoiding overfitting in the model. This penalty
term is added to the coefficients; hence it shrinks some coefficients to zero.
Those features with zero coefficients can be removed from the dataset. The
types of regularization techniques are L1 Regularization (Lasso
Regularization) or Elastic Nets (L1 and L2 regularization).
○ Random Forest Importance - Different tree-based feature selection methods
provide feature importances, which offer a way of selecting
features. Here, feature importance specifies which feature is more
important in model building or has a greater impact on the target variable.
Random Forest is such a tree-based method; it is a type of bagging
algorithm that aggregates a number of decision trees. It automatically
ranks the nodes by their performance, i.e. the decrease in impurity (Gini
impurity) across all the trees. Nodes are arranged according to the
impurity values, which allows the trees to be pruned below a specific node.
The remaining nodes form a subset of the most important features. A combined
sketch of regularization and random forest importance follows this list.
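The sketch below illustrates both embedded techniques from the list with scikit-learn, assuming it is installed; the synthetic regression and classification datasets and the alpha value are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso

# L1 regularization (Lasso): coefficients of weak features shrink to exactly zero.
X_reg, y_reg = make_regression(n_samples=200, n_features=10, n_informative=3,
                               noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0)
lasso.fit(X_reg, y_reg)
print("Features kept by Lasso:", np.where(lasso.coef_ != 0)[0])

# Random Forest importance: rank features by mean decrease in Gini impurity.
X_clf, y_clf = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_clf, y_clf)
print("Features ranked by importance:", np.argsort(forest.feature_importances_)[::-1])
```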
How to choose a Feature Selection Method?
For machine learning engineers, it is very important to understand which feature
selection method will work properly for their model. The better we know the data types
of the variables, the easier it is to choose the appropriate statistical measure for feature
selection.
To know this, we need to first identify the type of input and output variables. In
machine learning, variables are of mainly two types:
○ Numerical Variables: Variables with continuous values such as integers and floats.
○ Categorical Variables: Variables with categorical values such as Boolean,
ordinal, and nominal values.
Below are some univariate statistical measures, which can be used for filter-based
feature selection:
1. Numerical Input, Numerical Output:
Numerical Input variables are used for predictive regression modelling. The common
method to be used for such a case is the Correlation coefficient.
○ Pearson's correlation coefficient (For linear Correlation).
○ Spearman's rank coefficient (for non-linear correlation).
2. Numerical Input, Categorical Output:
Numerical Input with categorical output is the case for classification predictive
modelling problems. In this case, also, correlation-based techniques should be used,
but with categorical output.
○ ANOVA correlation coefficient (linear).
○ Kendall's rank coefficient (nonlinear).
3. Categorical Input, Numerical Output:
This is the case of regression predictive modelling with categorical input. It is a
different example of a regression problem. We can use the same measures as
discussed in the above case but in reverse order.
4. Categorical Input, Categorical Output:
This is a case of classification predictive modelling with categorical Input variables.
Mean
The arithmetic mean is the same as the average of the provided numbers. It is a
single number that represents all of the other numbers in a group. Let's say we
have a set of numbers and we need to find the mean of that set. To do this, all we
need to do is add the numbers together and divide the result by how many numbers
there are. This gives us the mean of the group of data.
How to find mean
Let's use an example to better grasp this:
In a family, there are two brothers with different heights. The elder
brother is 150 cm tall, while the younger brother is 128 cm tall. Now, their parents are
curious about the brothers' typical height. To find it, they must calculate the
average height of the two brothers by averaging their heights.
= (128+150)/2
= 278/2
= 139 cm
We calculated their average (mean) height by adding their two heights
together and dividing the result by two.
Those two brothers are therefore 139 cm tall on average. As we can see, the average
height is bigger than the younger brother's height and smaller than the elder brother's
height, falling in between the two.
Formula of mean
Mean= Sum of terms/Number of terms
As you can see from the formula, we need to add up all the numbers that are
supplied to us and count how many there are in total. The next step
is to divide the sum of the numbers by that count. By doing this, we
obtain a single number that is known as the mean.
Median
To put it in very simple terms, the median is the number that falls exactly in the
middle of the provided numbers. This number separates the larger half of
the group from the smaller half. It is sometimes referred to as the middle
value of the specified population.
We need to arrange the numbers before determining the median.
If we need to find the median of a set of numbers, we must write
the numbers in either ascending or descending order. When the numbers are
arranged in this fashion, the middle number is the median.
How to find median
Example 1. Find the median of the given data
2, 5, 7, 9, 12 using descending order.
Solution:
First, sort the numbers in descending order:
12, 9, 7, 5, 2
So the median is 7, because it lies in the middle of the numbers.
Formula of median
To apply the formula, we must first determine the total number of observations, n.
If the number of observations is even, we use the following formula:
Median = [(n/2)th term + ((n/2) + 1)th term]/2
When the number of observations is odd, we use the alternative formula:
Median = [(n + 1)/2]th term
Mode
The value that appears most frequently in a dataset is referred to as the mode. In
other words, it is the value that occurs most often in a set of observations.
Along with the mean and the median, the mode can be used to describe the central
tendency of a dataset.
How to find mode
As we saw from the information about mode above, the mode is the number that
occurs the most frequently among the group members.
Example:
From the numbers below, determine this group's mode:
4, 89, 65, 11, 54, 11, 90, 56
Answer: As you are aware, we must identify the number with the highest frequency in
order to determine the group's mode. Such a number is quite simple to locate.
We can immediately notice that the number 11 occurs most frequently in this group
(it appears twice). Therefore, 11 is the group's mode.
Formula of Mode
For moderately skewed distributions, the empirical relationship Mode = 3 Median - 2 Mean can be used to estimate the mode.
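As a quick check of the definitions above, here is a small Python sketch using the standard library's statistics module on the sample data from the mode example.

```python
import statistics

data = [4, 89, 65, 11, 54, 11, 90, 56]  # the sample used in the mode example above

print(statistics.mean(data))    # sum of terms / number of terms = 47.5
print(statistics.median(data))  # average of the two middle values after sorting = 55
print(statistics.mode(data))    # most frequent value = 11
```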