AIML Chapter 4
Basics of Machine
Learning
Prof. Ishita Theba
Computer Engineering Dept.
ADIT
Topics
Preparing to Model:
◦ Basic Types of Data in Machine Learning,
◦ Exploring Structure of Data,
◦ Data Quality and Remediation,
◦ Data Preprocessing
4) Dimensionality:
•High-dimensional Data: In some domains (e.g., image or text), data can have a very high number of features (dimensions), which makes the modeling process more complex and increases the risk of overfitting.
•Low-dimensional Data: In contrast, low-dimensional data has fewer features but may require more careful feature engineering to extract useful patterns.
•Curse of Dimensionality: As the number of features increases, the amount of data required to adequately cover the feature space grows exponentially. Techniques like PCA (Principal Component Analysis) or feature selection are often used to reduce dimensionality.
5) Data Distribution:
•Understanding how data is distributed (uniform, Gaussian, skewed) is key to choosing
appropriate models and preprocessing techniques. For instance:
• Skewed Data: In classification problems, imbalanced classes can lead to biased models.
Techniques like resampling or using balanced accuracy scores can help.
• Normal Distribution: Many models assume data is normally distributed (e.g., linear
regression). For this reason, checking and transforming data (e.g., log transformation) might
be necessary.
• Outliers: Identifying and dealing with outliers is important. Outliers might indicate errors,
rare events, or valuable insights, and how you handle them can significantly impact model
performance.
Exploring Data Structure
1) Initial Exploration
•Data Inspection:
•Head/Tail Inspection: Use commands like .head() or .tail() in
Pandas (for Python) to quickly see the top or bottom rows of a
dataset. This gives you an initial sense of the data.
•Shape: Check the dimensions of your dataset (.shape in Pandas) to
understand how many rows (data points) and columns (features) it
has.
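A minimal pandas sketch of this first inspection (the file name housing.csv is a hypothetical example, not part of the original material):
import pandas as pd

# Load a dataset (hypothetical file name)
df = pd.read_csv("housing.csv")

# Quick look at the first and last rows
print(df.head())   # top 5 rows
print(df.tail(3))  # bottom 3 rows

# Dimensions: (number of rows, number of columns)
print(df.shape)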
2) Descriptive Statistics
• Summary Statistics: Use .describe() for numerical data to get an overview of central
tendencies, spread, and any potential anomalies (like extremely high or low values).
•For categorical data, you might use .value_counts() to see the frequency of each class.
•Visualize with histograms, boxplots, or scatter plots to understand data distributions and
relationships.
•Correlation Matrix: For numerical data, a correlation matrix can show how features are
related to one another. A heatmap is often used to visualize these correlations.
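A short sketch of these summaries, assuming a hypothetical dataset with numerical columns and a categorical column named "city":
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("housing.csv")  # hypothetical dataset

# Summary statistics for the numerical columns
print(df.describe())

# Frequency of each category in a categorical column
print(df["city"].value_counts())

# Correlation matrix of numerical features, visualised as a heatmap
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()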
6) Missing Data
•Missing Value Detection: Identify missing values using .isna() or .isnull() in pandas. Visual
tools like heatmaps can show the presence of missing data across the dataset.
•Handling Missing Data: Decide whether to impute missing values (mean, median, etc.), use
predictive models, or drop rows/columns depending on the extent of missing data.
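A small pandas sketch of detection plus two common remedies (median/mode imputation and dropping), using hypothetical column names:
import pandas as pd

df = pd.read_csv("housing.csv")  # hypothetical dataset

# Count missing values per column
print(df.isna().sum())

# Impute a numerical column with its median and a categorical column with its mode
df["price"] = df["price"].fillna(df["price"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Or drop any rows that still contain missing values
df = df.dropna()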
7) Feature Engineering
• Depending on the data structure, you might need to create new features (e.g.,
extracting the year or month from a date column), perform aggregation (like taking the
average of certain groups), or transform existing features (e.g., log transformation for
skewed data).
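For illustration, a minimal sketch of date-based feature extraction and group-wise aggregation on a small hypothetical sales table:
import pandas as pd

# Hypothetical sales data with a date column
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-02-20"]),
    "store": ["A", "A", "B"],
    "sales": [100.0, 150.0, 90.0],
})

# Extract new features from the date column
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Aggregation: average sales per store, broadcast back to each row
df["avg_store_sales"] = df.groupby("store")["sales"].transform("mean")
print(df)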
Data Quality and Remediation
Data quality and remediation are critical components of building and deploying effective
machine learning (ML) models.
High-quality data ensures that ML models can make accurate predictions, generalizations, and
offer real value.
When the data quality is poor, remediation becomes necessary to improve it before training and
deploying models.
Data Quality in Machine Learning
Data quality refers to the characteristics of the data that influence its usefulness for
analysis and decision-making. Poor data quality can lead to inaccurate or biased
models. Key aspects of data quality include:
1) Accuracy: The data must accurately represent the real-world phenomenon being
measured.
Example: A dataset of weather information that has inaccurate temperature readings
can lead to poor predictions in weather forecasting models.
2) Completeness: The data should have all necessary information.
•Missing data can significantly impact model performance. Handling missing values is a
critical task in preprocessing.
3) Consistency: Data should be consistent across different datasets and formats.
•Example: If some entries use one date format while others use another, consistency
issues can arise.
4) Timeliness: The data should be up-to-date and relevant to the model's purpose.
•Example: In a stock market prediction model, outdated data could lead to poor decision-
making.
5) Relevance: The data should be directly relevant to the problem the model is trying to
solve.
•Irrelevant features (or noise) can negatively impact the model.
6) Validity: The data should adhere to defined rules and constraints (e.g., no negative
values where only positive values are expected).
•Example: A dataset containing age values as negative numbers would be invalid for an
age prediction model.
7) Integrity: The data should have logical integrity, with no contradictory information.
•Example: A dataset of customers should not have both "age" and "birthdate" values that
contradict each other.
Common Data Quality Issues
1) Missing Data: It’s common to have missing values in a dataset. Common
approaches for dealing with missing data include:
• Imputation: Filling in missing values using statistical techniques (mean, median,
mode, etc.).
• Deletion: Removing rows or columns with missing values.
• Prediction models: Using a machine learning model to predict the missing values
based on available data.
2) Outliers: Extreme values in the data that can distort the results.
• Detection: Methods like the Z-score, the IQR (Interquartile Range), and visual tools like box plots (see the sketch after this list).
• Handling: Either removing outliers or transforming them to be more consistent with the rest of the data.
3) Duplicates: Identical or near-identical entries in the dataset can lead to biased
models.
•Identification and Removal: Detect and remove duplicate records, which may be
present due to data entry errors.
4) Inconsistent Data: Data discrepancies arising from different formats or scales.
•Normalization and Standardization: Ensuring features are scaled appropriately and represented consistently.
5) Noise: Unwanted variations in the data, often arising from incorrect measurements.
•Filtering: Smoothing techniques or feature engineering can help reduce the
impact of noise.
6) Label Errors: In supervised learning, incorrect labels can severely degrade model
performance.
• Data Annotation Review: Ensuring correct labeling through manual review or
automated tools.
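A minimal sketch of two of these checks, duplicate removal and IQR-based outlier detection, on a small hypothetical column:
import pandas as pd

df = pd.DataFrame({"salary": [30, 32, 35, 31, 500, 33, 33]})  # hypothetical data with one extreme value

# Duplicates: count and drop identical rows
print("duplicates:", df.duplicated().sum())
df = df.drop_duplicates()

# Outliers: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
print(df[mask])   # rows flagged as outliers
df = df[~mask]    # one possible remedy: remove them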
Data Remediation
Data remediation is the process of identifying and correcting issues in the data to make
it usable for machine learning models. The remediation process usually includes the
following steps:
1.Data Collection and Assessment: Understand the data sources, evaluate data
quality, and identify issues (e.g., missing values, incorrect entries).
2.Data Cleaning:
1. Missing Value Treatment: Apply imputation techniques or remove records with
missing values.
2. Outlier Handling: Identify outliers using statistical methods or domain knowledge
and decide whether to remove or adjust them.
3. Noise Reduction: Apply smoothing or transformation methods to reduce noise.
3) Data Transformation:
•Normalization/Standardization: Scale numerical features to ensure they are
comparable across the model.
•Feature Engineering: Create new features or combine existing ones to improve the
model’s learning process.
•Encoding Categorical Variables: Convert categorical features into numerical form
using techniques like one-hot encoding, label encoding, etc.
4) Data Augmentation: In cases where data is insufficient, generating synthetic data or using
augmentation techniques can improve model robustness.
5) Data Validation: After cleaning and preprocessing, validate the data to ensure it meets the
desired quality standards. This includes checking for duplicates, incorrect values, or structural
issues.
6) Model-Specific Data Remediation:
•Different ML algorithms might have specific requirements. For instance, decision trees can
handle missing values better than neural networks, which may need the data to be fully
imputed or cleaned before training.
7) Monitoring Data Quality Over Time: After the model has been deployed, it’s important to
continue monitoring the quality of incoming data to ensure consistent model performance.
Drift detection techniques can help identify when the data distribution changes.
Best Practices for Maintaining Data
Quality
1.Establish Clear Data Governance: Define data quality standards and ensure
consistent collection practices.
2.Automate Data Cleaning Processes: Use automated pipelines to clean and
preprocess data. This reduces human error and ensures consistency.
3.Regular Audits and Validation: Periodically assess the data to identify and correct
issues.
4.Incorporate Domain Expertise: Collaborate with domain experts to understand the
context of the data and guide remediation strategies.
5.Use Data Quality Tools: Leverage specialized tools for data profiling, cleaning, and
monitoring, such as Talend, Alteryx, or open-source libraries like pandas, NumPy, and
Scikit-learn in Python.
Data Preprocessing
Data preprocessing is a crucial step in machine learning because raw data is often messy,
inconsistent, and may not be in a format that is suitable for modeling. The goal of data
preprocessing is to clean, transform, and organize the data into a format that improves the
accuracy and efficiency of the machine learning model. Below are the key steps involved in data
preprocessing:
Pre-processing Workflow:
1.Import the data.
2.Explore the data to understand its structure, check for missing values, and identify potential
outliers.
3.Clean the data by handling missing values and correcting errors.
4.Transform categorical features (if any) and scale numerical features.
5.Create new features if necessary (e.g., log transformations, interactions).
6.Split the data into training, testing, and validation sets.
7.Train the model using the preprocessed data.
8.Evaluate the model using performance metrics.
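A minimal end-to-end sketch of this workflow using scikit-learn. The file name, the column names, and the choice of logistic regression are illustrative assumptions, not part of the original material:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("customers.csv")          # hypothetical file (steps 1-2: import and explore)
X = df.drop(columns=["churn"])             # hypothetical target column
y = df["churn"]

numeric_cols = ["age", "income"]           # hypothetical feature names
categorical_cols = ["city", "plan"]

# Steps 3-4: clean (impute), encode categorical features, scale numerical features
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Steps 6-8: split, train, evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))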
1. Data Cleaning
• Handling Missing Values: Many datasets have missing or null values. These can be handled in
different ways:
• Remove rows or columns with too many missing values.
• Impute missing values with mean, median, mode, or using more sophisticated techniques like KNN
imputation.
• Predict missing values using models if required.
• Handling Outliers: Outliers can skew the results of some machine learning algorithms. They can be:
• Removed or replaced with more reasonable values.
• Detected using statistical methods (e.g., Z-score or IQR) or visual methods (e.g., box plots).
• Correcting Errors: Errors in the data, such as typos or incorrect labels, should be fixed. These might
come from human input errors, sensors, or poor data collection practices.
2. Data Transformation
• Normalization / Scaling: In some models (like k-NN, SVM, and neural networks), the scale of the features can affect
performance. Features may be rescaled to a common range, typically [0, 1] (Min-Max scaling) or standardized to have
a mean of 0 and a standard deviation of 1 (Z-score normalization).
• Encoding Categorical Data: Machine learning models often require numerical input, so categorical features need to be
encoded:
• One-Hot Encoding: Convert each category into a new binary column (e.g., red -> [1, 0, 0], blue -> [0, 1, 0]).
• Label Encoding: Assign each category a unique integer label (e.g., red -> 0, blue -> 1).
• Target Encoding: Replaces categories with the mean of the target variable for each category.
• Feature Engineering: Creating new features from existing ones can help the model learn better patterns:
• Polynomial features: For nonlinear relationships.
• Decomposition: Techniques like PCA (Principal Component Analysis) for dimensionality reduction.
• Binning: Discretizing continuous variables into categories.
• Log Transformation: For skewed data, applying a log transformation can help normalize the distribution.
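A small sketch of some of these transformations on a toy DataFrame (the column names and values are made up):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

df = pd.DataFrame({"colour": ["red", "blue", "red"], "size": [10.0, 20.0, 30.0]})  # toy data

# One-hot encoding: each category becomes its own binary column
onehot = pd.get_dummies(df["colour"], prefix="colour")

# Label encoding: each category gets a unique integer code
labels = LabelEncoder().fit_transform(df["colour"])

# Min-Max scaling to [0, 1] and Z-score standardization of the numeric column
minmax = MinMaxScaler().fit_transform(df[["size"]])
zscore = StandardScaler().fit_transform(df[["size"]])
print(onehot, labels, minmax.ravel(), zscore.ravel(), sep="\n")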
3. Data Reduction
•Dimensionality Reduction: Techniques like PCA, t-SNE, or autoencoders reduce the number of
features while retaining as much information as possible. This can improve computational
efficiency and sometimes even performance.
•Feature Selection: Identifying and selecting only the most important features, which can be
done through:
• Filter methods (e.g., correlation matrix, Chi-square test).
• Wrapper methods (e.g., Recursive Feature Elimination).
• Embedded methods (e.g., Lasso, Decision Trees).
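An illustrative sketch of both ideas on the built-in Iris dataset, using PCA for dimensionality reduction and a chi-square filter for feature selection:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Dimensionality reduction: keep only 2 principal components
X_pca = PCA(n_components=2).fit_transform(X)
print("PCA shape:", X_pca.shape)

# Filter-based feature selection: keep the 2 features most related to the target
selector = SelectKBest(chi2, k=2).fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))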
4. Splitting the Data
•Training and Testing Data: The dataset is typically split into training and testing sets (commonly
a 70-30 or 80-20 split). Sometimes, a validation set is also created from the training data to tune
hyperparameters.
•Cross-Validation: Cross-validation is used to assess how well a model generalizes to an
independent dataset. It helps ensure the model doesn't overfit or underfit.
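A short sketch of an 80-20 split plus 5-fold cross-validation on the Iris dataset (the decision tree is just a placeholder model):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training set to estimate generalization
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_train, y_train, cv=5)
print("CV accuracy per fold:", scores, "mean:", scores.mean())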
5. Handling Imbalanced Data
•Resampling Techniques:
• Oversampling the minority class (e.g., SMOTE).
• Undersampling the majority class.
•Cost-sensitive Learning: Adjust the algorithm to weigh the minority class more heavily during
training.
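A minimal sketch of cost-sensitive learning on synthetic imbalanced data; SMOTE is only shown as a comment because it lives in the separate imbalanced-learn package:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("class counts:", np.bincount(y))

# Cost-sensitive learning: weigh the minority class more heavily during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Oversampling with SMOTE would require the imbalanced-learn package:
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)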
6. Feature Scaling (For Certain Models)
•Standardization (Z-score): Subtract the mean and divide by the standard deviation.
•Min-Max Scaling: Scale the features to a specific range, often [0, 1].
Training a Model (for Supervised Learning)
Training a model for supervised learning involves the process of teaching the machine learning
algorithm to make predictions or classifications based on labeled data. Supervised learning is
used when the dataset consists of input-output pairs, where the input is the data (features) and
the output is the correct label or value (target).
Training a model for supervised learning involves several steps: selecting a suitable model,
preprocessing the data, training the model, tuning hyperparameters, evaluating performance,
and possibly optimizing the model for better results. Once the model has been trained and
validated, it can be deployed to make predictions on new data, with ongoing monitoring and
maintenance to ensure its continued effectiveness.
A detailed description of the steps to train a model for supervised learning is provided in the sections that follow.
1. Prepare the Dataset
•Input Features (X): These are the variables or data points that the model will use to make
predictions. They can be numerical or categorical data.
•Output Labels (Y): These are the true values or labels associated with each input. In regression
tasks, these are continuous values, and in classification tasks, they are discrete categories
(classes).
•The dataset is typically split into training and testing sets, with the training set used to teach the
model and the testing set to evaluate its performance.
2. Choose a Model
Select an appropriate algorithm for your supervised learning task:
•Classification Models: If the output labels are discrete categories, common algorithms
include:
• Logistic Regression
• Decision Trees
• k-Nearest Neighbors (k-NN)
• Support Vector Machines (SVM)
• Naive Bayes
• Random Forests
• Neural Networks
•Regression Models: If the output labels are continuous values, common algorithms
include:
• Linear Regression
• Decision Trees
• Random Forest Regression
• Support Vector Regression (SVR)
• k-Nearest Neighbors Regression
• Neural Networks
3. Preprocess the Data
Before training the model, it is essential to preprocess the data:
•Handle missing values (e.g., impute or remove them).
•Normalize or scale the features (important for algorithms like k-NN, SVM, and neural networks).
•Encode categorical features (e.g., using one-hot encoding or label encoding).
•Optionally, create new features (feature engineering) to improve the model's performance.
4. Train the Model
•Model Initialization: Instantiate the chosen model with its hyperparameters (e.g., learning rate,
depth of tree, number of neighbors, etc.).
•Fitting the Model: Use the training data (input features and output labels) to "train" the model.
The model will adjust its internal parameters to minimize the error (for regression) or maximize
accuracy (for classification) based on a loss function. This is done through various optimization
algorithms, such as gradient descent or decision tree splitting criteria.
•Learning Process: During training, the model iteratively learns patterns from the input features
and adjusts to reduce discrepancies between its predictions and the actual target values.
5. Tune Hyperparameters
•Hyperparameters are settings that control the training process (e.g., the number of trees in a
random forest or the learning rate in gradient descent). Tuning these hyperparameters can
significantly improve model performance.
•Grid Search: An exhaustive search over a manually specified hyperparameter grid.
•Random Search: Samples hyperparameter combinations at random from specified ranges or distributions, which is often cheaper than an exhaustive grid search.
•Cross-Validation: Using cross-validation techniques (e.g., k-fold cross-validation) helps in
selecting the best set of hyperparameters and evaluating the model's generalization
performance.
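A small grid-search sketch with 5-fold cross-validation; the random forest and the parameter grid are illustrative choices:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Exhaustive search over a small hyperparameter grid, scored by 5-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV score:", search.best_score_)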
6. Evaluate the Model
• Testing the Model: Once the model is trained, use the testing set (data the model has not seen before) to assess its
performance. This provides an estimate of how well the model generalizes to new, unseen data.
• Performance Metrics: The evaluation metric depends on the type of problem:
• For Classification:
• Accuracy
• Precision, Recall, and F1-Score
• Confusion Matrix
• ROC Curve and AUC
• For Regression:
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• Mean Absolute Error (MAE)
• R-squared
If the model performance is not satisfactory, further improvements might include adjusting hyperparameters, changing
the model, or collecting more data.
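A short sketch computing the classification metrics above with scikit-learn on a built-in dataset (the scaling + logistic regression pipeline is just a placeholder classifier):
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a placeholder classifier and predict on the held-out test set
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))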
7. Model Optimization (Optional)
•Ensemble Methods: Combining multiple models to improve performance (e.g., Random Forest,
Gradient Boosting, AdaBoost).
•Regularization: Techniques like L1 (Lasso) or L2 (Ridge) regularization can prevent overfitting by
penalizing large coefficients in linear models.
•Cross-Validation: Used to assess the model on multiple subsets of the training data and ensure
it is not overfitting.
8. Deploy the Model
•Once the model is well-trained and evaluated, it can be deployed for use in
real-world scenarios. This might involve:
•Saving the Model: Save the trained model using libraries like pickle or
joblib in Python.
•Integration: The model is integrated into an application or system to
make predictions on new data.
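A minimal sketch of saving and reloading a trained model with joblib (the file name is arbitrary):
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Save the trained model to disk, then reload it for serving predictions
joblib.dump(model, "model.joblib")
loaded = joblib.load("model.joblib")
print(loaded.predict(X[:3]))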
9. Monitor and Maintain the Model
•Even after deployment, it's essential to monitor the model's performance in the real world. The
model may need to be retrained periodically as new data is collected, or if the underlying data
distribution changes over time (concept drift).
Evaluating Performance of a
Model
Evaluating machine learning models is crucial to determine how well they perform and how
accurately they can make predictions on new, unseen data. The evaluation process helps you
understand whether your model is underfitting, overfitting, or generalizing well. Evaluation
metrics depend on the type of problem you're solving—classification, regression, or other tasks.
Here’s an overview of how to evaluate machine learning models:
1. Classification Model Evaluation Metrics
Also refer to the following link for an understanding of accuracy, recall, precision, and F1-score.
1.1 Accuracy
•Definition: The proportion of all predictions (both positive and negative) that are correct.
• Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Where:
• TP = True Positives
• TN = True Negatives
• FP = False Positives
• FN = False Negatives
Use Case: Accuracy is a simple and commonly used metric but can be misleading when dealing with imbalanced classes (e.g., if one class dominates the dataset).
1.2 Precision
•Definition: The proportion of positive predictions that are actually correct. It is the ability of the model not to label a negative sample as positive.
• Formula: Precision = TP / (TP + FP)
• Use Case: Precision is important when false positives are costly or undesirable (e.g., spam detection, where incorrectly labeling a legitimate email as spam is a mistake).
1.3 Recall (Sensitivity or True Positive Rate)
•Definition: The proportion of actual positive samples that are correctly identified. It measures the model’s ability to capture all the positive instances.
• Formula: Recall = TP / (TP + FN)
•Use Case: Recall is crucial when false negatives are costly or risky (e.g., medical diagnoses, where missing a positive case could have serious consequences).
1.4 F1-Score
•Definition: The harmonic mean of precision and recall. It balances the trade-off between precision and recall.
• Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
•Use Case: The F1-score is especially useful when you need a balance between precision and recall, and the classes are imbalanced.
1.5 Confusion Matrix
•Definition: A table that describes the performance of a classification algorithm by comparing the
predicted and actual values.
• The matrix consists of:
• True Positives (TP): Correctly predicted positive class instances.
• True Negatives (TN): Correctly predicted negative class instances.
• False Positives (FP): Incorrectly predicted positive class instances.
• False Negatives (FN): Incorrectly predicted negative class instances.
1.6 ROC Curve and AUC (Area Under the Curve)
•ROC Curve: A plot of the True Positive Rate (Recall) against the False Positive Rate (FPR) for
different thresholds.
•AUC: The area under the ROC curve; a measure of how well the model distinguishes between
classes.
• AUC ranges from 0 to 1, where 1 represents perfect classification and 0.5 represents random
guessing.
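A small sketch that computes the AUC and plots the ROC curve for a placeholder classifier on a built-in dataset:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The ROC curve needs predicted probabilities of the positive class
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)
print("AUC:", roc_auc_score(y_test, proba))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()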
2. Regression Model Evaluation Metrics
For regression problems (where the output is a continuous value), evaluation metrics typically
include:
2.1 Mean Absolute Error (MAE)
•Definition: The average of the absolute differences between predicted and actual values.
• Formula: MAE = (1/n) Σ |y_i − ŷ_i|
• Use Case: MAE is useful when you want a simple, interpretable error metric. It treats all errors equally.
2.2 Mean Squared Error (MSE)
•Definition: The average of the squared differences between the predicted and actual values. Larger errors are penalized more heavily.
• Formula: MSE = (1/n) Σ (y_i − ŷ_i)²
• Use Case: MSE is sensitive to outliers because the errors are squared, making it suitable when large errors should be penalized more.
2.3 Root Mean Squared Error (RMSE)
•Definition: The square root of the MSE, which brings the error back to the same scale as the original data.
• Formula: RMSE = √MSE = √[(1/n) Σ (y_i − ŷ_i)²]
• Use Case: RMSE provides a more interpretable error metric in the same units as the target variable and penalizes large errors more than MAE.
2.4 R-squared (R²)
•Definition: A statistical measure of the proportion of variance in the target variable that is explained by the model.
• Formula: R² = 1 − Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)²
• Use Case: R² indicates how well the model fits the data. A value of 1 means the model explains all variance, while 0 means it explains none.
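A short sketch of these regression metrics on hypothetical predictions; RMSE is computed as the square root of MSE:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values from some regression model
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)
print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R²:", r2)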
3. Cross-Validation
•Definition: Cross-validation is a technique to assess how well the model generalizes by splitting
the data into multiple subsets (folds) and training/testing the model multiple times.
• k-Fold Cross-Validation: The data is split into k equally sized folds. The model is trained on k-1 folds
and tested on the remaining fold. This is repeated for each fold.
• Stratified k-Fold: Ensures that each fold has the same class distribution (useful for imbalanced
datasets).
4. Overfitting and Underfitting
•Overfitting: The model learns the training data too well, including noise and outliers, which can
result in poor performance on unseen data. High variance is associated with overfitting.
•Underfitting: The model is too simple and fails to capture important patterns in the training
data, leading to poor performance on both the training and test sets. High bias is associated
with underfitting.
5. Model Comparison
•Once multiple models have been trained, it's important to compare their evaluation metrics to
select the best-performing model for your problem. For classification tasks, this could mean
comparing accuracy, precision, recall, or F1-score. For regression, it could mean comparing MAE,
MSE, RMSE, or R².
Summary
Evaluating machine learning models involves using appropriate performance metrics to assess
how well the model generalizes to unseen data. The choice of metric depends on the problem
type (classification vs. regression) and the specific goals of the model. Tools like confusion
matrices, ROC curves, cross-validation, and various error metrics help in understanding model
performance, improving it, and choosing the best model for deployment.
Feature Engineering
Feature engineering is the process of selecting, manipulating, and
transforming raw data into features that can be used in supervised learning.
To make machine learning work well on new tasks, it is often
necessary to design and construct better features.
As you may know, a “feature” is any measurable input that can be used in a
predictive model — it could be the color of an object or the sound of
someone’s voice.
Feature engineering, in simple terms, is the act of converting raw
observations into desired features using statistical or machine
learning approaches.
What is Feature Engineering
Feature engineering is a machine learning technique that leverages data to
create new variables that aren’t in the training set. It can produce new
features for both supervised and unsupervised learning, with the goal
of simplifying and speeding up data transformations while
also enhancing model accuracy. Feature engineering is required when
working with machine learning models. Regardless of the data or
architecture, a terrible feature will have a direct impact on your model.
Example for feature
engineering
Below are the prices of properties in x city. It shows the area of the
house and total price.
Now, this data might have some errors or might be incorrect; not all
sources on the internet are correct. To begin, we’ll add a new column
to display the cost per square foot.
This new feature will help us understand a lot about our data. So, we have a
new column which shows the cost per square foot. There are a few ways you
can find any error. You can use domain knowledge: contact a property
advisor or real estate agent and show them the per-square-foot rate. If the
advisor states that pricing per square foot cannot be less than 3400, you
may have a problem. The data can also be visualised to spot such anomalies.
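A minimal pandas sketch of this example with made-up numbers (the areas and prices are hypothetical; 3400 is the domain rule quoted above):
import pandas as pd

# Hypothetical property data: area in square feet and total price
df = pd.DataFrame({
    "area_sqft": [1000, 1500, 800, 1200],
    "price":     [4_000_000, 5_700_000, 2_400_000, 4_600_000],
})

# New engineered feature: cost per square foot
df["price_per_sqft"] = df["price"] / df["area_sqft"]

# Flag rows that violate the domain rule (per-square-foot rate below 3400)
print(df[df["price_per_sqft"] < 3400])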
1.Imputation
When it comes to preparing your data for machine learning, missing values are
one of the most typical issues. Human errors, data flow interruptions, privacy
concerns, and other factors could all contribute to missing values. Missing values
have an impact on the performance of machine learning models for whatever
cause. The main goal of imputation is to handle these missing values. There are
two types of imputation :
Numerical Imputation: Missing numerical values are filled with a value computed
from the observed data, such as the mean, the median, or a sensible default
(for example, 0). Numerical imputation is commonly used to fill gaps in surveys
or censuses when certain pieces of information, such as income or household
size, are missing for some respondents.
#Filling all missing values with 0
data = data.fillna(0)
Categorical Imputation: When dealing with categorical columns, replacing
missing values with the most frequent value (the mode) of the column is a smart
solution. However, if you believe the values in the column are evenly distributed and
there is no dominating value, imputing a category like “Other” would be a
better choice, as your imputation is more likely to converge to a random
selection in this scenario.
# Fill missing values in a categorical column with its most frequent value
data['column_name'].fillna(data['column_name'].value_counts().idxmax(), inplace=True)
2.Handling Outliers
Outlier handling is a technique for removing outliers from a dataset. This
method can be used on a variety of scales to produce a more accurate data
representation. This has an impact on the model’s performance. Depending
on the model, the effect could be large or minimal; for example, linear
regression is particularly susceptible to outliers. This procedure should be
completed prior to model training. The various methods of handling outliers
include:
1.Removal: Outlier-containing entries are deleted from the distribution.
However, if there are outliers across numerous variables, this strategy may
result in a large chunk of the dataset being lost.
2.Replacing values: Alternatively, the outliers could be handled as missing
values and replaced with suitable imputation.
3.Capping: Using an arbitrary value or a value from a variable distribution to
replace the maximum and minimum values.
Discretization : Discretization is the process of converting continuous
variables, models, and functions into discrete ones. This is accomplished by
constructing a series of continuous intervals (or bins) that span the range of
our desired variable/model/function.
3.Log Transform
The log transform is one of the most commonly used techniques among data scientists.
It is mostly used to turn a skewed distribution into a normal or less-skewed distribution.
We take the log of the values in a column and use those values as the
column in this transform. It also dampens the influence of extreme values,
bringing the data closer to a normal distribution.
# Log transform example
df['log_price'] = np.log(df['Price'])
4.One-hot encoding
A one-hot encoding is a type of encoding in which an element of a finite set of
n categories is represented by a binary vector of length n, where only the
position corresponding to that element is set to “1” and all other positions
are set to “0”. In contrast to binary encoding schemes, where each bit can
represent 2 values (i.e. 0 and 1) and categories share bits, this scheme
assigns a dedicated bit position to each possible category.
5.Scaling
Feature scaling is one of the most pervasive and difficult problems in
machine learning, yet it’s one of the most important things to get right. In
order to train a predictive model, we need data with a known set of features
that may need to be scaled up or down as appropriate. This section explains
how feature scaling works, why it is important, and some tips for
getting started with feature scaling.
After a scaling operation, the continuous features become similar in terms of
range. Although this step isn’t required for many algorithms, it’s still a good
idea to do so. Distance-based algorithms like k-NN and k-Means, on the other
hand, require scaled continuous features as model input. There are two
common ways for scaling :
Normalization: All values are scaled into a specified range between 0 and 1
via normalization (or min-max normalization). This modification does not change
the shape of the feature’s distribution; however, it is sensitive to outliers,
because a single extreme value compresses the range of all the other values. As a
result, it is advised that outliers be dealt with prior to normalization.
Standardization: Standardization (also known as z-score normalization) is
the process of scaling values while accounting for the standard deviation. If the
standard deviation of the features differs, the range of those features will likewise
differ, and standardization reduces the effect of outliers in the features. To
arrive at a distribution with a mean of 0 and a variance of 1, each data point has
the mean subtracted from it and the result is divided by the standard deviation
of the distribution.
Few Best Feature Engineering Tools
There are many tools which will help you in automating the entire feature
engineering process and producing a large pool of features in a short period
of time for both classification and regression tasks.
FeatureTools
Featuretools is a framework to perform automated feature engineering. It
excels at transforming temporal and relational datasets into feature matrices
for machine learning. Featuretools integrates with the machine learning
pipeline-building tools you already have. In a fraction of the time it would
take to do it manually, you can load in pandas dataframes and automatically
construct significant features.
FeatureTools Summary
•Easy to get started, good documentation and community support
•It helps you construct meaningful features for machine learning and predictive
modelling by combining your raw data with what you know about your data.
•It provides APIs to verify that only legitimate data is utilised for calculations,
preventing label leakage in your feature vectors.
•Featuretools includes a low-level function library that may be layered to
generate features.
•Its AutoML library(EvalML) helps you build, optimize, and evaluate machine
learning pipelines.
•Good at handling relational databases.
2) AutoFeat
AutoFeat helps to perform Linear Prediction Models with Automated Feature
Engineering and Selection. AutoFeat allows you to select the units of the input
variables in order to avoid the construction of physically nonsensical features.
AutoFeat Summary
•AutoFeat can easily handle categorical features with one-hot encoding.
•The AutoFeatRegressor and AutoFeatClassifier models in this package have an
interface similar to scikit-learn models.
•It is a general-purpose automated feature engineering library, but it is not good
at handling relational data.
•It is useful for logistical data.
TsFresh
tsfresh is a python package. It calculates a huge number of time series
characteristics, or features, automatically. In addition, the package includes
methods for assessing the explanatory power and significance of such traits
in regression and classification tasks.
TsFresh Summary
•It is one of the best open-source Python tools available for time series
classification and regression.
•It helps to extract features such as the number of peaks, average value,
maximum value, time reversal symmetry statistic, etc.
•It can be integrated with FeatureTools.
OneBM
OneBM interacts directly with a database’s raw tables. It slowly joins the
tables, taking different paths on the relational tree. It recognises simple data
types (numerical or categorical) and complicated data types (set of numbers,
set of categories, sequences, time series, and texts) in the joint results and
applies pre-defined feature engineering approaches to the supplied types.
•Both relational and non-relational data are supported.
•When compared to FeatureTools, it generates both simple and complicated
features.
•It was put to the test in Kaggle competitions, and it outperformed state-of-
the-art models.
ExploreKit
Based on the idea that extremely informative features are typically the
consequence of manipulating basic ones, ExploreKit identifies common
operators to alter each feature independently or combine multiple of them.
Instead of running feature selection on all developed features, which can be
quite huge, meta learning is used to rank candidate features.
Comparison of Feature
Engineering Tools
Conclusion of feature
engineering
Feature engineering is the development of new data features from raw data.
With this technique, engineers analyze the raw data and potential
information in order to extract a new or more valuable set of features.
Feature engineering can be seen as a generalization of mathematical
optimization that allows for better analysis.
Different Types of Plots in
Machine Learning
In machine learning, various plots are used to visualize data, model performance, and the relationships
between features. Here are some commonly used plots:
1. Scatter Plot
• Purpose: To show the relationship between two continuous variables.
• When to use: If you're trying to understand how two variables correlate with each other.
• Example: Plotting "Age" vs. "Salary" to check how age correlates with salary.
2. Line Plot
• Purpose: Used to display data trends over a continuous variable (often time).
• When to use: To visualize changes over time or continuous variables.
• Example: Plotting training loss vs. number of epochs to check if the model is improving over time.
3. Histogram
•Purpose: Shows the distribution of a single variable.
•When to use: To understand how a feature is distributed or the frequency of different ranges of
values.
•Example: Plotting the distribution of ages or salaries.
4. Box Plot (Box-and-Whisker Plot)
•Purpose: To visualize the distribution of data and identify outliers.
•When to use: When you want to show the median, quartiles, and detect outliers.
•Example: Comparing the salary distribution across different departments.
5. Pair Plot (or Scatterplot Matrix)
•Purpose: Displays pairwise relationships between multiple continuous variables.
•When to use: To see correlations or patterns between multiple features in a dataset.
•Example: Pair plotting all features of the Iris dataset to see relationships.
6. Heatmap
•Purpose: Shows the correlation matrix between variables or the value of a matrix.
•When to use: To quickly identify patterns, correlations, or cluster similarity.
•Example: Showing the correlation between different features in a dataset.
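For illustration, a small matplotlib/seaborn sketch producing a scatter plot, histogram, box plot, and correlation heatmap from hypothetical age/salary/department data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: age, salary, and department
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(22, 60, 200),
    "salary": rng.normal(50_000, 12_000, 200),
    "dept": rng.choice(["IT", "HR", "Sales"], 200),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].scatter(df["age"], df["salary"])                            # scatter: age vs. salary
axes[0, 1].hist(df["salary"], bins=20)                                 # histogram: salary distribution
sns.boxplot(data=df, x="dept", y="salary", ax=axes[1, 0])              # box plot per department
sns.heatmap(df[["age", "salary"]].corr(), annot=True, ax=axes[1, 1])   # correlation heatmap
plt.tight_layout()
plt.show()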
7. ROC Curve (Receiver Operating Characteristic Curve)
•Purpose: Used for evaluating the performance of a binary classification model.
•When to use: To assess a classifier's ability to distinguish between classes.
•Example: Plotting the true positive rate vs. the false positive rate for different thresholds.
8. Precision-Recall Curve
•Purpose: Visualizes the trade-off between precision and recall for different thresholds.
•When to use: When dealing with imbalanced classes and you want to evaluate model
performance based on recall and precision.
•Example: A classifier predicting rare events, like fraud detection.
9. Confusion Matrix
•Purpose: A table used to evaluate the performance of a classification model.
•When to use: To show the true positives, false positives, true negatives, and false negatives.
•Example: For a binary classifier, showing how many correct and incorrect predictions were
made.
10. Learning Curve
•Purpose: Plots training and validation error against training size or epochs.
•When to use: To detect overfitting or underfitting in your model by showing performance trends
as more data or epochs are added.
•Example: Showing how error decreases with more training data or epochs in a neural network.
11. Feature Importance Plot
• Purpose: Visualizes the importance of each feature in a model's decision-making process.
• When to use: To understand which features contribute most to the prediction in models like
Random Forest or XGBoost.
• Example: Plotting feature importances from a trained decision tree model.
12. Calibration Curve
• Purpose: Compares predicted probabilities with actual outcomes.
• When to use: When you need to evaluate how well the model’s predicted probabilities align with
the true probabilities.
• Example: Checking if predicted probabilities from a classifier match the true likelihood of the event.
13. t-SNE Plot (t-Distributed Stochastic Neighbor Embedding)
•Purpose: A dimensionality reduction technique used to visualize high-dimensional data in 2D or
3D.
•When to use: To visualize high-dimensional data, like word embeddings or neural network
activations.
•Example: Visualizing clusters in a high-dimensional feature space, such as images or text data.
14. PCA Plot (Principal Component Analysis)
•Purpose: Reduces dimensionality while preserving variance in the data.
•When to use: To visualize high-dimensional data in a lower-dimensional space.
•Example: Plotting the first two principal components of a dataset to explore clusters or trends.
15. Decision Boundary Plot
•Purpose: Displays the decision boundary of a classification model.
•When to use: To visualize how well a model separates different classes in 2D space.
•Example: For a binary classifier, showing how the model divides the feature space between the
two classes.