AIML Chapter 4
Basics of Machine
Learning
Prof. Ishita Theba
Computer Engineering Dept.
ADIT
Topics
Preparing to Model:
◦ Basic Types of Data in Machine Learning,
◦ Exploring Structure of Data,
◦ Data Quality and Remediation,
◦ Data Preprocessing
4) Dimensionality:
•High-dimensional Data: In some domains (e.g., image or text), data can have a very high number of features (dimensions), which makes the modeling process more complex and increases the risk of overfitting.
•Low-dimensional Data: In contrast, low-dimensional data has fewer features but may require more careful feature engineering to extract useful patterns.
•Curse of Dimensionality: As the number of features increases, the amount of data required to adequately cover the feature space grows exponentially. Techniques like PCA (Principal Component Analysis) or feature selection are often used to reduce dimensionality.
5) Data Distribution:
•Understanding how data is distributed (uniform, Gaussian, skewed) is key to choosing
appropriate models and preprocessing techniques. For instance:
• Skewed Data: In classification problems, imbalanced classes can lead to biased models.
Techniques like resampling or using balanced accuracy scores can help.
• Normal Distribution: Many models assume data is normally distributed (e.g., linear
regression). For this reason, checking and transforming data (e.g., log transformation) might
be necessary.
• Outliers: Identifying and dealing with outliers is important. Outliers might indicate errors,
rare events, or valuable insights, and how you handle them can significantly impact model
performance.
Exploring Data Structure
1) Initial Exploration
•Data Inspection:
•Head/Tail Inspection: Use commands like .head() or .tail() in
Pandas (for Python) to quickly see the top or bottom rows of a
dataset. This gives you an initial sense of the data.
•Shape: Check the dimensions of your dataset (.shape in Pandas) to
understand how many rows (data points) and columns (features) it
has.
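A minimal pandas sketch of this first inspection (the file name housing.csv is a hypothetical example, not part of the original material):
import pandas as pd

# Load a dataset (hypothetical file name)
df = pd.read_csv("housing.csv")

# Quick look at the first and last rows
print(df.head())   # top 5 rows
print(df.tail(3))  # bottom 3 rows

# Dimensions: (number of rows, number of columns)
print(df.shape)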
2) Descriptive Statistics
• Summary Statistics: Use .describe() for numerical data to get an overview of central
tendencies, spread, and any potential anomalies (like extremely high or low values).
•For categorical data, you might use .value_counts() to see the frequency of each class.
•Visualize with histograms, boxplots, or scatter plots to understand data distributions and
relationships.
•Correlation Matrix: For numerical data, a correlation matrix can show how features are
related to one another. A heatmap is often used to visualize these correlations.
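A short sketch of these summaries, assuming a hypothetical dataset with numerical columns and a categorical column named "city":
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("housing.csv")  # hypothetical dataset

# Summary statistics for the numerical columns
print(df.describe())

# Frequency of each category in a categorical column
print(df["city"].value_counts())

# Correlation matrix of numerical features, visualised as a heatmap
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()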
6) Missing Data
•Missing Value Detection: Identify missing values using .isna() or .isnull() in pandas. Visual
tools like heatmaps can show the presence of missing data across the dataset.
•Handling Missing Data: Decide whether to impute missing values (mean, median, etc.), use
predictive models, or drop rows/columns depending on the extent of missing data.
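A small pandas sketch of detection plus two common remedies (median/mode imputation and dropping), using hypothetical column names:
import pandas as pd

df = pd.read_csv("housing.csv")  # hypothetical dataset

# Count missing values per column
print(df.isna().sum())

# Impute a numerical column with its median and a categorical column with its mode
df["price"] = df["price"].fillna(df["price"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Or drop any rows that still contain missing values
df = df.dropna()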
7) Feature Engineering
• Depending on the data structure, you might need to create new features (e.g.,
extracting the year or month from a date column), perform aggregation (like taking the
average of certain groups), or transform existing features (e.g., log transformation for
skewed data).
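For illustration, a minimal sketch of date-based feature extraction and group-wise aggregation on a small hypothetical sales table:
import pandas as pd

# Hypothetical sales data with a date column
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-02-20"]),
    "store": ["A", "A", "B"],
    "sales": [100.0, 150.0, 90.0],
})

# Extract new features from the date column
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

# Aggregation: average sales per store, broadcast back to each row
df["avg_store_sales"] = df.groupby("store")["sales"].transform("mean")
print(df)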
Data Quality and Remediation
Data quality and remediation are critical components of building and deploying effective
machine learning (ML) models.
High-quality data ensures that ML models can make accurate predictions, generalizations, and
offer real value.
When the data quality is poor, remediation becomes necessary to improve it before training and
deploying models.
Data Quality in Machine Learning
Data quality refers to the characteristics of the data that influence its usefulness for
analysis and decision-making. Poor data quality can lead to inaccurate or biased
models. Key aspects of data quality include:
1) Accuracy: The data must accurately represent the real-world phenomenon being
measured.
Example: A dataset of weather information that has inaccurate temperature readings
can lead to poor predictions in weather forecasting models.
2) Completeness: The data should have all necessary information.
•Missing data can significantly impact model performance. Handling missing values is a
critical task in preprocessing.
3) Consistency: Data should be consistent across different datasets and formats.
•Example: If some entries use one date format while others use another, consistency
issues can arise.
4) Timeliness: The data should be up-to-date and relevant to the model's purpose.
•Example: In a stock market prediction model, outdated data could lead to poor decision-
making.
5) Relevance: The data should be directly relevant to the problem the model is trying to
solve.
•Irrelevant features (or noise) can negatively impact the model.
6) Validity: The data should adhere to defined rules and constraints (e.g., no negative
values where only positive values are expected).
•Example: A dataset containing age values as negative numbers would be invalid for an
age prediction model.
7) Integrity: The data should have logical integrity, with no contradictory information.
•Example: A dataset of customers should not have both "age" and "birthdate" values that
contradict each other.
Common Data Quality Issues
1) Missing Data: It’s common to have missing values in a dataset. Common
approaches for dealing with missing data include:
• Imputation: Filling in missing values using statistical techniques (mean, median,
mode, etc.).
• Deletion: Removing rows or columns with missing values.
• Prediction models: Using a machine learning model to predict the missing values
based on available data.
2) Outliers: Extreme values in the data that can distort the results.
• Detection: Methods like the Z-score, the IQR (Interquartile Range), and visual tools like box plots (see the sketch after this list).
• Handling: Either removing outliers or transforming them to be more consistent with the rest of the data.
3) Duplicates: Identical or near-identical entries in the dataset can lead to biased
models.
•Identification and Removal: Detect and remove duplicate records, which may be
present due to data entry errors.
4) Inconsistent Data: Data discrepancies arising from different formats or scales.
•Normalization and Standardization: Ensuring features are scaled appropriately and represented consistently.
5) Noise: Unwanted variations in the data, often arising from incorrect measurements.
•Filtering: Smoothing techniques or feature engineering can help reduce the
impact of noise.
6) Label Errors: In supervised learning, incorrect labels can severely degrade model
performance.
• Data Annotation Review: Ensuring correct labeling through manual review or
automated tools.
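A minimal sketch of two of these checks, duplicate removal and IQR-based outlier detection, on a small hypothetical column:
import pandas as pd

df = pd.DataFrame({"salary": [30, 32, 35, 31, 500, 33, 33]})  # hypothetical data with one extreme value

# Duplicates: count and drop identical rows
print("duplicates:", df.duplicated().sum())
df = df.drop_duplicates()

# Outliers: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
print(df[mask])   # rows flagged as outliers
df = df[~mask]    # one possible remedy: remove them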
Data Remediation
Data remediation is the process of identifying and correcting issues in the data to make
it usable for machine learning models. The remediation process usually includes the
following steps:
1.Data Collection and Assessment: Understand the data sources, evaluate data
quality, and identify issues (e.g., missing values, incorrect entries).
2.Data Cleaning:
1. Missing Value Treatment: Apply imputation techniques or remove records with
missing values.
2. Outlier Handling: Identify outliers using statistical methods or domain knowledge
and decide whether to remove or adjust them.
3. Noise Reduction: Apply smoothing or transformation methods to reduce noise.
3) Data Transformation:
•Normalization/Standardization: Scale numerical features to ensure they are
comparable across the model.
•Feature Engineering: Create new features or combine existing ones to improve the
model’s learning process.
•Encoding Categorical Variables: Convert categorical features into numerical form
using techniques like one-hot encoding, label encoding, etc.
4) Data Augmentation: In cases where data is insufficient, generating synthetic data or using
augmentation techniques can improve model robustness.
5) Data Validation: After cleaning and preprocessing, validate the data to ensure it meets the
desired quality standards. This includes checking for duplicates, incorrect values, or structural
issues.
6) Model-Specific Data Remediation:
•Different ML algorithms might have specific requirements. For instance, decision trees can
handle missing values better than neural networks, which may need the data to be fully
imputed or cleaned before training.
7) Monitoring Data Quality Over Time: After the model has been deployed, it’s important to
continue monitoring the quality of incoming data to ensure consistent model performance.
Drift detection techniques can help identify when the data distribution changes.
Best Practices for Maintaining Data
Quality
1.Establish Clear Data Governance: Define data quality standards and ensure
consistent collection practices.
2.Automate Data Cleaning Processes: Use automated pipelines to clean and
preprocess data. This reduces human error and ensures consistency.
3.Regular Audits and Validation: Periodically assess the data to identify and correct
issues.
4.Incorporate Domain Expertise: Collaborate with domain experts to understand the
context of the data and guide remediation strategies.
5.Use Data Quality Tools: Leverage specialized tools for data profiling, cleaning, and
monitoring, such as Talend, Alteryx, or open-source libraries like pandas, NumPy, and
Scikit-learn in Python.
Data Preprocessing
Data preprocessing is a crucial step in machine learning because raw data is often messy,
inconsistent, and may not be in a format that is suitable for modeling. The goal of data
preprocessing is to clean, transform, and organize the data into a format that improves the
accuracy and efficiency of the machine learning model. Below are the key steps involved in data
preprocessing:
Pre-processing Workflow:
1.Import the data.
2.Explore the data to understand its structure, check for missing values, and identify potential
outliers.
3.Clean the data by handling missing values and correcting errors.
4.Transform categorical features (if any) and scale numerical features.
5.Create new features if necessary (e.g., log transformations, interactions).
6.Split the data into training, testing, and validation sets.
7.Train the model using the preprocessed data.
8.Evaluate the model using performance metrics.
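A minimal end-to-end sketch of this workflow using scikit-learn. The file name, the column names, and the choice of logistic regression are illustrative assumptions, not part of the original material:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("customers.csv")          # hypothetical file (steps 1-2: import and explore)
X = df.drop(columns=["churn"])             # hypothetical target column
y = df["churn"]

numeric_cols = ["age", "income"]           # hypothetical feature names
categorical_cols = ["city", "plan"]

# Steps 3-4: clean (impute), encode categorical features, scale numerical features
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Steps 6-8: split, train, evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))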
1. Data Cleaning
• Handling Missing Values: Many datasets have missing or null values. These can be handled in
different ways:
• Remove rows or columns with too many missing values.
• Impute missing values with mean, median, mode, or using more sophisticated techniques like KNN
imputation.
• Predict missing values using models if required.
• Handling Outliers: Outliers can skew the results of some machine learning algorithms. They can be:
• Removed or replaced with more reasonable values.
• Detected using statistical methods (e.g., Z-score or IQR) or visual methods (e.g., box plots).
• Correcting Errors: Errors in the data, such as typos or incorrect labels, should be fixed. These might
come from human input errors, sensors, or poor data collection practices.
2. Data Transformation
• Normalization / Scaling: In some models (like k-NN, SVM, and neural networks), the scale of the features can affect
performance. Features may be rescaled to a common range, typically [0, 1] (Min-Max scaling) or standardized to have
a mean of 0 and a standard deviation of 1 (Z-score normalization).
• Encoding Categorical Data: Machine learning models often require numerical input, so categorical features need to be
encoded:
• One-Hot Encoding: Convert each category into a new binary column (e.g., red -> [1, 0, 0], blue -> [0, 1, 0]).
• Label Encoding: Assign each category a unique integer label (e.g., red -> 0, blue -> 1).
• Target Encoding: Replaces categories with the mean of the target variable for each category.
• Feature Engineering: Creating new features from existing ones can help the model learn better patterns:
• Polynomial features: For nonlinear relationships.
• Decomposition: Techniques like PCA (Principal Component Analysis) for dimensionality reduction.
• Binning: Discretizing continuous variables into categories.
• Log Transformation: For skewed data, applying a log transformation can help normalize the distribution.
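A small sketch of some of these transformations on a toy DataFrame (the column names and values are made up):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

df = pd.DataFrame({"colour": ["red", "blue", "red"], "size": [10.0, 20.0, 30.0]})  # toy data

# One-hot encoding: each category becomes its own binary column
onehot = pd.get_dummies(df["colour"], prefix="colour")

# Label encoding: each category gets a unique integer code
labels = LabelEncoder().fit_transform(df["colour"])

# Min-Max scaling to [0, 1] and Z-score standardization of the numeric column
minmax = MinMaxScaler().fit_transform(df[["size"]])
zscore = StandardScaler().fit_transform(df[["size"]])
print(onehot, labels, minmax.ravel(), zscore.ravel(), sep="\n")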
3. Data Reduction
•Dimensionality Reduction: Techniques like PCA, t-SNE, or autoencoders reduce the number of
features while retaining as much information as possible. This can improve computational
efficiency and sometimes even performance.
•Feature Selection: Identifying and selecting only the most important features, which can be
done through:
• Filter methods (e.g., correlation matrix, Chi-square test).
• Wrapper methods (e.g., Recursive Feature Elimination).
• Embedded methods (e.g., Lasso, Decision Trees).
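An illustrative sketch of both ideas on the built-in Iris dataset, using PCA for dimensionality reduction and a chi-square filter for feature selection:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Dimensionality reduction: keep only 2 principal components
X_pca = PCA(n_components=2).fit_transform(X)
print("PCA shape:", X_pca.shape)

# Filter-based feature selection: keep the 2 features most related to the target
selector = SelectKBest(chi2, k=2).fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))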
4. Splitting the Data
•Training and Testing Data: The dataset is typically split into training and testing sets (commonly
a 70-30 or 80-20 split). Sometimes, a validation set is also created from the training data to tune
hyperparameters.
•Cross-Validation: Cross-validation is used to assess how well a model generalizes to an
independent dataset. It helps ensure the model doesn't overfit or underfit.
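A short sketch of an 80-20 split plus 5-fold cross-validation on the Iris dataset (the decision tree is just a placeholder model):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training set to estimate generalization
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_train, y_train, cv=5)
print("CV accuracy per fold:", scores, "mean:", scores.mean())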
5. Handling Imbalanced Data
•Resampling Techniques:
• Oversampling the minority class (e.g., SMOTE).
• Undersampling the majority class.
•Cost-sensitive Learning: Adjust the algorithm to weigh the minority class more heavily during
training.
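A minimal sketch of cost-sensitive learning on synthetic imbalanced data; SMOTE is only shown as a comment because it lives in the separate imbalanced-learn package:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("class counts:", np.bincount(y))

# Cost-sensitive learning: weigh the minority class more heavily during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Oversampling with SMOTE would require the imbalanced-learn package:
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)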
6. Feature Scaling (For Certain Models)
•Standardization (Z-score): Subtract the mean and divide by the standard deviation.
•Min-Max Scaling: Scale the features to a specific range, often [0, 1].
Training a Model (for Supervised Learning)
Training a model for supervised learning involves the process of teaching the machine learning
algorithm to make predictions or classifications based on labeled data. Supervised learning is
used when the dataset consists of input-output pairs, where the input is the data (features) and
the output is the correct label or value (target).
Training a model for supervised learning involves several steps: selecting a suitable model,
preprocessing the data, training the model, tuning hyperparameters, evaluating performance,
and possibly optimizing the model for better results. Once the model has been trained and
validated, it can be deployed to make predictions on new data, with ongoing monitoring and
maintenance to ensure its continued effectiveness.
A detailed description of the steps to train a model for supervised learning is provided in the sections that follow.
1. Prepare the Dataset
•Input Features (X): These are the variables or data points that the model will use to make
predictions. They can be numerical or categorical data.
•Output Labels (Y): These are the true values or labels associated with each input. In regression
tasks, these are continuous values, and in classification tasks, they are discrete categories
(classes).
•The dataset is typically split into training and testing sets, with the training set used to teach the
model and the testing set to evaluate its performance.
2. Choose a Model
Select an appropriate algorithm for your supervised learning task:
•Classification Models: If the output labels are discrete categories, common algorithms
include:
• Logistic Regression
• Decision Trees
• k-Nearest Neighbors (k-NN)
• Support Vector Machines (SVM)
• Naive Bayes
• Random Forests
• Neural Networks
•Regression Models: If the output labels are continuous values, common algorithms
include:
• Linear Regression
• Decision Trees
• Random Forest Regression
• Support Vector Regression (SVR)
• k-Nearest Neighbors Regression
• Neural Networks
3. Preprocess the Data
Before training the model, it is essential to preprocess the data:
•Handle missing values (e.g., impute or remove them).
•Normalize or scale the features (important for algorithms like k-NN, SVM, and neural networks).
•Encode categorical features (e.g., using one-hot encoding or label encoding).
•Optionally, create new features (feature engineering) to improve the model's performance.
4. Train the Model
•Model Initialization: Instantiate the chosen model with its hyperparameters (e.g., learning rate,
depth of tree, number of neighbors, etc.).
•Fitting the Model: Use the training data (input features and output labels) to "train" the model.
The model will adjust its internal parameters to minimize the error (for regression) or maximize
accuracy (for classification) based on a loss function. This is done through various optimization
algorithms, such as gradient descent or decision tree splitting criteria.
•Learning Process: During training, the model iteratively learns patterns from the input features
and adjusts to reduce discrepancies between its predictions and the actual target values.
5. Tune Hyperparameters
•Hyperparameters are settings that control the training process (e.g., the number of trees in a
random forest or the learning rate in gradient descent). Tuning these hyperparameters can
significantly improve model performance.
•Grid Search: An exhaustive search over a manually specified hyperparameter grid.
•Random Search: Samples hyperparameter combinations at random from specified ranges or distributions, which is often cheaper than an exhaustive grid search.
•Cross-Validation: Using cross-validation techniques (e.g., k-fold cross-validation) helps in
selecting the best set of hyperparameters and evaluating the model's generalization
performance.
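A small grid-search sketch with 5-fold cross-validation; the random forest and the parameter grid are illustrative choices:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Exhaustive search over a small hyperparameter grid, scored by 5-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV score:", search.best_score_)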
6. Evaluate the Model
• Testing the Model: Once the model is trained, use the testing set (data the model has not seen before) to assess its
performance. This provides an estimate of how well the model generalizes to new, unseen data.
• Performance Metrics: The evaluation metric depends on the type of problem:
• For Classification:
• Accuracy
• Precision, Recall, and F1-Score
• Confusion Matrix
• ROC Curve and AUC
• For Regression:
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• Mean Absolute Error (MAE)
• R-squared
If the model performance is not satisfactory, further improvements might include adjusting hyperparameters, changing
the model, or collecting more data.
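A short sketch computing the classification metrics above with scikit-learn on a built-in dataset (the scaling + logistic regression pipeline is just a placeholder classifier):
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a placeholder classifier and predict on the held-out test set
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))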
7. Model Optimization (Optional)
•Ensemble Methods: Combining multiple models to improve performance (e.g., Random Forest,
Gradient Boosting, AdaBoost).
•Regularization: Techniques like L1 (Lasso) or L2 (Ridge) regularization can prevent overfitting by
penalizing large coefficients in linear models.
•Cross-Validation: Used to assess the model on multiple subsets of the training data and ensure
it is not overfitting.
8. Deploy the Model
•Once the model is well-trained and evaluated, it can be deployed for use in
real-world scenarios. This might involve:
•Saving the Model: Save the trained model using libraries like pickle or
joblib in Python.
•Integration: The model is integrated into an application or system to
make predictions on new data.
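A minimal sketch of saving and reloading a trained model with joblib (the file name is arbitrary):
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Save the trained model to disk, then reload it for serving predictions
joblib.dump(model, "model.joblib")
loaded = joblib.load("model.joblib")
print(loaded.predict(X[:3]))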
9. Monitor and Maintain the Model
•Even after deployment, it's essential to monitor the model's performance in the real world. The
model may need to be retrained periodically as new data is collected, or if the underlying data
distribution changes over time (concept drift).
Evaluating Performance of a
Model
Evaluating machine learning models is crucial to determine how well they perform and how
accurately they can make predictions on new, unseen data. The evaluation process helps you
understand whether your model is underfitting, overfitting, or generalizing well. Evaluation
metrics depend on the type of problem you're solving—classification, regression, or other tasks.
Here’s an overview of how to evaluate machine learning models:
1. Classification Model Evaluation Metrics
Also refer to the following link for an understanding of accuracy, recall, precision, and F1-score.
1.1 Accuracy
•Definition: The proportion of all predictions (both positive and negative) that are correct.
• Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Where:
• TP = True Positives
• TN = True Negatives
• FP = False Positives
• FN = False Negatives
Use Case: Accuracy is a simple and commonly used metric but can be misleading when dealing with imbalanced classes (e.g., if one class dominates the dataset).
1.2 Precision
•Definition: The proportion of positive predictions that are actually correct. It is the ability of the model not to label a negative sample as positive.
• Formula: Precision = TP / (TP + FP)
• Use Case: Precision is important when false positives are costly or undesirable (e.g., spam detection, where incorrectly labeling a legitimate email as spam is a mistake).
1.3 Recall (Sensitivity or True Positive Rate)
•Definition: The proportion of actual positive samples that are correctly identified. It measures the model’s ability to capture all the positive instances.
• Formula: Recall = TP / (TP + FN)
•Use Case: Recall is crucial when false negatives are costly or risky (e.g., medical diagnoses, where missing a positive case could have serious consequences).
1.4 F1-Score
•Definition: The harmonic mean of precision and recall. It balances the trade-off between precision and recall.
• Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
•Use Case: The F1-score is especially useful when you need a balance between precision and recall, and the classes are imbalanced.
1.5 Confusion Matrix
•Definition: A table that describes the performance of a classification algorithm by comparing the
predicted and actual values.
• The matrix consists of:
• True Positives (TP): Correctly predicted positive class instances.
• True Negatives (TN): Correctly predicted negative class instances.
• False Positives (FP): Incorrectly predicted positive class instances.
• False Negatives (FN): Incorrectly predicted negative class instances.
1.6 ROC Curve and AUC (Area Under the Curve)
•ROC Curve: A plot of the True Positive Rate (Recall) against the False Positive Rate (FPR) for
different thresholds.
•AUC: The area under the ROC curve; a measure of how well the model distinguishes between
classes.
• AUC ranges from 0 to 1, where 1 represents perfect classification and 0.5 represents random
guessing.
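A small sketch that computes the AUC and plots the ROC curve for a placeholder classifier on a built-in dataset:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The ROC curve needs predicted probabilities of the positive class
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)
print("AUC:", roc_auc_score(y_test, proba))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()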
2. Regression Model Evaluation Metrics
For regression problems (where the output is a continuous value), evaluation metrics typically
include:
2.1 Mean Absolute Error (MAE)
•Definition: The average of the absolute differences between predicted and actual values.
• Formula: MAE = (1/n) Σ |y_i − ŷ_i|
• Use Case: MAE is useful when you want a simple, interpretable error metric. It treats all errors equally.
2.2 Mean Squared Error (MSE)
•Definition: The average of the squared differences between the predicted and actual values. Larger errors are penalized more heavily.
• Formula: MSE = (1/n) Σ (y_i − ŷ_i)²
• Use Case: MSE is sensitive to outliers because the errors are squared, making it suitable when large errors should be penalized more.
2.3 Root Mean Squared Error (RMSE)
•Definition: The square root of the MSE, which brings the error back to the same scale as the original data.
• Formula: RMSE = √MSE = √[(1/n) Σ (y_i − ŷ_i)²]
• Use Case: RMSE provides a more interpretable error metric in the same units as the target variable and penalizes large errors more than MAE.
2.4 R-squared (R²)
•Definition: A statistical measure of the proportion of variance in the target variable that is explained by the model.
• Formula: R² = 1 − Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)²
• Use Case: R² indicates how well the model fits the data. A value of 1 means the model explains all variance, while 0 means it explains none.
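A short sketch of these regression metrics on hypothetical predictions; RMSE is computed as the square root of MSE:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values from some regression model
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)
print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R²:", r2)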
3. Cross-Validation
•Definition: Cross-validation is a technique to assess how well the model generalizes by splitting
the data into multiple subsets (folds) and training/testing the model multiple times.
• k-Fold Cross-Validation: The data is split into k equally sized folds. The model is trained on k-1 folds
and tested on the remaining fold. This is repeated for each fold.
• Stratified k-Fold: Ensures that each fold has the same class distribution (useful for imbalanced
datasets).
4. Overfitting and Underfitting
•Overfitting: The model learns the training data too well, including noise and outliers, which can
result in poor performance on unseen data. High variance is associated with overfitting.
•Underfitting: The model is too simple and fails to capture important patterns in the training
data, leading to poor performance on both the training and test sets. High bias is associated
with underfitting.
5. Model Comparison
•Once multiple models have been trained, it's important to compare their evaluation metrics to
select the best-performing model for your problem. For classification tasks, this could mean
comparing accuracy, precision, recall, or F1-score. For regression, it could mean comparing MAE,
MSE, RMSE, or R².
Summary
Evaluating machine learning models involves using appropriate performance metrics to assess
how well the model generalizes to unseen data. The choice of metric depends on the problem
type (classification vs. regression) and the specific goals of the model. Tools like confusion
matrices, ROC curves, cross-validation, and various error metrics help in understanding model
performance, improving it, and choosing the best model for deployment.
Feature Engineering
Feature engineering is the process of selecting, manipulating, and
transforming raw data into features that can be used in supervised learning.
To make machine learning work well on new tasks, it is often
necessary to design and construct better features.
As you may know, a “feature” is any measurable input that can be used in a
predictive model — it could be the color of an object or the sound of
someone’s voice.
Feature engineering, in simple terms, is the act of converting raw
observations into desired features using statistical or machine
learning approaches.
What is Feature Engineering
Feature engineering is a machine learning technique that leverages data to
create new variables that aren’t in the training set. It can produce new
features for both supervised and unsupervised learning, with the goal
of simplifying and speeding up data transformations while
also enhancing model accuracy. Feature engineering is required when
working with machine learning models. Regardless of the data or
architecture, a terrible feature will have a direct impact on your model.
Example for feature
engineering
Below are the prices of properties in x city. It shows the area of the
house and total price.
Now, this data might have some errors or might be incorrect; not all
sources on the internet are correct. To begin, we’ll add a new column
to display the cost per square foot.
This new feature will help us understand a lot about our data. So, we have a
new column which shows the cost per square foot. There are a few ways you
can find any error. You can use domain knowledge: contact a property
advisor or real estate agent and show them the per-square-foot rate. If the
advisor states that pricing per square foot cannot be less than 3400, you
may have a problem. The data can also be visualised to spot such anomalies.
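A minimal pandas sketch of this example with made-up numbers (the areas and prices are hypothetical; 3400 is the domain rule quoted above):
import pandas as pd

# Hypothetical property data: area in square feet and total price
df = pd.DataFrame({
    "area_sqft": [1000, 1500, 800, 1200],
    "price":     [4_000_000, 5_700_000, 2_400_000, 4_600_000],
})

# New engineered feature: cost per square foot
df["price_per_sqft"] = df["price"] / df["area_sqft"]

# Flag rows that violate the domain rule (per-square-foot rate below 3400)
print(df[df["price_per_sqft"] < 3400])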
1.Imputation
When it comes to preparing your data for machine learning, missing values are
one of the most typical issues. Human errors, data flow interruptions, privacy
concerns, and other factors could all contribute to missing values. Missing values
have an impact on the performance of machine learning models for whatever
cause. The main goal of imputation is to handle these missing values. There are
two types of imputation :
Numerical Imputation: Missing numerical values are filled with a value computed
from the observed data, such as the mean, the median, or a sensible default
(for example, 0). Numerical imputation is commonly used to fill gaps in surveys
or censuses when certain pieces of information, such as income or household
size, are missing for some respondents.
#Filling all missing values with 0
data = data.fillna(0)
Categorical Imputation: When dealing with categorical columns, replacing
missing values with the most frequent value (the mode) of the column is a smart
solution. However, if you believe the values in the column are evenly distributed and
there is no dominating value, imputing a category like “Other” would be a
better choice, as your imputation is more likely to converge to a random
selection in this scenario.
# Fill missing values in a categorical column with its most frequent value
data['column_name'].fillna(data['column_name'].value_counts().idxmax(), inplace=True)
2.Handling Outliers
Outlier handling is a technique for removing outliers from a dataset. This
method can be used on a variety of scales to produce a more accurate data
representation. This has an impact on the model’s performance. Depending
on the model, the effect could be large or minimal; for example, linear
regression is particularly susceptible to outliers. This procedure should be
completed prior to model training. The various methods of handling outliers
include:
1.Removal: Outlier-containing entries are deleted from the distribution.
However, if there are outliers across numerous variables, this strategy may
result in a large chunk of the dataset being lost.
2.Replacing values: Alternatively, the outliers could be handled as missing
values and replaced with suitable imputation.
3.Capping: Using an arbitrary value or a value from a variable distribution to
replace the maximum and minimum values.
Discretization : Discretization is the process of converting continuous
variables, models, and functions into discrete ones. This is accomplished by
constructing a series of continuous intervals (or bins) that span the range of
our desired variable/model/function.
3.Log Transform
The log transform is one of the most commonly used techniques among data scientists.
It is mostly used to turn a skewed distribution into a normal or less-skewed distribution.
We take the log of the values in a column and use those values as the
column in this transform. It also dampens the influence of extreme values,
bringing the data closer to a normal distribution.
# Log transform example
df['log_price'] = np.log(df['Price'])
4.One-hot encoding
A one-hot encoding is a type of encoding in which an element of a finite set of
n categories is represented by a binary vector of length n, where only the
position corresponding to that element is set to “1” and all other positions
are set to “0”. In contrast to binary encoding schemes, where each bit can
represent 2 values (i.e. 0 and 1) and categories share bits, this scheme
assigns a dedicated bit position to each possible category.
5.Scaling
Feature scaling is one of the most pervasive and difficult problems in
machine learning, yet it’s one of the most important things to get right. In
order to train a predictive model, we need data with a known set of features
that may need to be scaled up or down as appropriate. This section explains
how feature scaling works, why it is important, and some tips for
getting started with feature scaling.
After a scaling operation, the continuous features become similar in terms of
range. Although this step isn’t required for many algorithms, it’s still a good
idea to do so. Distance-based algorithms like k-NN and k-Means, on the other
hand, require scaled continuous features as model input. There are two
common ways for scaling :
Normalization: All values are scaled into a specified range between 0 and 1
via normalization (or min-max normalization). This modification does not change
the shape of the feature’s distribution; however, it is sensitive to outliers,
because a single extreme value compresses the range of all the other values. As a
result, it is advised that outliers be dealt with prior to normalization.
Standardization: Standardization (also known as z-score normalization) is
the process of scaling values while accounting for the standard deviation. If the
standard deviation of the features differs, the range of those features will likewise
differ, and standardization reduces the effect of outliers in the features. To
arrive at a distribution with a mean of 0 and a variance of 1, each data point has
the mean subtracted from it and the result is divided by the standard deviation
of the distribution.
Few Best Feature Engineering Tools
There are many tools which will help you in automating the entire feature
engineering process and producing a large pool of features in a short period
of time for both classification and regression tasks.
FeatureTools
Featuretools is a framework to perform automated feature engineering. It
excels at transforming temporal and relational datasets into feature matrices
for machine learning. Featuretools integrates with the machine learning
pipeline-building tools you already have. In a fraction of the time it would
take to do it manually, you can load in pandas dataframes and automatically
construct significant features.
FeatureTools Summary
•Easy to get started, good documentation and community support
•It helps you construct meaningful features for machine learning and predictive
modelling by combining your raw data with what you know about your data.
•It provides APIs to verify that only legitimate data is utilised for calculations,
preventing label leakage in your feature vectors.
•Featuretools includes a low-level function library that may be layered to
generate features.
•Its AutoML library(EvalML) helps you build, optimize, and evaluate machine
learning pipelines.
•Good at handling relational databases.
2) AutoFeat
AutoFeat helps to perform Linear Prediction Models with Automated Feature
Engineering and Selection. AutoFeat allows you to select the units of the input
variables in order to avoid the construction of physically nonsensical features.
AutoFeat Summary
•AutoFeat can easily handle categorical features with one-hot encoding.
•The AutoFeatRegressor and AutoFeatClassifier models in this package have an
interface similar to scikit-learn models.
•It is a general-purpose automated feature engineering library, but it is not good
at handling relational data.
•It is useful for logistical data.
TsFresh
tsfresh is a python package. It calculates a huge number of time series
characteristics, or features, automatically. In addition, the package includes
methods for assessing the explanatory power and significance of such traits
in regression and classification tasks.
TsFresh Summary
•It is one of the best open-source Python tools available for time series
classification and regression.
•It helps to extract features such as the number of peaks, average value,
maximum value, time reversal symmetry statistic, etc.
•It can be integrated with FeatureTools.
OneBM
OneBM interacts directly with a database’s raw tables. It slowly joins the
tables, taking different paths on the relational tree. It recognises simple data
types (numerical or categorical) and complicated data types (set of numbers,
set of categories, sequences, time series, and texts) in the joint results and
applies pre-defined feature engineering approaches to the supplied types.
•Both relational and non-relational data are supported.
•When compared to FeatureTools, it generates both simple and complicated
features.
•It was put to the test in Kaggle competitions, and it outperformed state-of-
the-art models.
ExploreKit
Based on the idea that extremely informative features are typically the
consequence of manipulating basic ones, ExploreKit identifies common
operators to alter each feature independently or combine multiple of them.
Instead of running feature selection on all developed features, which can be
quite huge, meta learning is used to rank candidate features.
Comparison of Feature
Engineering Tools
Conclusion of feature
engineering
Feature engineering is the development of new data features from raw data.
With this technique, engineers analyze the raw data and potential
information in order to extract a new or more valuable set of features.
Feature engineering can be seen as a generalization of mathematical
optimization that allows for better analysis.
Different Types of Plots in
Machine Learning
In machine learning, various plots are used to visualize data, model performance, and the relationships
between features. Here are some commonly used plots:
1. Scatter Plot
• Purpose: To show the relationship between two continuous variables.
• When to use: If you're trying to understand how two variables correlate with each other.
• Example: Plotting "Age" vs. "Salary" to check how age correlates with salary.
2. Line Plot
• Purpose: Used to display data trends over a continuous variable (often time).
• When to use: To visualize changes over time or continuous variables.
• Example: Plotting training loss vs. number of epochs to check if the model is improving over time.
3. Histogram
•Purpose: Shows the distribution of a single variable.
•When to use: To understand how a feature is distributed or the frequency of different ranges of
values.
•Example: Plotting the distribution of ages or salaries.
4. Box Plot (Box-and-Whisker Plot)
•Purpose: To visualize the distribution of data and identify outliers.
•When to use: When you want to show the median, quartiles, and detect outliers.
•Example: Comparing the salary distribution across different departments.
5. Pair Plot (or Scatterplot Matrix)
•Purpose: Displays pairwise relationships between multiple continuous variables.
•When to use: To see correlations or patterns between multiple features in a dataset.
•Example: Pair plotting all features of the Iris dataset to see relationships.
6. Heatmap
•Purpose: Shows the correlation matrix between variables or the value of a matrix.
•When to use: To quickly identify patterns, correlations, or cluster similarity.
•Example: Showing the correlation between different features in a dataset.
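For illustration, a small matplotlib/seaborn sketch producing a scatter plot, histogram, box plot, and correlation heatmap from hypothetical age/salary/department data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: age, salary, and department
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(22, 60, 200),
    "salary": rng.normal(50_000, 12_000, 200),
    "dept": rng.choice(["IT", "HR", "Sales"], 200),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].scatter(df["age"], df["salary"])                            # scatter: age vs. salary
axes[0, 1].hist(df["salary"], bins=20)                                 # histogram: salary distribution
sns.boxplot(data=df, x="dept", y="salary", ax=axes[1, 0])              # box plot per department
sns.heatmap(df[["age", "salary"]].corr(), annot=True, ax=axes[1, 1])   # correlation heatmap
plt.tight_layout()
plt.show()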
7. ROC Curve (Receiver Operating Characteristic Curve)
•Purpose: Used for evaluating the performance of a binary classification model.
•When to use: To assess a classifier's ability to distinguish between classes.
•Example: Plotting the true positive rate vs. the false positive rate for different thresholds.
8. Precision-Recall Curve
•Purpose: Visualizes the trade-off between precision and recall for different thresholds.
•When to use: When dealing with imbalanced classes and you want to evaluate model
performance based on recall and precision.
•Example: A classifier predicting rare events, like fraud detection.
9. Confusion Matrix
•Purpose: A table used to evaluate the performance of a classification model.
•When to use: To show the true positives, false positives, true negatives, and false negatives.
•Example: For a binary classifier, showing how many correct and incorrect predictions were
made.
10. Learning Curve
•Purpose: Plots training and validation error against training size or epochs.
•When to use: To detect overfitting or underfitting in your model by showing performance trends
as more data or epochs are added.
•Example: Showing how error decreases with more training data or epochs in a neural network.
11. Feature Importance Plot
• Purpose: Visualizes the importance of each feature in a model's decision-making process.
• When to use: To understand which features contribute most to the prediction in models like
Random Forest or XGBoost.
• Example: Plotting feature importances from a trained decision tree model.
12. Calibration Curve
• Purpose: Compares predicted probabilities with actual outcomes.
• When to use: When you need to evaluate how well the model’s predicted probabilities align with
the true probabilities.
• Example: Checking if predicted probabilities from a classifier match the true likelihood of the event.
13. t-SNE Plot (t-Distributed Stochastic Neighbor Embedding)
•Purpose: A dimensionality reduction technique used to visualize high-dimensional data in 2D or
3D.
•When to use: To visualize high-dimensional data, like word embeddings or neural network
activations.
•Example: Visualizing clusters in a high-dimensional feature space, such as images or text data.
14. PCA Plot (Principal Component Analysis)
•Purpose: Reduces dimensionality while preserving variance in the data.
•When to use: To visualize high-dimensional data in a lower-dimensional space.
•Example: Plotting the first two principal components of a dataset to explore clusters or trends.
15. Decision Boundary Plot
•Purpose: Displays the decision boundary of a classification model.
•When to use: To visualize how well a model separates different classes in 2D space.
•Example: For a binary classifier, showing how the model divides the feature space between the
two classes.