
MODULE - 1

Introduction to Feature Engineering: Introduction to Data and Features, Importance of Features in Machine Learning.

Data types and features: Numerical, Categorical, Ordinal, Discrete, Continuous, Interval and Ratio.

Basic Feature Preprocessing: Handling Missing Data, Data Cleaning, Feature Scaling, Normalization, and Transformation.
Introduction to Feature Engineering
1.INTRODUCTION TO DATA
“DATA”
• Observations of real-world phenomena.
• Each piece of data provides a small window into a limited aspect of reality; together, the
pieces paint a picture of the whole.
• The picture is composed of a thousand little pieces, and there is always measurement noise
and there are always missing pieces.
• Refers to raw, unprocessed information that you use to train models.
• Data can be anything from numbers, categories, images, text, or even sound.
• The type of data you're working with will determine the kind of features you'll create.
• Examples:
1. Stock market data:- observations of daily stock prices, announcements of earnings by
individual companies, and even opinion articles from pundits.
2. Personal biometric data:- measurements such as minute-by-minute heart rate, blood
pressure, and blood sugar level.
3. Customer intelligence data:- includes observations such as “Alice bought two books on
Sunday,” “Bob browsed these pages on the web site,” and “Charlie clicked on the special
offer link from last week.”
• Data comes in various formats, including:
 Numerical: Continuous (e.g., height, weight, temperature) or discrete (e.g., number of
children, count of items).
 Categorical: Non-numeric data that represents categories or classes (e.g., color, gender,
region).
 Text: Strings of characters that require special techniques (like tokenization or
embeddings) to be used effectively by machine learning models.
 Time-series: Data that is collected over time (e.g., stock prices, sensor data).
 Image: Pixels or pixels' characteristics that can represent visual information.
• Example: a sample table of house records (shown as an image on the original slide).
• Each row represents an individual observation or instance (in this case, a house).
• Each column represents a specific characteristic or attribute of that instance.
“FEATURES”
• A numeric representation of raw data.
• An individual measurable property or characteristic of the phenomenon you're trying to
model.
• Inputs for a machine learning algorithm.
• Example:- In a dataset predicting house prices, features could include the square footage,
number of bedrooms, neighborhood, and year built.
• Key points:
1.Raw Features: These are the unprocessed variables from the dataset.
Example:- Raw numerical values of a person's age, income, etc.
2.Derived Features: These are features created by transforming or combining raw features.
Example:- Creating an "age group" feature from the raw age feature
(e.g., <18 = "child", 18-65 = "adult", >65 = "senior"), as sketched below.
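A minimal sketch of deriving such a feature with pandas (the data and column names here are hypothetical, and pd.cut's bin edges only approximate the <18 / 18-65 / >65 split):

import pandas as pd

# hypothetical raw feature: ages of four people
df = pd.DataFrame({'age': [12, 25, 40, 70]})
# derive the "age group" feature by binning the raw ages
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 65, 120],
                         labels=['child', 'adult', 'senior'])
print(df)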
FEATURE ENGINEERING
• The act of extracting features from raw data and transforming them into formats that are
suitable for the machine learning model.
• A crucial step in the machine learning pipeline, i.e., the right features can ease the difficulty
of modeling and therefore enable the pipeline to output results of higher quality.
• Might involve several steps, such as:
1. Data Cleaning: Removing or correcting noisy, missing, or inconsistent data.
2. Data Transformation: Changing the scale or format of the data (e.g.,
normalization, standardization).
3. Feature Creation: Combining or manipulating features to create new ones (e.g.,
adding the "age group" feature mentioned above).
4. Feature Selection: Choosing which features are most important to the model. This
can be done manually or through statistical techniques.
5. Feature Encoding: Converting categorical data into a numeric format, for
example, through one-hot encoding or label encoding (see the sketch below).
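A hedged illustration of the encoding step, using a toy 'color' column (an assumption made for this example):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
# one-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['color'], prefix='color')
print(one_hot)
# label encoding: map each category to an integer code
df['color_label'] = df['color'].astype('category').cat.codes
print(df)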
Example: Feature Engineering for a Housing Price Prediction
• Tasked with building a machine learning model to predict the price of houses based
on various factors.
1)Raw Data:
 Raw features might include:
1. Square footage of the house (numerical)
2. Number of bedrooms (numerical)
3. Number of bathrooms (numerical)
4. Year the house was built (numerical)
5. Neighborhood (categorical)
2)Feature Engineering:
 The following transformations could be performed:
1. Square Footage: Create a feature that represents the "price per square foot"
(Price/Square Foot).
2. Age of House: Create a new feature for the "age of the house" by subtracting the year
built from the current year.
3. Neighborhood Encoding: Use one-hot encoding to convert the "Neighborhood"
feature into binary columns (e.g., is the house in neighborhood A? Yes or no).
4. Price Category: Create a categorical feature like "Price Range" based on the price
of the house (e.g., low, medium, high).
• After these transformations, the data is more structured, and the model can better
identify patterns in the features.
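A hedged sketch of these four transformations in pandas; the toy data and the column names (SquareFootage, YearBuilt, Neighborhood, Price) are assumptions for illustration:

import pandas as pd

df = pd.DataFrame({'SquareFootage': [1500, 2400, 900],
                   'YearBuilt': [1999, 2015, 1975],
                   'Neighborhood': ['A', 'B', 'A'],
                   'Price': [300000, 540000, 150000]})
# 1. price per square foot
df['PricePerSqFt'] = df['Price'] / df['SquareFootage']
# 2. age of the house (assumes the current year is 2025)
df['HouseAge'] = 2025 - df['YearBuilt']
# 3. one-hot encode the neighborhood into binary columns
df = pd.get_dummies(df, columns=['Neighborhood'])
# 4. bucket the price into a categorical range
df['PriceRange'] = pd.cut(df['Price'],
                          bins=[0, 200000, 400000, float('inf')],
                          labels=['low', 'medium', 'high'])
print(df)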
IMPORTANCE OF FEATURES IN MACHINE LEARNING
Features
• Features, also called attributes or variables, are the individual measurable properties or
characteristics of the phenomenon being observed.
• In machine learning, they serve as the inputs to our models, allowing them to learn patterns
and make predictions.
• Their quality directly impacts the performance, interpretability, and efficiency of any
machine learning model.
1.Model Performance and Accuracy
Garbage In, Garbage Out: If the features are irrelevant, noisy, redundant, or
poorly chosen, even the most sophisticated algorithms will struggle to achieve good
performance.
Signal vs. Noise: Good features amplify the signal (the underlying patterns) and
reduce the noise in the data, making it easier for the model to learn meaningful
relationships.
 Predictive Power: High-quality features are directly correlated with the target variable,
providing strong predictive power.
 Overfitting and Underfitting: Poor feature selection can lead to overfitting (model
learns noise) or underfitting (model doesn't capture underlying patterns).
2. Model Interpretability and Explainability:
 Understanding Model Decisions: When features are meaningful and well-defined, it's
easier to understand why a model makes certain predictions. This is particularly
important in domains like healthcare or finance where transparency is critical.
 Identifying Key Drivers: Feature importance techniques help identify which features
contribute most to the model's output, providing insights into the underlying process.
 Trust and Acceptance: Interpretable models are more likely to be trusted and accepted
by stakeholders and end-users.
3.Computational Efficiency and Scalability:
Reduced Training Time: Fewer, more relevant features mean less data for the model to
process, which shortens training time.
Lower Memory Requirements: Storing and manipulating fewer features reduces memory
consumption, which is crucial for large datasets.
Simpler Models: Models built with fewer features are often simpler, making them easier to
deploy and maintain.
Mitigating the Curse of Dimensionality: As the number of features increases, the sparsity of
data in high-dimensional spaces makes it exponentially harder for models to learn effectively.
4. Data Quality and Preparation:
Feature Engineering: The art of creating new features from raw data to improve model
performance; it often involves domain expertise and creativity (e.g., combining existing
features, extracting information from text/images).
Feature Scaling and Normalization: Ensuring features are on a similar scale (e.g., min-
max scaling, standardization) prevents features with larger ranges from dominating the
learning process.
Handling Missing Values: Appropriate strategies for dealing with missing data in features
(imputation, removal) are essential.
 Outlier Detection and Treatment: Outliers in features can skew model training and
lead to inaccurate predictions.
5. Robustness and Generalization:
Less Prone to Noise: Well-engineered features make models more robust to noisy or
irrelevant data points.
Better Generalization: Models trained on relevant and representative features are more
likely to generalize well to unseen data, rather than just memorizing the training
examples.
Data Types And Features
DATA
• A collection of information gathered by observations, measurements, research or analysis.
• Consists of facts, numbers, names, figures or even descriptions of things.
• Data Types:- Describe what kind of values the data contains and how it is organized.
• Dictate the types of transformations and algorithms that can be applied to features.
• Broadly classified into three types
1.Structured Data
• Information organized in a predefined manner, often in tabular form with rows and columns.
• Easily searchable and efficiently processed by algorithms and databases.
Characteristics
 Follows a consistent format and structure, with predefined fields and types.
 Stored in relational databases (like SQL databases) or Excel files, which use a table-based
format.
 Easily queried using standard query languages (SQL) thanks to its organized structure.
 Contains clearly defined and constrained types of data like numbers, dates and strings.
• Examples:- Spreadsheets with rows and columns, SQL databases with tables, CSV files
with comma-separated values.
2.Semi-structured Data
• Information that does not conform to a rigid structure.
• Contains tags or markers to separate semantic elements and enforce hierarchies of
records and fields within the data.
• Lies between structured and unstructured data.
• Offers more flexibility than fully structured data while maintaining some level of
organization.
Characteristics
 Schema or structure is not fixed, allowing for variations and the inclusion of different
attributes for different records.
 Tags or markers serve as metadata that conveys the structure and hierarchy of the data.
 Tree-like structure, allowing for nested data elements.
 Stored in formats which support nested structures, such as JSON or XML, and processed by
NoSQL databases.
• Examples:- JSON (JavaScript Object Notation), XML (Extensible Markup Language),
YAML (YAML Ain't Markup Language).
3.Unstructured Data
• Information that lacks a predefined data model or organization.
• Does not fit neatly into tables with rows and columns.
• Often text-heavy, spanning a wide variety of content types and formats.
Characteristics
 Doesn't follow a specific schema or structure.
 Can be in the form of text, images, audio, video, social media posts, emails.
 More challenging to store, manage and analyze due to its lack of structure.
 Normally comes in large volumes.
 Requires advanced techniques and tools, such as Natural Language Processing (NLP),
Machine Learning (ML), Artificial Intelligence (AI) and text analytics, to extract
meaningful information and insights.
 Despite its complexity, provides valuable insights that are not easily obtainable from
structured data alone.
• Examples:- Text documents (Word documents, PDFs), e-mail messages, social media
posts (tweets, Facebook posts), multimedia files (images, audio files, video files), web
pages, log files, etc.

Structured Data
• The usual subject of data analytics.
• Classified as:-
1.Qualitative (non-numeric) data.
2.Quantitative (numeric) data.
1.Qualitative (Categorical) Data
• Features that can take one of a limited number of values.
• Categorizes the data into various categories.
• Describes characteristics of the data such as, in statistics, the gender of people, their family
name and occupation.
• Categorized into two:-
1. Nominal Data
2. Ordinal Data
1.Nominal Data
• Data type having categories or names with no ranking or inherent order.
• Used to categorize observations into groups, and the groups are not comparable.
• Represented using frequency tables and bar charts, which display the number or proportion of
observations in each category.
• Example:- A frequency table for gender might show the number of males and females in a
sample of people.
2.Ordinal Data
• Categories with a meaningful order or ranking.
• Measures subjective attributes or opinions, where there is a natural order to the responses.
• Example:- Education level (Elementary, Middle, High School, College), job position (Manager,
Supervisor, Employee).
2.Quantitative (Numeric) Data
• Values with numeric types (int, float, etc.).
• Used to represent height, weight, length, etc.
• Classified into two:-
1. Discrete Data
2. Continuous Data
1.Discrete Data
• Represents countable values with a finite number of outcomes.
• Values can be easily counted as whole numbers.
• A data type in statistics that only takes distinct, single values.
• Example:- Number of students in a class, number of members in a family, etc.
2.Continuous Data
• Takes any value within a range.
• A type of quantitative data that represents the data in a continuous range.
• A variable in the data set can have any value between the bounds of the data set.
• There will be a minimum and a maximum value, and all other values fall between the
minimum and maximum values.
• Example:- Weight of each student in a class, salary of workers in a factory.
• Two types of Continuous Data
1.Interval Scaled Data
2.Ratio Scaled Data
1.Interval Scaled Data
• Has meaningful differences between values, but there is no true zero point – zero doesn’t mean
“nothing”.
• Represents ordered values where the differences between them are meaningful and consistent.
• Allows measurement of all quantitative attributes.
• Any measurement on an interval scale can be ranked, counted, subtracted or added.
• Equal intervals separate each number on the scale.
• Holds no true zero and can represent values below zero.
• Example:-Temperature in Celsius or Fahrenheit, dates, and time of day.
• For instance, 20°C is not twice as warm as 10°C, and the zero point in Celsius is not the
absence of temperature.
2.Ratio Scaled Data
• Has a true zero point, meaning zero indicates the absence of the measured quantity.
• Has the same properties as interval scales.
• Can be used to add, subtract, or count measurements, and ratios between values are meaningful.
• Differs by having a character of origin, i.e., the starting or zero point of the scale.
• Represents ordered values with meaningful and consistent differences, but with a true zero
point.
• Example:- Height, weight, age, and income. Zero height means no height, and a person who is
6 feet tall is twice as tall as someone who is 3 feet tall.
Basic Feature Preprocessing
Data Cleaning
• Process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate or
incomplete data within a dataset.
• Process of removing data that does not belong in the dataset.
• A crucial step in data analytics.
• Reasons
1.Accuracy of Analysis
 If the data contains errors or inconsistencies, the analysis based on that data will be
inaccurate or misleading.
 Clean Data:- Insights and conclusions drawn from the analysis are reliable.
2.Improved Data Quality
 Data cleaning removes errors, duplicates and inconsistencies, improving the overall quality of
the dataset.
 The result is more robust and trustworthy.
3.Efficiency in Processing
• Clean Data:- Reduces the complexity of processing, leading to faster and more efficient
data manipulation and analysis.
• Reduces the computational resources needed.
4.Better Decision Making
• Clean Data:- Allows better insights, leading to more informed decision making.
• Important in business contexts where decisions are based on data-driven insights.
5.Enhanced Model Performance
• Machine learning models perform better when trained on clean data.
• Noise, outliers and errors in the data can lead to poor model performance, including
overfitting and underfitting.
6.Compliance and Legal Requirements
• Many regulations and standards require data to be accurate and clean.
• Data cleaning helps organizations comply, avoiding potential legal issues.
7.Consistency across Data Sources
• Data Cleaning:- Ensures consistency when integrating data from multiple sources.
• Standardizes formats, resolves discrepancies and ensures that data is comparable across
datasets.
8.Cost Efficiency
• Investing time in data cleaning saves costs by preventing errors that might arise later in the
data analysis or decision-making process.
9.Facilitates Data Exploration
• Clean data allows data scientists to explore the data more effectively, identify patterns
and generate hypotheses.
• Provides a clear and accurate picture of what the data represents.
• Several aspects need to be considered while cleaning the data.
• These include:
1. Handling Missing Data
2. Removing Duplicates
3. Standardizing Data
4. Correcting inconsistent Data
5. Handling Outliers
6. Filtering Unnecessary Data
7. Transforming Data
8. Validating Data
1.Handling Missing Data
• A crucial step in feature engineering, as many machine learning algorithms can't handle
missing values directly.
• Several strategies exist to handle missing data:
1.Deletion:- The simplest, but often the least recommended, especially with large amounts of
missing data.
2.Imputation:- Involves filling in missing values with estimated ones. This is generally
preferred over deletion as it retains more data.
3.Flagging:- Creating indicator variables or missing-value flags. Involves creating new binary
columns that indicate whether a value in the original feature was missing.
• Data stored in CSV files can be read into a DataFrame, which is a table-like structure in
pandas.
• Example:- emp.csv file
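As a minimal sketch of the flagging strategy above (assuming emp.csv has a Salary column, as in the later examples):

import pandas as pd

df = pd.read_csv('emp.csv')
# binary flag column: 1 where Salary was missing, 0 otherwise
df['Salary_missing'] = df['Salary'].isna().astype(int)
print(df)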
1.1.Deleting Rows from DataFrames
• In deletion, remove rows or columns that contain missing values.
• Various methods can be used depending on specific requirements.
1.1.1 Removing Rows by Index
• A list of index labels is passed, and the rows corresponding to those labels are dropped using
the drop() method of the DataFrame.
• When the CSV file emp.csv is read, NaN stands for Not a Number (it represents a blank cell).
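The program itself was shown as a screenshot on the slide; a minimal equivalent, assuming the same emp.csv file, is:

import pandas as pd

df = pd.read_csv('emp.csv')
print("Original DataFrame")
print(df)
# drop the rows whose index labels are 0 and 3
df.drop([0, 3], inplace=True)
print("DataFrame after removing rows 0 and 3")
print(df)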
• After executing the program, the rows with index labels 0 and 3 are removed from the DataFrame.
1.1.2.Removing Rows by Condition
• To remove rows based on a condition, use Boolean indexing.
• Example:- Delete all rows with salary < 50000.
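The slide's code was an image; a hedged equivalent, again assuming a Salary column in emp.csv:

import pandas as pd

df = pd.read_csv('emp.csv')
# keep only the rows whose salary is at least 50000,
# i.e. delete all rows with salary < 50000
df = df[df['Salary'] >= 50000]
print(df)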
1.1.3.Removing Duplicate Rows
• To remove duplicate rows, use the drop_duplicates() method.
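A minimal sketch of the call (the emp.csv file is assumed):

import pandas as pd

df = pd.read_csv('emp.csv')
print("Original DataFrame")
print(df)
# remove duplicate rows, keeping the first occurrence of each
df.drop_duplicates(inplace=True)
print("DataFrame after removing duplicate rows")
print(df)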
• In the example DataFrame, the records with index numbers 6 and 8 are duplicated; after the
call, only the first occurrence is kept.
1.1.4.Removing Rows with Missing Values
• To remove rows with missing values, use the dropna() method.
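A minimal sketch, assuming the same emp.csv:

import pandas as pd

df = pd.read_csv('emp.csv')
print("Original DataFrame")
print(df)
# drop every row that contains at least one missing (NaN) value
df = df.dropna()
print("DataFrame after removing rows with missing values")
print(df)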
1.1.5.Removing Rows by Index Range
• To remove a range of rows by their index, use slicing together with drop.
• The upper index of the slice will not be included.
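A hedged sketch (the emp.csv file and the 0:3 range are assumptions for illustration):

import pandas as pd

df = pd.read_csv('emp.csv')
# df.index[0:3] selects the rows at positions 0, 1 and 2;
# the upper bound 3 is excluded by the slice
df.drop(df.index[0:3], inplace=True)
print(df)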
1.2.Deleting Columns from DataFrames
drop( ) method
• Used to remove columns by specifying the column names and setting the axis parameter to
1 (which denotes columns).
• Returns a new DataFrame without the specified columns, but can modify the original
DataFrame in place by setting the inplace parameter to True.
1.2.1. Removing columns by using Label
import pandas as pd
df=pd.read_csv('emp.csv')
print("Original DataFrame")
print(df)
print("DataFrame after removing the columns Dept and Salary")
df.drop(['Dept','Salary'],axis=1,inplace=True)
print(df)
1.2.2. Removing columns by using Index
• Remove columns based on their index positions using the drop method.
import pandas as pd
df=pd.read_csv('emp.csv')
print("Original DataFrame")
print(df)
print("DataFrame after removing the columns 1 and 3")
df.drop(df.columns[[1,3]],axis=1,inplace=True)
print(df)
1.2.3. Removing Range of columns by using Indices
• Remove the range of columns by passing their indices to the drop method.
• Upper index in the range will not be included.
import pandas as pd
df=pd.read_csv('emp.csv')
print("Original DataFrame")
print(df)
print("DataFrame after removing the columns from 0 to 3")
df.drop(df.columns[0:3],axis=1,inplace=True)
print(df)
1.2.4. Removing Columns using del Keyword
• The del keyword is used to delete a column from the DataFrame.
import pandas as pd
df=pd.read_csv('emp.csv')
print("Original DataFrame")
print(df)
print("DataFrame after removing the columns Salary")
del df['Salary']
print(df)
1.2.5. Removing Columns using pop( ) Method
• pop( ) method:- removes a column, and the removed column is returned as a
Series.
import pandas as pd
df=pd.read_csv('emp.csv')
print("Original DataFrame")
print(df)
d=df.pop('Dept')
print("DataFrame after removing the column Dept")
print(df)
print("Removed column Dept")
print(d)
1.3.Flagging
• Missing values are marked with a specific indicator, like 0, so that they can be handled
during analysis.
• In the example below, the missing values are replaced with zero.
• The fillna() method fills the missing values.
• The parameter inplace=True modifies the existing DataFrame instead of returning a new one.
import pandas as pd
df=pd.read_csv('emp.csv')
print("Original Dataframes with missing values");
print(df)
#for filling missing values with 0
print("Dataframes after filling missing values with zeroes");
df.fillna(0,inplace=True)
print(df)
1.4.Imputation
• Fill in missing values using the mean, the median, or methods like forward fill or backward fill.
• In forward fill, missing values are replaced with the previous non-missing value.
• In backward fill, missing values are replaced with the next non-missing value.
import pandas as pd
df=pd.read_csv('emp.csv')
print("Original Dataframes with missing values");
print(df)
#for filling missing values in Salary with the column mean
print("Dataframes after filling missing values with the mean");
mean_value=df['Salary'].mean()
df.fillna({'Salary':mean_value},inplace=True)
print(df)
Replacement of missing values with forward fill.
import pandas as pd
df=pd.read_csv('emp.csv')
print("Original Dataframes with missing values");
print(df)
#for filling missing values with forward fill
print("Dataframes after filling missing values with forward fill");
df=df.ffill()
print(df)
Replacement of missing values with backward fill.
import pandas as pd
df=pd.read_csv('emp.csv')
print("Original Dataframes with missing values");
print(df)
#for filling missing values with backward fill
print("Dataframes after filling missing values with backward fill");
df=df.bfill()
print(df)
2.Data Cleaning
• Includes handling missing data, removing duplicates, standardizing data, correcting
inconsistent data, handling outliers, filtering unnecessary data, transforming data and
validating data.
2.1.Removing Duplicates
• When data is imported from external sources (e.g., CSV files, databases) into a DataFrame,
duplicates may exist in the original data.
• Need to identify and remove duplicate records in the dataset to ensure accuracy and prevent
bias in analysis.
• Duplicate records may occur in a DataFrame due to causes such as:
 Combining two DataFrames using merge or join operations can introduce duplicate rows,
especially if the join keys are not unique.
 Concatenating multiple DataFrames can produce duplicates if the DataFrames share common
rows.
 Manual data entry leads to duplicate rows if the same information is entered multiple times.
 Automated data collection processes, like web scraping or logging events, can inadvertently
record the same data more than once.
2.1.1.drop_duplicates( ) function of the DataFrame allows removing duplicates from the
DataFrame.
• In the original DataFrame below, records 6 and 7 are exactly the same, i.e., duplicates.
• After the execution of the program, record 7 is removed from the DataFrame.
import pandas as pd
df=pd.read_csv('People.csv')
print("Original Dataframe");
print(df)
#Remove Duplicates and modify the original DataFrame
print("Dataframes after removing duplicates");
df.drop_duplicates(inplace=True)
print(df)

• drop_duplicates( ) keeps the first occurrence of each duplicated row, but this
behavior can be changed with the keep parameter:
df.drop_duplicates(inplace=True, keep="last")
• With this code, record 6 will be removed and record 7 will be retained, as record 7 is the
last occurrence among the duplicates.
2.1.2.Removing Duplicates based on Specific Columns
• Remove duplicate records based on specific columns using the subset parameter.
• In the example, records are removed if there are duplicates in the column "Gender", using
the subset parameter.
import pandas as pd
df=pd.read_csv('People.csv')
#Remove Duplicates based on specific column Gender
print("Dataframes after removing duplicates in column Gender");
df.drop_duplicates(inplace=True,subset='Gender')
print(df)
2.2.Standardizing Data
• Data is often available in different formats, with different values for the same information,
or with different scales of measurement for a particular quantity.
• Examples
1.Dates recorded in different formats.
2.Currencies may differ.
3.Height or weight recorded in different units.
4.Values for gender stored in different ways, i.e., M or 1 for male, F or 0 for female.
• The above inconsistencies must be corrected before doing further analysis, otherwise they
produce inappropriate results.
• Methods:- Formatting, standardizing categories and normalization.
2.2.1.Standardize Categories
• Example: suppose the column Gender records males as "M" or "1" and females as "F" or "0".
• Standardize on one representation per category, e.g., classify all males as M and all
females as F.
import pandas as pd
df=pd.read_csv('People.csv')
print("Dataframe before standardizing the categories for column Gender")
print(df)
for x in df.index:
    if df.loc[x,"Gender"]=="1":
        df.loc[x,"Gender"]="M"
    if df.loc[x,"Gender"]=="0":
        df.loc[x,"Gender"]="F"
print("Dataframes after standardizing the categories for column Gender");
print(df)
2.3.Correcting Errors
• Need to identify and correct typos, data entry errors or inconsistencies.
• A data analyst needs to identify all inconsistencies and remove them.
• Needs to cross-reference data with external sources, where available, for validation.
• Example
 Looking at the record at index position 11, the height of Prachi is 1560 centimeters, which
can never be the height of a human being.
 It could be either a typo or a data entry error.
 An experienced data analyst would conclude that it was 156 instead of 1560, mistyped
during data entry, and correct it to 156.
 The record at index position 11 is now changed to 156.
import pandas as pd
df=pd.read_csv('People.csv')
df.loc[11,'Height']=156
2.4.Handling Outliers
Outliers
• An individual data point that is distant from the other points in the dataset.
• An anomaly in the dataset that may be caused by a range of errors in capturing, processing
or manipulating data.
• Data that does not comply with the normal behavior.
Handling Outliers
• Detecting and addressing outliers is crucial in data analysis; otherwise they lead to
biased or inappropriate results.
• Outliers happen due to data entry errors or faulty instruments recording data.
• Depending on the context, outliers can be removed, transformed or handled otherwise.
• Example
 Consider the DataFrame containing 1560 as a value in the column Height; it is considered an
outlier and removed by using the drop( ) method of the pandas DataFrame.
import pandas as pd
df=pd.read_csv('People.csv')
print("DataFrame before removing the outlier in column Height")
print(df)
df.drop(11,inplace=True)
print("DataFrame after removing the outlier in column Height")
print(df)
3.Feature Scaling
• A technique used in data preprocessing to normalize the range of independent variables
or features of data.
• Standardizes the independent features present in the data.
• Crucial in many machine learning algorithms.
• Ensures each feature contributes equally to the result and improves the convergence speed of
gradient descent algorithms.
• Performed during data pre-processing to handle highly varying values.
• Example:- After scaling, values such as 10 m and 10 cm are treated on a comparable scale
rather than being dominated by their raw magnitudes and units.
3.1.Min-Max Scaling
• A feature scaling technique that rescales data to a fixed range, usually between 0 and 1.
• Used when the data is known to be bounded and everything is wanted in a fixed interval.
• First, find the minimum and the maximum value of the column.
• Then subtract the minimum value from each entry and divide the result by the range
(maximum minus minimum):
X' = (X - Xmin) / (Xmax - Xmin)
X = Original Value
Xmin = Minimum Value in the feature
Xmax = Maximum Value in the feature
X' = Scaled Value
• For example, for the values [100, 200, 300]: 100 maps to 0, 200 maps to
(200-100)/(300-100) = 0.5, and 300 maps to 1.
from sklearn.preprocessing import MinMaxScaler
import numpy as np
data = np.array([[100],[200],[300]])
print("unscaled Data")
print(data)
scaler=MinMaxScaler()
scaled=scaler.fit_transform(data)
print("Scaled Data")
Basic Feature Preprocessing
3.2.Standardization(Z-score normalization)
• A feature scaling technique that centers the data around the mean and scales it based on
standard deviation.
• The resulting distribution has mean = 0 and standard deviation = 1.
• Formula:
X' = (X - μ) / σ
X = Original feature Value
μ = Mean of the feature
σ = Standard deviation of the feature
X' = Standardized Value
• For example, for [100, 200, 300]: μ = 200 and σ ≈ 81.65 (the population standard deviation,
as used by scikit-learn's StandardScaler), so 100 maps to about -1.22, 200 to 0, and 300 to
about 1.22.
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[100],[200],[300]])
print("Data before standardization")
print(data)
scaler=StandardScaler()
standardized=scaler.fit_transform(data)
print("Data after standardization")
print(standardized)
Normalization
• Process of rescaling input features to a standard range or distribution.
• Ensures features with larger magnitudes do not disproportionately influence the learning
process of machine learning algorithms that are sensitive to the scale of features.
• Used with algorithms like KNN or neural networks that rely on distance calculations.
• Can be done using min-max scaling or Z-score normalization.
• Sometimes data needs to be adjusted to a common scale, especially when dealing with varying
units.
• Needed when different DataFrames from various sources are merged to form a single DataFrame.
• Example – Data Analysis on Hotel Bookings
 Different hotels located in different countries may take bookings in different currencies.
 Booking amounts may be in different currencies when the DataFrames from various
countries are merged together.
 Need to convert to one standard currency while preprocessing to get accurate results.
• Example:- People.csv
 The column "Height" is recorded in different units.
 Some values are recorded in centimeters and some in feet.
 Normalize by converting all values to centimeters, or all values to feet.
 Here, convert all values in feet to centimeters using 1 foot = 30.48 centimeters.
 Values less than 130 are assumed to be in feet and are normalized.
 Fixing a threshold value like 130 depends on the domain knowledge of the data analyst.
import pandas as pd
df=pd.read_csv('People.csv')
print("DataFrame before normalizing the column Height")
print(df)
for x in df.index:
    if df.loc[x,"Height"]<130:
        df.loc[x,"Height"]=30.48*df.loc[x,"Height"]
Transformation
• Process of converting data from one format or structure into another.
• A crucial step in data preprocessing, particularly in data integration, data warehousing and
data analysis.
• Goal:- To make data more suitable for analysis or reporting, or to ensure compatibility with
other systems.
• Techniques include encoding, aggregation and feature engineering.
1.Aggregation
• Process of gathering, summarizing and combining data from multiple sources or records to
produce a consolidated view.
• Essential for simplifying large datasets, making it easier to derive insights, identify trends and
make decisions.
• Various operations are:-
1.1.Summation:- Adding up values across multiple records (e.g., total sales in a month).
1.2.Averaging:- Calculating the mean value across a set of records (e.g., average customer
spend).
1.3.Counting:- Counting the number of records that meet certain criteria (e.g., the number of
transactions).
1.4.Finding Min/Max:- Identifying the minimum or maximum value in a dataset (e.g., the
lowest or highest price).
1.5.Grouping:- Organizing data into categories or groups, often based on a key variable, and
then applying aggregate operations (e.g., sum, average) to create more meaningful datasets.
import pandas as pd
df=pd.read_csv("input.csv")
print(df)
#counts the number of people in each dept
c=df.groupby(['dept'])['dept'].count()
print("No:of people in each department")
print(c)
#sum of salary paid by each dept
sum_salary=df.groupby(['dept'])['salary'].sum()
print("Sum of salary paid by each department")
print(sum_salary)
#Average salary paid by each dept
average_salary=df.groupby(['dept'])['salary'].mean()
print("Average salary paid by each department")
print(average_salary)
#Maximum Salary paid by each dept
max_salary=df.groupby(['dept'])['salary'].max()
print("Maximum salary paid by each department")
print(max_salary)
#Minimum Salary paid by each dept
min_salary=df.groupby(['dept'])['salary'].min()
print("Minimum salary paid by each department")
print(min_salary)
