• In a tabular dataset, each row represents an individual observation or instance (for example, a house).
• Each column represents a specific characteristic or attribute of that instance.
Introduction to Feature Engineering
1.INTRODUCTION TO DATA
“FEATURES”
• A numeric representation of raw data.
• An individual measurable property or characteristic of the phenomenon you're trying to
model.
• Inputs for a machine learning algorithm.
• Example: In a dataset predicting house prices, features could include the square footage, number of bedrooms, neighborhood, and year built.
• Key points:
1.Raw Features: These are the unprocessed variables from the dataset.
Example: Raw numerical values of a person's age, income, etc.
2.Derived Features: These are features created by transforming or combining raw features.
Example: Creating an "age group" feature from the raw age feature
(e.g., <18 = "child", 18-65 = "adult", >65 = "senior"); see the sketch below.
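As a quick illustration, the following minimal pandas sketch derives an "age group" feature from a raw age column; the column name Age and the toy values are assumptions for illustration only.

import pandas as pd

# Derive a categorical "age group" feature from a raw numeric "Age" column
df = pd.DataFrame({"Age": [10, 25, 40, 70]})
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 17, 65, 120],                 # <18 = child, 18-65 = adult, >65 = senior
    labels=["child", "adult", "senior"],
)
print(df)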
FEATURE ENGINEERING
• Act of extracting features from raw data and transforming them into formats that are
suitable for the machine learning model.
• A crucial step in the machine learning pipeline, i.e., the right features can ease the difficulty of modeling, and therefore enable the pipeline to output results of higher quality.
• Might involve several steps, such as:
1. Data Cleaning: Removing or correcting noisy, missing, or inconsistent data.
2. Data Transformation: Changing the scale or format of the data (e.g.,
normalization, standardization).
3. Feature Creation: Combining or manipulating features to create new ones (e.g.,
adding the "age group" feature mentioned above).
4. Feature Selection: Choosing which features are most important to the model. This
can be done manually or through statistical techniques.
5. Feature Encoding: Converting categorical data into a numeric format, for example, through one-hot encoding or label encoding (see the sketch below).
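As a hedged sketch of step 5, the snippet below shows both one-hot encoding and label encoding in pandas; the toy Neighborhood column is an assumption for illustration.

import pandas as pd

df = pd.DataFrame({"Neighborhood": ["A", "B", "A", "C"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["Neighborhood"], prefix="Neighborhood")
print(one_hot)

# Label encoding: each category mapped to an integer code
labels = df["Neighborhood"].astype("category").cat.codes
print(labels)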
Example: Feature Engineering for a Housing Price Prediction
• Tasked with building a machine learning model to predict the price of houses based
on various factors.
1)Raw Data:
The dataset might have features like:
1. Square footage of the house (numerical)
2. Number of bedrooms (numerical)
3. Number of bathrooms (numerical)
4. Year the house was built (numerical)
5. Neighborhood (categorical)
2)Feature Engineering:
The following transformations could be performed:
1. Square Footage: Create a feature that represents the "price per square foot"
(Price/Square Foot).
2. Age of House: Create a new feature for the "age of the house" by subtracting the year
built from the current year.
3. Neighborhood Encoding: Use one-hot encoding to convert the "Neighborhood"
feature into binary columns (e.g., is the house in neighborhood A? Yes or no).
4. Price Category: Create a categorical feature like "Price Range" based on the price
of the house (e.g., low, medium, high).
• After these transformations, the data is more structured, and the model can better
identify patterns in the features.
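A minimal sketch of these four transformations is given below; the toy values, the bin edges for the price ranges, and the current year (2025) are assumptions, not part of the original example.

import pandas as pd

df = pd.DataFrame({
    "SquareFootage": [1500, 2400, 900],
    "YearBuilt": [1990, 2015, 1975],
    "Neighborhood": ["A", "B", "A"],
    "Price": [300000, 600000, 150000],
})

# 1. Price per square foot
df["PricePerSqFt"] = df["Price"] / df["SquareFootage"]

# 2. Age of the house (assuming the current year is 2025)
df["HouseAge"] = 2025 - df["YearBuilt"]

# 3. One-hot encode the Neighborhood column into binary columns
df = pd.get_dummies(df, columns=["Neighborhood"])

# 4. Price range category (bin edges are illustrative assumptions)
df["PriceRange"] = pd.cut(df["Price"],
                          bins=[0, 200000, 450000, float("inf")],
                          labels=["low", "medium", "high"])
print(df)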
IMPORTANCE OF FEATURES IN MACHINE LEARNING
Features
• Attributes or variables: the individual measurable properties or characteristics of the phenomenon being observed.
• In machine learning, they serve as the inputs to our models, allowing them to learn patterns and make predictions.
•Their quality directly impacts the performance, interpretability, and efficiency of any
machine learning model.
1.Model Performance and Accuracy
Garbage In, Garbage Out: If the features are irrelevant, noisy, redundant, or
poorly chosen, even the most sophisticated algorithms will struggle to achieve good
performance.
Signal vs. Noise: Good features amplify the signal (the underlying patterns) and
reduce the noise in the data, making it easier for the model to learn meaningful
relationships.
Predictive Power: High-quality features are directly correlated with the target variable,
providing strong predictive power.
Overfitting and Underfitting: Poor feature selection can lead to overfitting (model
learns noise) or underfitting (model doesn't capture underlying patterns).
2. Model Interpretability and Explainability:
Understanding Model Decisions: When features are meaningful and well-defined, it's
easier to understand why a model makes certain predictions. This is particularly
important in domains like healthcare or finance where transparency is critical.
Identifying Key Drivers: Feature importance techniques help identify which features
contribute most to the model's output, providing insights into the underlying process.
Trust and Acceptance: Interpretable models are more likely to be trusted and accepted by stakeholders and end-users.
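As one hedged illustration of a feature importance technique, the sketch below reads impurity-based importances from a random forest trained on synthetic data; the dataset and the choice of model are assumptions for illustration only.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Train a forest on synthetic data and inspect which features matter most
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)
for i, score in enumerate(model.feature_importances_):
    print(f"feature {i}: importance {score:.3f}")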
3.Computational Efficiency and Scalability:
Reduced Training Time: Fewer, more relevant features mean less data for the model to process, which shortens training time.
Lower Memory Requirements: Storing and manipulating fewer features reduces memory
consumption, which is crucial for large datasets.
Simpler Models: Models built with fewer features are often simpler, making them easier to
deploy and maintain.
Mitigating the Curse of Dimensionality: As the number of features increases, the sparsity of
data in high-dimensional spaces makes it exponentially harder for models to learn effectively.
4. Data Quality and Preparation:
Feature Engineering: The art of creating new features from raw data to improve model performance; it often involves domain expertise and creativity (e.g., combining existing features, extracting information from text/images).
Feature Scaling and Normalization: Ensuring features are on a similar scale (e.g., min-
max scaling, standardization) prevents features with larger ranges from dominating the
learning process.
Handling Missing Values: Appropriate strategies for dealing with missing data in features (imputation, removal) are essential; see the sketch after this list.
Outlier Detection and Treatment: Outliers in features can skew model training and
lead to inaccurate predictions.
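A minimal sketch of the two missing-value strategies named above (imputation and removal); the toy Income column is an assumption for illustration.

import pandas as pd
import numpy as np

df = pd.DataFrame({"Income": [40000, np.nan, 52000, 61000]})

# Imputation: fill missing entries with the column mean
imputed = df["Income"].fillna(df["Income"].mean())
print(imputed)

# Removal: drop rows that contain missing values
print(df.dropna())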
5. Robustness and Generalization:
Less Prone to Noise: Well-engineered features make models more robust to noisy or
irrelevant data points.
Better Generalization: Models trained on relevant and representative features are more
likely to generalize well to unseen data, rather than just memorizing the training
examples.
Data Types And Features
DATA
• A collection of information gathered by observations, measurements, research, or analysis.
• Consists of facts, numbers, names, figures, or even descriptions of things.
• Data Types: the kinds of values a dataset contains; they determine how the data can be analyzed.
• Dictate the types of transformations and algorithms that can be applied to features.
• Broadly classified into three types:
1.Structured Data
• Information organized in a predefined manner, often in tabular form with rows and columns.
• Easily searchable and efficiently processed by algorithms and databases.
Characteristics
Follows a consistent format and structure, with predefined fields and types.
Stored in relational databases (like SQL databases) or Excel files, which use a table-based format.
Easily queried using standard query languages (SQL) thanks to its organized structure.
Contains clearly defined and constrained types of data, like numbers, dates, and strings.
• Examples: spreadsheets with rows and columns, SQL databases with tables, CSV files with comma-separated values.
2.Semi-structured Data
• Information that does not conform to a rigid structure.
• Contains tags or markers to separate semantic elements and enforce hierarchies of
records and fields within the data.
• Lies between structured and unstructured data.
• Offers more flexibility than fully structured data while maintaining some level of
organization.
Characteristics
Schema or structure is not fixed, allowing for variations and the inclusion of different attributes for different records.
Metadata, i.e., tags or markers, provides information about structure and hierarchy.
Tree-like structure, allowing for nested data elements.
Stored in formats which support nested structures, such as JSON or XML, and processed by NoSQL databases.
• Examples: JSON (JavaScript Object Notation), XML (Extensible Markup Language), YAML (YAML Ain't Markup Language).
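A minimal Python sketch of working with semi-structured data; the JSON record below is a made-up example showing the tags and nested hierarchy described above.

import json

# A record with nested fields and a list: it does not fit a flat table
record = '{"name": "Asha", "contacts": {"email": "a@example.com"}, "tags": ["ml", "data"]}'
data = json.loads(record)
print(data["contacts"]["email"])   # navigate the nested hierarchy
print(data["tags"])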
3.Unstructured Data
• Information that lacks a predefined data model or organization.
• Does not fit neatly into tables or rows and columns.
• Often text-heavy, and includes a wide variety of content types and formats.
Characteristics
Doesn't follow a specific schema or structure.
Can be in the form of text, images, audio, video, social media posts, or emails.
More challenging to store, manage, and analyze due to its lack of structure.
Normally comes in large volumes.
Requires advanced techniques and tools, such as Natural Language Processing (NLP), Machine Learning (ML), Artificial Intelligence (AI), and text analytics, to extract meaningful information and insights.
Despite its complexity, it can provide valuable insights that are not easily obtainable from structured data alone.
• Examples: text documents (Word documents, PDFs), e-mail messages, social media posts (Tweets, Facebook posts), multimedia files (images, audio files, video files), web pages, log files, etc.
Structured Data
• The main focus of data analytics.
• Classified as:
1.Qualitative (non-numeric) data.
2.Quantitative (numeric) data.
1.Qualitative (Categorical) Data
• Features that can take one of a limited number of values.
• Categorizes the data into various categories.
• Explains the features of the data, such as, in statistics, the gender of people, their family name, and occupation.
• Categorized into two:
1. Nominal Data
2. Ordinal Data
1.Nominal Data
• A data type having categories or names without rank or inherent order.
• Used to categorize observations into groups, and the groups are not comparable.
• Represented using frequency tables and bar charts, which display the number or proportion of observations in each category (see the sketch below).
• Example: a frequency table for gender might show the number of males and females in a sample of people.
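A minimal pandas sketch of such a frequency table; the toy gender values are an assumption for illustration.

import pandas as pd

# Frequency table for a nominal feature: counts and proportions per category
s = pd.Series(["M", "F", "F", "M", "F"])
print(s.value_counts())
print(s.value_counts(normalize=True))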
2.Ordinal Data
• Categories with a meaningful order or ranking.
• Measures subjective attributes or opinions, where there is a natural order to the responses.
• Examples: education level (Elementary, Middle, High School, College), job position (Manager, Supervisor, Employee).
2.Quantitative (Numeric) Data
• Values with numeric types (int, float, etc.).
• Used to represent height, weight, length, etc.
• Classified into two:
1. Discrete Data
2. Continuous Data
1.Discrete Data
• Represents countable values with a finite number of outcomes.
• Its values can be easily counted as whole numbers.
• A data type in statistics that takes only discrete, single values.
• Examples: number of students in a class, number of members in a family, etc.
2.Continuous Data
• Can take any value within a range.
• A type of quantitative data that represents the data over a continuous range.
• A variable in the data set can have any value within the range of the data set.
• There is a minimum and a maximum value, and all other values fall between them.
• Examples: weight of each student in a class, salary of workers in a factory.
• Two types of Continuous Data:
1. Interval Scaled Data
2. Ratio Scaled Data
Basic Feature Preprocessing
• drop_duplicates( ) keeps the first occurrence of each duplicated row, but this behavior can be changed with the keep parameter:
df.drop_duplicates(inplace=True, keep="last")
• In the above code, record 6 will be removed and record 7 will be retained, as record 7 is the last occurrence among the duplicates.
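A minimal self-contained sketch of the keep parameter, using a toy DataFrame in place of the People.csv example assumed by the text.

import pandas as pd

df = pd.DataFrame({"Name": ["Anu", "Anu"], "Age": [30, 30]})
# keep="first" (the default) would retain row 0; keep="last" retains row 1
df.drop_duplicates(inplace=True, keep="last")
print(df)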
2.1.2.Removing Duplicates based on Specific Columns
• Remove duplicate records based on specific columns, according to the subset parameter.
• In the example, records are removed if there are duplicates in the column "Gender", using the subset parameter.
import pandas as pd

df = pd.read_csv('People.csv')
# Remove duplicates based on the specific column Gender
df.drop_duplicates(inplace=True, subset='Gender')
print("DataFrame after removing duplicates in column Gender")
print(df)
2.2.Standardizing Data
• Data is often available in different formats, with different values for the same item, and with different scales of measurement for a particular attribute.
• Example
1.Dates recorded in different formats.
2.Currencies may differ.
3.Height or weight recorded in different units.
4.Values for gender stored with different encodings, i.e., both M and 1 for male, both F and 0 for female.
• These inconsistencies must be corrected before doing further analysis; otherwise they produce inappropriate results.
• Methods: formatting, standardizing categories, and normalization (a small formatting sketch follows).
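A minimal formatting sketch; the toy columns JoinDate and HeightCm are assumptions for illustration.

import pandas as pd

df = pd.DataFrame({
    "JoinDate": ["15/04/2021", "03/01/2022"],   # dates stored as dd/mm/yyyy strings
    "HeightCm": [156.0, 172.0],                 # heights recorded in centimetres
})

# Parse the string dates into one consistent datetime type
df["JoinDate"] = pd.to_datetime(df["JoinDate"], format="%d/%m/%Y")

# Convert heights to metres so all records share one unit
df["HeightM"] = df["HeightCm"] / 100
print(df)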
2.2.1.Standardize Categories
• Example
Standardize to one category: classify all males as M (or 1) and, likewise, all females as F (or 0).
import pandas as pd

df = pd.read_csv('People.csv')
print("DataFrame before standardizing the categories for column Gender")
print(df)
# Map the numeric encodings onto the single standard M/F encoding
for x in df.index:
    if df.loc[x, "Gender"] == "1":
        df.loc[x, "Gender"] = "M"
    if df.loc[x, "Gender"] == "0":
        df.loc[x, "Gender"] = "F"
print("DataFrame after standardizing the categories for column Gender")
print(df)
2.3.Correcting Errors
• Need to identify and correct typos, data entry errors, or inconsistencies.
• A data analyst needs to identify all inconsistencies and remove them.
• Data needs to be cross-referenced with external sources, where available, for validation.
• Example
Looking at the record at index position 11, the height of Prachi is recorded as 1560 centimeters, which can never be the height of a human being.
This could be either a typo or a data entry error.
An experienced data analyst would conclude that the value should have been 156 instead of 1560, an error introduced during data entry, and correct it to 156.
The record at index position 11 is now changed to 156.
import pandas as pd

df = pd.read_csv('People.csv')
# Correct the data entry error in the Height column at index position 11
df.loc[11, 'Height'] = 156
2.4.Handling Outliers
Outliers
• An individual data point that is distant from the other points in the dataset.
• An anomaly in the dataset that may be caused by a range of errors in capturing, processing, or manipulating data.
• Data that does not comply with the normal behavior.
Handling Outliers
• Detecting and addressing outliers is crucial in data analysis; otherwise they lead to biased or inappropriate results.
• Outliers happen due to data entry errors or faulty instruments recording data.
• Depending on the context, outliers can be removed, transformed, or otherwise handled.
• Example
Consider the DataFrame containing the value 1560 in the column Height; it is considered an outlier and removed using the drop( ) method of the pandas DataFrame.
import pandas as pd

df = pd.read_csv('People.csv')
print("DataFrame before removing the outlier in column Height")
print(df)
# Drop the row at index 11, which holds the outlying Height value
df.drop(11, inplace=True)
print("DataFrame after removing the outlier in column Height")
print(df)
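Beyond dropping a known bad row by index, outliers can also be detected automatically; the sketch below uses the common IQR rule, which is an assumed technique not named in the text, on toy Height values.

import pandas as pd

df = pd.DataFrame({"Height": [150, 156, 160, 165, 170, 1560]})

# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["Height"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["Height"] < q1 - 1.5 * iqr) | (df["Height"] > q3 + 1.5 * iqr)
print(df[mask])    # the 1560 entry is flagged
df = df[~mask]     # keep only non-outlier rows
print(df)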
3.Feature Scaling
• A technique used in data preprocessing to normalize the range of independent variables
or features of data.
• To standardize the independent features present in the data.
• Crucial in many machine learning algorithms.
• Ensures each feature contributes equally to the result and improves the convergence speed of gradient descent algorithms.
• Performed during data pre-processing to handle highly varying values.
• Example: after scaling, measurements such as 10 m and 10 cm are brought onto a comparable scale rather than being dominated by their raw magnitudes.
3.1.Min-Max Scaling
• A feature scaling technique that rescales data to a fixed range, usually between 0 and 1.
• Used when the data is known to be bounded and all values are wanted within a fixed interval.
• First, find the minimum and the maximum value of the column.
• Then subtract the minimum value from each entry and divide the result by the range (the maximum minus the minimum).
• Formula
X' = (X - Xmin) / (Xmax - Xmin)
X = Original Value
Xmin = Minimum Value in the feature
Xmax = Maximum Value in the feature
X' = Scaled Value
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[100], [200], [300]])
print("Unscaled Data")
print(data)

# Rescale each feature to the [0, 1] range
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
print("Scaled Data")
print(scaled)
3.2.Standardization(Z-score normalization)
• A feature scaling technique that centers the data around the mean and scales it based on
standard deviation.
• The resulting distribution has mean = 0 and standard deviation = 1.
• Formula
X' = (X - μ) / σ
X = Original Value
μ = Mean of the feature
σ = Standard Deviation of the feature
X' = Scaled Value
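A minimal sketch mirroring the MinMaxScaler example above, assuming the same toy data.

from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[100], [200], [300]])
print("Unscaled Data")
print(data)

# Center to mean 0 and scale to standard deviation 1
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
print("Scaled Data")
print(scaled)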