INTRODUCTION TO PYTHON FOR DATA SCIENCE
Pasty Asamoah
+233 (0) 546 116 102
[email protected]
Kwame Nkrumah University of Science and Technology
School of Business
Supply Chain and Information Systems Dept.
Images used in this presentation are sourced from various online platforms. Credit goes to the
respective creators and owners. I apologize for any omission in attribution, and appreciate the
work of the original content creators.
INTRODUCTION TO MACHINE LEARNING
MACHINE LEARNING
Machine learning is a field of AI that involves the development of
algorithms and statistical models that enable computers to learn and
improve their performance on a specific task without being explicitly
programmed.
Supervised learning: learns from labeled data
Unsupervised learning: learns from unlabeled data
Reinforcement learning: learns to make decisions by interacting with an environment
MACHINE LEARNING MODELS
Machine learning models can range from simple linear regression to
complex deep neural networks.
Simple linear regression
SIMPLE LINEAR REGRESSION MODEL
Data preprocessing: clean data, split data
Build model: select model
Evaluate: check accuracy
OUR FIRST MACHINE LEARNING MODEL
Snapshot of the housing dataset
DATA INGESTION
Import packages
Load data
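The ingestion step can be sketched roughly as follows. This is a minimal illustration, not the exact code from the slide: the inline CSV text stands in for the housing file, and the file name and all column names other than `price` are assumptions.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk so the example runs anywhere.
# Column names other than "price" are hypothetical.
csv_text = """price,area,bedrooms,furnishingstatus
250000,1200,3,furnished
180000,850,2,unfurnished
320000,1600,4,semi-furnished
"""

# With a real file this would be pd.read_csv("housing.csv") (hypothetical path).
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the dataset
print(df.shape)    # (number of rows, number of columns)
```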
DATA CLEANING
Handle duplicates
There are no missing values
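One possible way to check for duplicates and missing values with pandas is sketched below. The small DataFrame and its column names are made up for illustration; the real housing dataset may look different.

```python
import pandas as pd

# Hypothetical sample with one duplicated row.
df = pd.DataFrame({
    "price": [250000, 180000, 180000, 320000],
    "area": [1200, 850, 850, 1600],
    "bedrooms": [3, 2, 2, 4],
})

# Count fully duplicated rows, then drop them.
print(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Count missing values per column (all zeros means nothing is missing).
print(df.isnull().sum())
```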
DATA CLEANING
Column data types
We will be working with the integer data types at this stage.
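A short sketch of inspecting column data types and keeping only the numeric columns is given below. The column names are assumptions, and `include="number"` is used here as one way to pick up the integer columns (in this sample every numeric column is an integer).

```python
import pandas as pd

# Hypothetical sample: three integer columns and one text column.
df = pd.DataFrame({
    "price": [250000, 180000, 320000],
    "area": [1200, 850, 1600],
    "bedrooms": [3, 2, 4],
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished"],
})

# Show the data type of every column.
print(df.dtypes)

# Keep only numeric columns; the text column is left out.
numeric_df = df.select_dtypes(include="number")
print(numeric_df.columns.tolist())
```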
FEATURE SELECTION
Predictors
What we want to predict
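Separating predictors from the target might look like the sketch below. The predictor names (`area`, `bedrooms`, `bathrooms`) are hypothetical; only `price` as the target follows from the slides.

```python
import pandas as pd

# Hypothetical sample of the housing data.
df = pd.DataFrame({
    "price": [250000, 180000, 320000, 210000],
    "area": [1200, 850, 1600, 1000],
    "bedrooms": [3, 2, 4, 3],
    "bathrooms": [2, 1, 3, 1],
})

features = ["area", "bedrooms", "bathrooms"]  # predictors (assumed names)
X = df[features]   # what we predict with
y = df["price"]    # what we want to predict

print(X.shape, y.shape)
```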
MODEL SELECTION
Define: What type of model will it be? A decision tree? Some other type of model? Other parameters of the model type are specified too.
Fit: Capture patterns from the provided data. This is the heart of modeling.
Predict: Just what it sounds like.
Evaluate: Determine how accurate the model's predictions are.
In this case, we want to build a very basic linear regression model using the scikit-learn library.
MODEL SELECTION
Importing the linear regression model
Create the model
Train the model
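The import, create, and train steps can be sketched as below, assuming scikit-learn's `LinearRegression`. The data and predictor names are invented for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression  # import the model class

# Hypothetical sample of the housing data.
df = pd.DataFrame({
    "price": [250000, 180000, 320000, 210000, 275000],
    "area": [1200, 850, 1600, 1000, 1350],
    "bedrooms": [3, 2, 4, 3, 3],
})
X = df[["area", "bedrooms"]]
y = df["price"]

model = LinearRegression()  # create the model
model.fit(X, y)             # train the model on predictors and target

# One learned coefficient per predictor, plus an intercept.
print(model.coef_, model.intercept_)
```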
MAKING PREDICTIONS
We predict with a set of predictors
The predictions
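A minimal sketch of predicting with a fitted linear regression model follows. The data, column names, and the "new house" values are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical sample of the housing data.
df = pd.DataFrame({
    "price": [250000, 180000, 320000, 210000, 275000],
    "area": [1200, 850, 1600, 1000, 1350],
    "bedrooms": [3, 2, 4, 3, 3],
})
X = df[["area", "bedrooms"]]
y = df["price"]

model = LinearRegression().fit(X, y)

# Predict for the rows we already have...
predictions = model.predict(X)
print(predictions)

# ...or for a new set of predictors with the same columns.
new_house = pd.DataFrame({"area": [1100], "bedrooms": [3]})
new_prediction = model.predict(new_house)
print(new_prediction)
```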
DECISION TREE
SIMPLE DECISION TREE MODEL
Data preprocessing: clean data, split data
Build model: select model
Evaluate: check accuracy
DECISION TREE MODEL
Machine learning models can range from simple linear regression to
complex deep neural networks.
Decision Tree
DECISION TREE
Import decision tree model from sklearn
Train model
Make predictions
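These steps might be sketched as follows, assuming scikit-learn's `DecisionTreeRegressor` (a regressor, since house price is a number). The data is invented, and the model is deliberately trained and evaluated on the same rows.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor  # import decision tree

# Hypothetical sample of the housing data.
df = pd.DataFrame({
    "price": [250000, 180000, 320000, 210000, 275000],
    "area": [1200, 850, 1600, 1000, 1350],
    "bedrooms": [3, 2, 4, 3, 3],
})
X = df[["area", "bedrooms"]]
y = df["price"]

model = DecisionTreeRegressor(random_state=1)  # fixed seed for repeatability
model.fit(X, y)                                # train model

# Predict on the same rows the model was trained on.
predictions = model.predict(X)

comparison = pd.DataFrame({"actual": y, "predicted": predictions})
print(comparison)
```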
The predicted and actual values are the same. That is 100% accuracy.
BUT WHY?
LET'S MODIFY OUR MODEL BY INTRODUCING TRAINING AND TEST DATASETS
Our model achieved an accuracy of 100%. This is unlikely in real-world scenarios.
The reason for the 100% accuracy is that we were trying to predict Y values from X values that the model had already seen during the training stage.
What about testing our model on data that it has not seen before?
Let's give it a shot!
INGESTION, CLEANING, AND SELECTING VARIABLES
We import the decision tree model
Dependent variable
Independent variable
SPLIT DATA
The method for splitting the data
SPLIT DATA
Split the data: 80% for training and 20% for testing
Dataset for training
Dataset for testing
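An 80/20 split can be sketched with scikit-learn's `train_test_split`, as below. The data is hypothetical, and `random_state` is an added assumption so that the split is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical sample with 10 rows.
df = pd.DataFrame({
    "price": [250, 180, 320, 210, 275, 190, 305, 230, 260, 200],
    "area": [1200, 850, 1600, 1000, 1350, 900, 1500, 1100, 1250, 950],
    "bedrooms": [3, 2, 4, 3, 3, 2, 4, 3, 3, 2],
})
X = df[["area", "bedrooms"]]
y = df["price"]

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))
```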
MODEL SELECTION
Train dataset
Test dataset
MODEL PERFORMANCE
Checks error margin
Error margin
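The sketch below trains on the training set, predicts on the test set, and measures the error. The slides do not name the metric, so mean absolute error (MAE) is assumed here; the data is synthetic and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic housing-like data (assumed columns), 100 rows.
rng = np.random.default_rng(0)
area = rng.integers(500, 3000, size=100)
bedrooms = rng.integers(1, 6, size=100)
price = 100 * area + 10000 * bedrooms + rng.integers(-20000, 20000, size=100)
df = pd.DataFrame({"area": area, "bedrooms": bedrooms, "price": price})

X = df[["area", "bedrooms"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)                 # train on the training dataset only

test_predictions = model.predict(X_test)    # predict on unseen test data
mae = mean_absolute_error(y_test, test_predictions)
print(mae)  # average absolute gap between predicted and actual prices
```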
LET'S MODIFY THE MODEL A BIT BY SPECIFYING LEAVES
Error margin before updating parameter
Error margin after updating parameter
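A before/after comparison could look like the sketch below. It assumes the parameter being updated is `max_leaf_nodes` of `DecisionTreeRegressor` and that the error margin is mean absolute error; the value 20 and the synthetic data are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic housing-like data (assumed columns).
rng = np.random.default_rng(0)
area = rng.integers(500, 3000, size=200)
bedrooms = rng.integers(1, 6, size=200)
price = 100 * area + 10000 * bedrooms + rng.integers(-20000, 20000, size=200)
X = pd.DataFrame({"area": area, "bedrooms": bedrooms})
y = pd.Series(price, name="price")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Before: default tree, free to grow as many leaves as it needs.
default_tree = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
mae_before = mean_absolute_error(y_test, default_tree.predict(X_test))

# After: tree limited to at most 20 leaves (arbitrary illustrative value).
limited_tree = DecisionTreeRegressor(max_leaf_nodes=20, random_state=1)
limited_tree.fit(X_train, y_train)
mae_after = mean_absolute_error(y_test, limited_tree.predict(X_test))

print("before:", mae_before)
print("after: ", mae_after)
```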
PROBLEM OF UNDERFITTING AND
OVERFITTING
DIFFERENT LEVELS OF LEAVES
Error margin is high for 50 leaves
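To compare different numbers of leaves, one option is to loop over candidate values and record the test error for each, as sketched below. The helper `get_mae`, the candidate values, and the synthetic data are all assumptions for illustration. Too few leaves tends to underfit and too many tends to overfit, so the lowest error usually sits somewhere in between.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic housing-like data (assumed columns).
rng = np.random.default_rng(0)
area = rng.integers(500, 3000, size=200)
bedrooms = rng.integers(1, 6, size=200)
price = 100 * area + 10000 * bedrooms + rng.integers(-20000, 20000, size=200)
X = pd.DataFrame({"area": area, "bedrooms": bedrooms})
y = pd.Series(price, name="price")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def get_mae(max_leaf_nodes):
    """Hypothetical helper: fit a tree with the given leaf limit, return test MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))

candidates = [2, 5, 50, 500]
results = {n: get_mae(n) for n in candidates}
for n, mae in results.items():
    print(f"max_leaf_nodes={n}: MAE={mae:.0f}")

# Pick the leaf limit with the lowest test error.
best_size = min(results, key=results.get)
print("best:", best_size)
```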
HANDLING CATEGORICAL DATA
CATEGORICAL DATA
Have you realized that we couldn’t include these attributes in the model?
HANDLE CATEGORICAL COLUMNS
Label Encoder
One-Hot Encoder
Dummies
LABEL ENCODERS
Importing LabelEncoder
LABEL ENCODERS
Columns of interest. We believe that these columns predict house prices. We need to convert them to numerical form.
TRANSFORMING CATEGORICAL COLUMNS
Instantiate label encoder
Transform values
Categorical column to convert
ADD TRANSFORMED COLUMNS TO
DATAFRAME
New column name Transformed values
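The label-encoding steps can be sketched as below, using scikit-learn's `LabelEncoder`. The categorical column names (`mainroad`, `furnishingstatus`) and the `_encoded` suffix for the new columns are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with two categorical columns.
df = pd.DataFrame({
    "price": [250000, 180000, 320000, 210000],
    "mainroad": ["yes", "no", "yes", "yes"],
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished", "furnished"],
})

encoder = LabelEncoder()  # instantiate the label encoder

# Transform each categorical column and store the result under a new name.
# Integer codes follow the sorted order of the labels in each column.
df["mainroad_encoded"] = encoder.fit_transform(df["mainroad"])
df["furnishingstatus_encoded"] = encoder.fit_transform(df["furnishingstatus"])

print(df.head())
```

Label encoding implies an order between categories (0 < 1 < 2), which may or may not be meaningful for a given column.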
SNAPSHOT OF TRANSFORMED
COLUMNS
New columns added
INDEPENDENT & DEPENDENT
VARIABLES
Select columns based on data types. Exclude columns with data type object.
Drop the price column. By default, it will be included because we are selecting all columns other than objects.
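Selecting the variables this way might look like the sketch below. The column names other than `price` are hypothetical, and the text column is given the `object` dtype explicitly so the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical sample: numeric columns plus one text (object) column.
df = pd.DataFrame({
    "price": [250000, 180000, 320000],
    "area": [1200, 850, 1600],
    "bedrooms": [3, 2, 4],
    "mainroad": pd.Series(["yes", "no", "yes"], dtype="object"),
    "mainroad_encoded": [1, 0, 1],
})

# Keep every column that is not of type object, then drop the target,
# which would otherwise be included because it is numeric.
X = df.select_dtypes(exclude="object").drop(columns="price")
y = df["price"]

print(X.columns.tolist())
```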
DUMMIES columns
Pandas method to handle categorical columns
Note that it creates multiple columns for each categorical column, based on the number of unique values in that column
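A minimal sketch of the dummies approach with `pandas.get_dummies` is given below. The column names are hypothetical; the categorical column is listed explicitly in `columns` so only that column is expanded.

```python
import pandas as pd

# Hypothetical sample with one categorical column holding three unique values.
df = pd.DataFrame({
    "price": [250000, 180000, 320000, 210000],
    "area": [1200, 850, 1600, 1000],
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished", "furnished"],
})

# Each unique value becomes its own indicator column, named
# "<column>_<value>"; the original column is removed.
encoded = pd.get_dummies(df, columns=["furnishingstatus"])

print(encoded.columns.tolist())
```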
INDEPENDENT & DEPENDENT
VARIABLES
Select columns based on data types. Exclude columns with data type object.
Drop the price column. By default, it will be included because we are selecting all columns other than objects.
Task 1: Build a model with either linear regression or a decision tree and report on the best model. Remember to apply all the skills and knowledge you have acquired, especially splitting the dataset into training and testing sets and encoding categorical columns.
ENSEMBLE MODELS
RANDOM FOREST MODEL
Ensemble models combine multiple individual models to improve predictive performance. A popular ensemble method is Random Forest, but there are others, such as Gradient Boosting and AdaBoost.
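A random forest can be tried with the same workflow used for the decision tree, as sketched below. This assumes scikit-learn's `RandomForestRegressor` and mean absolute error as the metric; the data is synthetic and `n_estimators=100` is simply the library default written out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic housing-like data (assumed columns).
rng = np.random.default_rng(0)
area = rng.integers(500, 3000, size=200)
bedrooms = rng.integers(1, 6, size=200)
price = 100 * area + 10000 * bedrooms + rng.integers(-20000, 20000, size=200)
X = pd.DataFrame({"area": area, "bedrooms": bedrooms})
y = pd.Series(price, name="price")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A forest averages the predictions of many decision trees.
forest = RandomForestRegressor(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

forest_predictions = forest.predict(X_test)
forest_mae = mean_absolute_error(y_test, forest_predictions)
print(forest_mae)
```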
ANY QUESTIONS??