Supervised Learning with Python
Engr. Elisa G. Eleazar
School of Chemical, Biological, and Materials Engineering and Sciences
DS100: APPLIED DATA SCIENCE 1
Outline
Module 3.8: SUPERVISED LEARNING IN PYTHON
• Classification
• Regression
Learning Outcomes
1. Define Machine Learning and differentiate the types
2. Differentiate Classification from Regression
3. Write Python code for Classification and Regression problems
Supervised Learning
MACHINE LEARNING
• the science and art of giving computers the ability to learn to make decisions from data without being explicitly programmed
• Supervised Learning: uses labeled data (ex: learning to predict whether an email is spam or not)
• Unsupervised Learning: uses unlabeled data (ex: clustering Wikipedia entries into categories)
• Reinforcement Learning: machines or software agents interact with an environment
SUPERVISED LEARNING
• the aim is to build a model that is able to predict the target variable given the predictor variables
• Independent Variable = features = predictor variables
• Dependent Variable = target = response variable
• Classification: the target variable consists of categories
• Regression: the target is a continuous variable
Predictor variables: sepal length, sepal width, petal length, petal width
Target variable: species

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species
0                5.1               3.5                1.4               0.2   setosa
1                4.9               3.0                1.4               0.2   setosa
2                4.7               3.2                1.3               0.2   setosa
3                4.6               3.1                1.5               0.2   setosa
4                5.0               3.6                1.4               0.2   setosa
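The rows above match the first five records of the iris data. The slides do not show how the table was loaded, so the following is only a minimal sketch, assuming the copy of the dataset bundled with scikit-learn and a pandas DataFrame; the name `df` is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled copy of the iris data
iris = load_iris()

# Predictor variables: four numeric measurements per flower
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Target variable: map the integer class codes to species names
df["species"] = iris.target_names[iris.target]

print(df.head())
```

Keeping predictors and target in one DataFrame is convenient for exploration; for model fitting they are usually separated again into a feature array and a label array.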
Python Packages for Machine Learning
Classification
DATA PRE-PROCESSING
EXPLORATORY DATA ANALYSIS
VISUAL EXPLORATORY DATA ANALYSIS
CLASSIFICATION
• process of building a model that is able to predict the categorical target variable given the predictor variables
• Training (labeled) data: data points whose label is already known
K-NEAREST NEIGHBOR
• algorithm that predicts the label of a data point by taking the majority vote of the 'k' closest labeled data points
• Example (from the slide figure): the new point is classified red if k=3; green if k=5
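The majority vote can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code from the slides: the function `knn_predict` and the toy coordinates are hypothetical, chosen so that the vote flips between k=3 and k=5 as in the example above.

```python
from collections import Counter
from math import dist

def knn_predict(points, labels, query, k):
    """Return the majority label among the k labeled points closest to query."""
    # Sort the indices of the labeled points by Euclidean distance to the query
    order = sorted(range(len(points)), key=lambda i: dist(points[i], query))
    # Count the labels of the k nearest points and return the most common one
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two red points near the query, three green points slightly farther away
points = [(1.0, 0.0), (0.0, 1.1), (-1.2, 0.0), (0.0, -1.5), (1.2, 1.2)]
labels = ["red", "red", "green", "green", "green"]
query = (0.0, 0.0)

print(knn_predict(points, labels, query, k=3))  # red
print(knn_predict(points, labels, query, k=5))  # green
```

With k=3 the two nearest red points outvote one green point; with k=5 all three green points are included and the majority changes.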
MODEL BUILDING
• .fit() trains (fits) the model on the labeled data
• .predict() predicts the label of an unlabeled data point
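A minimal sketch of the two calls, assuming scikit-learn's `KNeighborsClassifier` and its bundled iris data. The value `n_neighbors=6` and the two measurements in `X_new` are arbitrary illustrative choices, not values taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target   # features and labels as NumPy arrays

# Assumption: k = 6 is an arbitrary choice for illustration
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)                   # train the model on the labeled data

# Two hypothetical unlabeled flowers:
# [sepal length, sepal width, petal length, petal width] in cm
X_new = [[5.0, 3.4, 1.5, 0.2], [6.7, 3.0, 5.2, 2.3]]
y_pred = knn.predict(X_new)     # one predicted class code per row

print(iris.target_names[y_pred])
```

Note that `.predict()` expects a 2-D input (one row per data point), even when only a single point is being classified.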
Requirements for the use of scikit-learn
• data must be a NumPy array or a pandas DataFrame
• features must be numeric (continuous values rather than categories)
• there must be no missing values
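These requirements can be checked before fitting. The sketch below uses a small hypothetical DataFrame (the column names and values are made up for illustration) and shows one possible clean-up; dropping rows and one-hot encoding are assumptions here, not steps prescribed by the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one categorical column
df = pd.DataFrame({
    "length": [5.1, 4.9, np.nan, 4.6],
    "width": [3.5, 3.0, 3.2, 3.1],
    "color": ["red", "blue", "red", "blue"],
})

# Check: how many values are missing?
n_missing = int(df.isna().sum().sum())

# Check: which columns are not numeric?
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

print(n_missing, non_numeric)

# One possible clean-up (an assumption, other strategies exist):
# drop incomplete rows, then one-hot encode the categorical column
clean = pd.get_dummies(df.dropna(), columns=["color"])
X = clean.to_numpy(dtype=float)   # NumPy array ready for scikit-learn
```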
MEASURING MODEL PERFORMANCE
ACCURACY
• commonly used metric for measuring model performance in classification problems
• number of correct predictions divided by the total number of data points
• normally computed by splitting the data into a training set and a test set:
  • fit/train the classifier on the training set
  • make predictions on the test set
  • compare predictions with the known labels
train_test_split() randomly splits the data
Arguments:
• feature data
• targets/labels
• test size
Results (4 arrays):
• training data
• test data
• training labels
• test labels
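The procedure can be sketched as follows, assuming scikit-learn's bundled iris data and a k-NN classifier. The values `test_size=0.3`, `random_state=42`, `stratify=y`, and `n_neighbors=6` are illustrative assumptions rather than settings taken from the slides.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Arguments: feature data, labels, test size.
# Results: training data, test data, training labels, test labels (in that order).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)        # fit on the training set
y_pred = knn.predict(X_test)     # predict on the test set

# Accuracy = number of correct predictions / total number of predictions
accuracy = np.mean(y_pred == y_test)

# .score() on a classifier reports the same quantity
print(accuracy, knn.score(X_test, y_test))
```

Fixing `random_state` makes the random split reproducible, and `stratify=y` keeps the class proportions the same in both sets.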
MODEL COMPLEXITY
Model Complexity Curve
• Smaller k: more complex model; can lead to overfitting
• Larger k: smoother decision boundary; less complex model
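The data behind a model complexity curve can be computed by refitting the classifier for a range of k values and recording training and test accuracy each time. This is a sketch under assumptions: the iris data, the range 1 to 15, and the split settings are illustrative, and plotting is left out.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Assumption: try k = 1 .. 15
neighbors = range(1, 16)
train_accuracy, test_accuracy = [], []

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_accuracy.append(knn.score(X_train, y_train))  # accuracy on seen data
    test_accuracy.append(knn.score(X_test, y_test))     # accuracy on unseen data

for k, tr, te in zip(neighbors, train_accuracy, test_accuracy):
    print(f"k={k:2d}  train={tr:.3f}  test={te:.3f}")
```

A gap where training accuracy is high but test accuracy is lower suggests overfitting; plotting both lists against `neighbors` gives the curve itself.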
Regression
DATA PRE-PROCESSING
CRIM: per capita crime rate
NX: nitric oxide concentration
RM: average number of rooms per dwelling
MEDV: median value of owner-occupied homes in thousands of dollars (target variable)
VISUAL EXPLORATORY DATA ANALYSIS
MODEL BUILDING: LINEAR REGRESSION AND VALIDATION: R^2
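A minimal sketch of fitting a linear regression and validating it with R^2, assuming scikit-learn's `LinearRegression`. The housing file used in the slides is not reproduced here, so the data below is synthetic: `rooms` and `value` are hypothetical stand-ins for a predictor such as RM and a target such as MEDV, generated from a made-up linear relationship plus noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the housing data from the slides)
rng = np.random.default_rng(0)
rooms = rng.uniform(4, 9, size=200)                        # hypothetical predictor
value = 9.0 * rooms - 34.0 + rng.normal(0, 2.0, size=200)  # hypothetical target

X = rooms.reshape(-1, 1)   # scikit-learn expects a 2-D feature array
y = value

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg = LinearRegression()
reg.fit(X_train, y_train)          # estimate slope and intercept

# For regressors, .score() returns R^2 on the given data
r2 = reg.score(X_test, y_test)

print(reg.coef_[0], reg.intercept_, r2)
```

R^2 measures the share of variance in the target explained by the model; scoring on the held-out test set, as here, gives a less optimistic estimate than scoring on the training data.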