
IDA117V

Introduction to Machine Learning


Machines learning from Data

Machines/models learn to predict the future based on patterns seen in
current data.

A model learns the patterns [features] and the resulting outcomes [targets] of
such patterns.

It is not as straightforward as that, because the data may be very dirty; the
machine can easily learn incorrect things from the data. The remedy is data
preprocessing + careful training mechanisms.

Handling noisy data:

Do EDA so that we can see what the data is, then preprocess the data, e.g.
removing NaN values.

Select training/testing mechanisms that reduce the chances of learning the
noise.
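A minimal sketch of this EDA and cleaning step, assuming the data sits in a CSV file (data.csv is a hypothetical name) and using pandas:

```python
import pandas as pd

# Load the raw data (data.csv is a hypothetical file name).
df = pd.read_csv("data.csv")

# EDA: inspect column types, summary statistics and NaN counts.
df.info()                 # column types and non-null counts
print(df.describe())      # summary statistics
print(df.isna().sum())    # NaN count per column

# Preprocessing option 1: drop rows that contain NaN values.
df_clean = df.dropna()

# Preprocessing option 2: fill NaNs with each numeric column's mean instead.
df_filled = df.fillna(df.mean(numeric_only=True))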
Supervised learning

Supervised machine learning

Models are told exactly what to learn: they must learn to associate certain
patterns with specific targets. There are features [inputs] and targets
[outputs or labels].

Example: building a model to predict rain given atmospheric conditions.


Unsupervised machine learning

Models learn patterns or categories that exist in the data. No labels or
targets are given.

i.e. group similar features/trends together.

Examples: clustering, dimensionality reduction and association rules.


Supervised Machine learning examples

Predict whether an insurance claim is fraud or valid based on some features
[0 = fraud, 1 = valid].

Features: id, policy number, policy status, alive
Target: 0/1


Supervised Machine learning examples

Data about house price prediction.

Features: floor area, number of storeys, size of yard, suburb
Target: price (R)


Classification vs Regression

Classification: the target is a discrete value / a category.
Example: based on certain weather conditions, one can train a machine learning
algorithm to predict whether rain will or will not occur.

Regression: the target is a continuous value.
Example: based on house features, location and size, one can train a machine
learning algorithm to predict the price of a house.
Classification vs Regression

The difference between classification and regression is important because some
measurements are used specifically to evaluate regression models and cannot be
used to evaluate classification models.
Classification: binary class vs multi-class

Binary classification: there are only two targets/outputs/classes in the data.

Multi-class classification: there are more than two targets/outputs/classes in
the data.
Linear regression

Regression: a method that tries to determine the strength and nature of the
relationship between the output (Y) and the independent variable(s) (X).

Linear regression: there is a linear relationship between the independent
variable and the dependent variable: when the independent variable changes, the
dependent variable also changes linearly.

Classic example: the price (Y) of a house may depend on its size (X); when the
size of the house increases, the price tends to increase.
Linear regression continued: car weight vs mileage

Another example of linear regression: identifying linear relationships in data.

[Figure: scatter plot of car weight vs mileage, Laerd Stats 2014]


What a linear regression model learns

A linear regression model learns a line

y = mx + c, where y is the target or dependent variable, x is the independent
variable (the observations), m is the slope, and c is the y-intercept.

Several different lines can be drawn that estimate the relationship between the
x and y variables.
Linear regression

A linear regression model learns the function

ŷ = mx + c that minimises the vertical error (residual) between this line and
all the data points.
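A small numpy sketch of this least-squares idea; the x and y values below are made-up toy data (e.g. house size vs price):

```python
import numpy as np

# Toy data: house size in m^2 (x) vs price (y); made-up numbers.
x = np.array([50, 80, 100, 120, 150, 200], dtype=float)
y = np.array([510, 790, 1000, 1180, 1520, 2010], dtype=float)

# Degree-1 polyfit finds the m and c that minimise the sum of squared
# vertical residuals between the line y_hat = m*x + c and the points.
m, c = np.polyfit(x, y, 1)

y_hat = m * x + c            # predictions on the fitted line
residuals = y - y_hat        # e_i = y_i - y_hat_i
print(f"m = {m:.3f}, c = {c:.3f}")
print("sum of squared residuals:", np.sum(residuals ** 2))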
The residual

We want to know how well the line/model fits the data. We measure how much
error the model makes by calculating the difference between the predicted and
the actual value:

eᵢ = yᵢ − ŷᵢ, where yᵢ is the actual value and ŷᵢ is the predicted value of the
i-th point.
Residuals

http://wiki.engageeducation.org.au/further-maths/data-analysis/residuals/
Residuals

Sometimes the residuals can be negative: eᵢ = yᵢ − ŷᵢ is negative whenever ŷᵢ is
greater than yᵢ.

It is more convenient to work with the squared residuals: eᵢ².

The best-fit line will have the minimum sum of squared residuals, i.e. the sum
of the squared residuals over all the data points: ∑ eᵢ².
Important measures of regression fit

The standard deviation of the residuals is also called the Root Mean Squared
Error (RMSE). A good RMSE value is close to 0, e.g. 0.2.

RMSE = √( ∑ eᵢ² / n ), where n = total number of points in the data.


RMSE for Multiple regression

Definitions:

Simple linear regression (SLR): only one predictor variable [X] (the
independent variable); the other variable, Y, is the response or dependent
variable.

Multiple linear regression (MLR): there is more than one predictor variable
affecting the output/response variable [Y]. The RMSE formula changes slightly
for MLR.
RMSE for SLR and MLR

RMSE (SLR) = √( ∑ eᵢ² / (n − 2) )

RMSE (MLR) = √( ∑ eᵢ² / (n − k − 1) ), where k = number of predictor variables;
the denominator is the degrees of freedom, which is the slight change between
the two.


Important measures of regression fit

Coefficient of determination, R-squared:

R² = 1 − ∑(yᵢ − ŷᵢ)² / ∑(yᵢ − ȳ)², where ȳ is the mean of the actual values.

A good R-squared is close to 1 [100%].
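A short sketch of computing both measures with scikit-learn; the y_true and y_pred arrays are made-up illustration values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values
y_pred = np.array([2.8, 5.1, 7.3, 8.9])   # model predictions

# RMSE: square root of the mean of the squared residuals.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R-squared: fraction of the variance in y explained by the model.
r2 = r2_score(y_true, y_pred)

print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")  # good fit: RMSE near 0, R^2 near 1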


Overfitting and Underfitting

Overfitting: the model knows the training data too well and cannot generalize
to unseen data. Training accuracy is high (e.g. 90%) while testing accuracy is
much lower (e.g. 60%).

Underfitting: the model learned almost nothing from the data, hence it cannot
predict unseen data either; accuracy sits around 50%.


Preventing overfitting

Train with more data.

Data augmentation [artificial data].

Feature selection: removing features that do not inform the outcome.
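A sketch of how a train/test split exposes overfitting, using scikit-learn's built-in breast cancer dataset and an unconstrained decision tree (chosen here only because it overfits easily):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out unseen data so overfitting shows up as a train/test accuracy gap.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An unconstrained decision tree can memorise the training set.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # typically 1.0 here
print("test accuracy:", model.score(X_test, y_test))     # lower: a sign of overfitting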


Classification
Predicting a class, not a continuous variable.
Models for classification

KNN

Logistic regression

Decision trees

SVM

etc.
Logistic regression

Types:

Binary or binomial: the target variable has only two possible values.

Multinomial: the target variable takes three or more unordered values.

Ordinal: the target variable takes three or more ordered values.


Linear regression vs Logistic regression

Linear regression learns y = mx + c; logistic regression passes this linear
function through the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ), which squashes the
output into the range (0, 1).


Logistic regression

Choose a threshold, say 0.5: when the model returns a probability greater than
0.5, the prediction is assigned the positive class (above the threshold on the
sigmoid curve); otherwise it is assigned the negative class.
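A tiny numpy sketch of the sigmoid and the 0.5 threshold; the z values stand in for example outputs of mx + c:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # stand-ins for m*x + c values
probs = sigmoid(z)

# Apply the 0.5 threshold: probabilities above it become class 1.
predictions = (probs > 0.5).astype(int)
print(probs.round(3))   # [0.119 0.378 0.5   0.622 0.881]
print(predictions)      # [0 0 0 1 1]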
Example: LRM

Using the Indian diabetes data, build a logistic regression model.
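A sketch of such a model with scikit-learn, assuming the data is a CSV file (the name diabetes.csv and the Outcome label column are assumptions based on the common Pima Indians diabetes dataset):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed layout: diabetes.csv with an "Outcome" column (1 = diabetic, 0 = not).
df = pd.read_csv("diabetes.csv")
X = df.drop(columns=["Outcome"])
y = df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba gives the sigmoid output; predict applies the 0.5 threshold.
print(model.predict_proba(X_test[:3]))
print(model.predict(X_test[:3]))
print("test accuracy:", model.score(X_test, y_test))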


Example: KNN

KNN
K nearest neighbours: assumes that similar things have many features in common,
e.g. dogs vs cats, lions vs giraffes, etc.
There is a small distance between similar things and a very large distance
between dissimilar things.
KNN

K is a number, a hyperparameter of the model, ranging from 1 to N.

The algorithm computes the distances between a query point and the training
points, and classifies the query point by majority vote among its K nearest
neighbours.
KNN

KNN example: using K = 2.
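A minimal KNN sketch with scikit-learn, using the built-in iris dataset and K = 2 as on the slide:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# n_neighbors is the hyperparameter K: each query point is assigned the
# majority class among its K closest training points.
knn = KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train)

print(knn.predict(X_test[:5]))            # predicted classes for 5 query points
print("test accuracy:", knn.score(X_test, y_test))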
Confusion matrix

TP : True positive

TN: True negative

FN: False negative

FP: False Positive


Confusion matrix

TP: true positive. Predicted diabetes and it is true.
TN: true negative. Predicted no diabetes and it is true.
FN: false negative. Predicted no diabetes yet there is diabetes [Type II error].
FP: false positive. Predicted diabetes yet there is no diabetes [Type I error].
Metrics: recall, precision and accuracy

Measuring model performance:

Recall = TP / (TP + FN). If we make fewer FN (Type II errors), recall will be
close to 1.

Precision = TP / (TP + FP). If we make fewer FP (Type I errors), precision will
be higher, close to 1.

Accuracy = (TP + TN) / (TP + FN + FP + TN)


Metrics: F1-score

Better than accuracy because it takes class imbalance into account.

The harmonic mean of recall and precision:
F1 = 2 × (precision × recall) / (precision + recall).
Unlike recall or precision alone, it can be interpreted across all scenarios.
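A sketch computing all of these metrics with scikit-learn; the y_true and y_pred lists are made-up illustration values:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

# sklearn's confusion matrix is [[TN, FP], [FN, TP]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")

print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("accuracy: ", accuracy_score(y_true, y_pred))   # (TP + TN) / all
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two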
