Data Science Notes

Support vector machines (SVMs) are a supervised machine learning method that finds the optimal hyperplane to categorize new examples. Decision trees are another supervised method that splits data into subsets based on attribute values to classify new examples. Both have advantages and disadvantages, such as SVMs performing poorly on large or noisy datasets while decision trees are prone to overfitting. Ensemble methods like boosting and bagging combine multiple models to overcome weaknesses of individual models.
Decision Trees and Support Vector Machines
Support Vector Machine
A supervised, non-linear machine learning method (it can capture more complex data) that, given the training data, finds the best hyperplane to categorise new examples.

Hyperplanes

A plane of dimensionality lower than your data


If your data is 2D, hyperplane is a line (1D)
If your data is 1D, hyperplane is a point (0D)
If your data is 3D, hyperplane is a plane (2D) etc…

How to separate the data

linearly separable
non-linearly separable

Choosing a hyperplane:

We need to maximise the margin while obeying some constraints.
The margin is the distance from the plane to the closest points – in SVM we optimise this.
SVM considers only the closest (hardest-to-classify) points – these are the support vectors (used to calculate the dot product).
The projection is used to assign a class.

Doing this mathematically:

1. Calculate d, the distance between two vectors

2. The constraint:

We need to ensure that no point is classified on the wrong side of the line.

3. The optimisation

We can find the hyperplane that maximises the distance x2 – x1, which is
the distance between the support vectors.
However, even with this, not all data will fall onto the correct side of any line, so
there is a need to relax the constraint; this is called soft-margin SVM.
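As a minimal sketch, the standard hard- and soft-margin objectives can be written as follows (the symbols w, b, ξᵢ and C are conventional notation, not defined in these notes):

```latex
% Hard-margin SVM: maximise the margin 2/||w||
\min_{w,b} \; \tfrac{1}{2}\|w\|^2
\quad \text{subject to} \quad y_i \left(w^\top x_i + b\right) \ge 1 \;\; \forall i

% Soft-margin SVM: relax the constraint with slack variables \xi_i,
% traded off against the margin by the hyperparameter C
\min_{w,b,\xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{subject to} \quad y_i \left(w^\top x_i + b\right) \ge 1 - \xi_i,\;\; \xi_i \ge 0
```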

The kernel trick


this is used when a reasonable hyperplane can't be drawn in the original space
it maps the lower-dimensional space to a higher-dimensional space

Types of kernel

polynomial kernel where d is a hyperparameter


radial basis function (RBF) kernel
Creates non-linear combinations of the features to lift the samples onto a
higher-dimensional feature space where a linear decision boundary can
separate the classes
The most commonly used kernel in SVM
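A minimal scikit-learn sketch comparing the polynomial and RBF kernels; the toy dataset and hyperparameter values are illustrative assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Polynomial kernel: the degree d is a hyperparameter
poly_svm = SVC(kernel="poly", degree=3, C=1.0).fit(X_train, y_train)

# RBF kernel: the most commonly used kernel in SVM
rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_train, y_train)

print("poly accuracy:", poly_svm.score(X_test, y_test))
print("rbf accuracy: ", rbf_svm.score(X_test, y_test))
```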
Pros and cons of SVMs

Pros:
It works really well with a clear margin of separation
It is effective in high dimensional spaces.
It is effective in cases where the number of dimensions is greater than
the number of samples.
It uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.
Cons:
It doesn't perform well on large data sets because the required training
time is higher
It also doesn't perform very well when the data set is noisy, i.e. the
target classes overlap
SVM doesn't directly provide probability estimates; these are calculated
using an expensive five-fold cross-validation (available via the SVC class
of the Python scikit-learn library)

Decision trees
it is an intuitive algorithm

Entropy and information

entropy quantifies how homogeneous the data is

information gain is:
Simply the difference between the entropy before and after splitting.
If the information gain is high, the entropy after splitting is almost zero; if it is
low, the entropy remains almost 1.
So we want to find the combination of splits and thresholds that maximises
the information gain over the tree
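A minimal sketch of computing entropy and the information gain of a candidate split; the binary labels and the helper names are illustrative assumptions:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child

# Toy example: a split that separates the classes perfectly has maximal gain
parent = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(parent, parent[:3], parent[3:]))  # -> 1.0
```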

Decision trees pros and cons

Pros:
Computationally cheap to use, easy for humans to interpret the results, and it
can deal with irrelevant features
Cons:
Prone to overfitting (the model fits the training data so closely that noise in
new data harms its performance)

Boosting + Bagging
To overcome the limitations of a weak learner we can use boosting or bagging.
Both methods use an ensemble of weak learners to build a strong learner
Boosting – choose the next learner based on the errors of the last learner
(gradient-boosted decision trees)
Bagging – stochastically choose the next learners (random forests)
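A minimal scikit-learn sketch of the two ensemble styles mentioned above; the dataset and hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging-style ensemble: many trees trained on bootstrap samples
bagging = RandomForestClassifier(n_estimators=100, random_state=0)

# Boosting-style ensemble: each tree fits the errors of the previous ones
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("random forest", bagging), ("gradient boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```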
Exploratory Data Analysis
method of looking at data that does not include formal statistical modelling
and inference

Classes of EDA:
Univariate non-graphical
Univariate graphical
Multivariate non-graphical
Multivariate graphical

Univariate graphical
looking at a single variable from an experiment and getting an idea about the
distribution of its values

Categorical, Ordinal, Interval data


A categorical variable (sometimes called a nominal variable) is one that has
two or more categories, but there is no intrinsic ordering to the categories.
Example – hair colour
An ordinal variable is similar to a categorical variable. The difference between
the two is that there is a clear ordering of the categories.
Example – economic groups
An interval variable is similar to an ordinal variable, except that the intervals
between the values of the numerical variable are equally spaced.
Example – evenly spaced price ranges

Categorical non-graphical representations


The characteristics of interest for a categorical variable are simply the
range of values and the frequency of occurrence of each value
A simple tabulation of the frequency of each category is the best univariate
non-graphical EDA for categorical data
Quantitative data representations
The characteristics of the population distribution of a quantitative variable are
its centre, spread, modality (number of peaks in the probability distribution
function), shape (including “heaviness of the tails”), and outliers.

Non graphical representations of quantitative data


In most situations it is worthwhile to think of univariate nongraphical EDA as
telling you about aspects of the histogram of the distribution of the variable of
interest.
If the quantitative variable does not have too many distinct values, a
tabulation, as we used for categorical data, will be a worthwhile univariate,
non-graphical technique.
Mostly, for quantitative variables we are concerned here with the quantitative
numeric (non-graphical) measures which are the various sample statistics

Descriptors of quantitative data


1. Modality - the number of peaks there are

2. Central tendency

Mean - the common and useful measures are the arithmetic mean, median
and mode. There are other means such as the geometric, harmonic, truncated
or Winsorized means
Median - the middle value after all values are placed in an ordered list. For
symmetric distributions, the mean and median coincide

3. Spread - how far away from the centre we are still likely to find data values.
The standard deviation is the square root of the variance
Variance and standard deviation
Variances have the important property that they are additive over any number
of different independent sources of variation
The standard deviation has the same units as the original data
Inter-quartile range
The IQR is a more robust measure of spread than the variance or
standard deviation.
The IQR is not affected by extreme outliers as strongly (if at all).
Percentiles
a more flexible version of quartiles
4. Skew - measure of asymmetry

5. Kurtosis - measure of peakedness relative to a Gaussian shape
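A minimal sketch of computing these descriptors for a sample with NumPy and SciPy; the data here is randomly generated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=1000)  # illustrative sample

print("mean:    ", np.mean(x))
print("median:  ", np.median(x))
print("variance:", np.var(x, ddof=1))            # additive across independent sources
print("std dev: ", np.std(x, ddof=1))            # same units as the data
q1, q3 = np.percentile(x, [25, 75])
print("IQR:     ", q3 - q1)                       # robust measure of spread
print("skew:    ", stats.skew(x))                 # asymmetry
print("kurtosis:", stats.kurtosis(x))             # peakedness relative to a Gaussian
```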

Univariate and Graphical


1. Histograms

A barplot in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values
The only one of these techniques that makes sense for categorical data
Generally you will choose between about 5 and 30 bins. It is often
worthwhile to try a few different bin sizes/numbers
It is very instructive to look at multiple samples from the same population to
get a feel for the variation that will be found in histograms

2. Boxplots

Boxplots are very good at presenting information about the central tendency,
symmetry and skew, as well as outliers, although they can be misleading
about aspects such as multimodality

3. Outliers

The term “outlier” is not well defined in statistics, and the definition varies
depending on the purpose and situation. The “outliers” identified by a boxplot,
which could be called “boxplot outliers” are defined as any points more than
1.5 IQRs above Q3 or more than 1.5 IQRs below Q1.
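A minimal sketch of the 1.5-IQR boxplot-outlier rule; the small data array is illustrative:

```python
import numpy as np

x = np.array([1.2, 1.5, 1.7, 2.0, 2.1, 2.3, 2.4, 9.8])  # 9.8 is an obvious outlier
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Boxplot outliers: more than 1.5 IQRs below Q1 or above Q3
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print("boxplot outliers:", outliers)
```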

4. Violin plots

A violin plot is like a box plot, which shows peaks in the data. It is used to
visualize the distribution of numerical data. Unlike a box plot that can only
show summary statistics, violin plots depict summary statistics and the
density of each variable.

Multivariate non graphical


1. Cross tabulation - the basic bivariate non-graphical EDA technique

2. Correlation of variables

Cramer’s V is used to calculate the correlation between nominal categorical
variables. Recall that nominal variables are ones that take on category labels
but have no natural ordering.
The value for Cramer’s V ranges from 0 to 1, with 0 indicating no association
between the variables and 1 indicating a strong association between the
variables.
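A minimal sketch of computing Cramér's V from a contingency table of two nominal variables; the table values are an illustrative assumption:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: rows = hair colour, columns = eye colour
table = np.array([[30, 10,  5],
                  [15, 25, 10],
                  [ 5, 10, 20]])

chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()
r, k = table.shape
cramers_v = np.sqrt(chi2 / (n * (min(r, k) - 1)))
print("Cramér's V:", cramers_v)  # 0 = no association, 1 = strong association
```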

3. Quantitative variable statistics (covariance)


For two quantitative variables, the basic statistics of interest are the sample
covariance and/or sample correlation. Positive covariance values suggest that
when one measurement is above the mean the other will probably also be
above the mean, and vice versa.

Correlation is closely related to covariance

4. Covariance and correlation matrices

When we have many quantitative variables, the most common non-graphical
EDA technique is to calculate all of the pairwise covariances and/or
correlations and assemble them into a matrix.

Graphical multivariate
1. Univariate plots by category

When we have one categorical (usually explanatory) and one quantitative
(usually outcome) variable, graphical EDA usually takes the form of
“conditioning” on the categorical random variable. This simply indicates that
we focus on all of the subjects with a particular level of the categorical
random variable, then make plots of the quantitative variable for those
subjects.

2. Scatterplots

For two quantitative variables, the basic graphical EDA technique is the
scatterplot which has one variable on the x-axis, one on the y-axis and a point
for each case in your dataset. If one variable is explanatory and the other is
outcome, it is a very, very strong convention to put the outcome on the y
(vertical) axis.
In a scatterplot we can increase the dimensionality with things like marker
size, colour, shape etc… but don’t go too far or you will simply overload the
viewer.

Clustering and Dimensionality Reduction
Clustering
Finding natural groups in data
Automatically identifying what is common between data points.

Dimensionality reduction
Transforming a high dimensional space to a low dimensional one.
For visualisation
For feature selection
Can help combat the curse of dimensionality

K-means clustering
One of the simplest clustering approaches; its limitations include:
Assumes spherical distributions
Assumes equal cluster sizes
Requires an estimate of k
A hard assignment method – each point belongs to one and only one cluster
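A minimal scikit-learn sketch; the toy blobs data and k = 3 are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with three roughly spherical clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k must be supplied up front; each point gets exactly one (hard) label
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
print(kmeans.labels_[:10])
```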

Distances
K-means relies on the Euclidean distance
But there are many other ways that we can measure distance
We need to take a quick look at some of these:
where are they useful, and where are they problematic?

What is Euclidean distance?


The simplest distance metric
It measures the straight-line distance between vectors x and y over their dimensions i
The Euclidean distance is simple to calculate and can be useful.
However, one must be careful that all units of each dimension have been
normalised, or it will skew the distance.
Also, the Euclidean distance tends to become less useful as we move into
spaces with dimensionality much greater than three.
What is Cosine distance?
The cosine of the angle between x and y
Overcomes some of the difficulties encountered by the Euclidean distance in
higher dimensions
Only considers directions of vectors and not their magnitude

What is Manhattan Distance?


The distance between two points if they could only move at right angles
Rather similar to the Euclidean distance.
However, the Manhattan distance does not suffer as badly as the Euclidean
distance in higher dimensions.
As with Euclidean distance, care should be taken to normalise all dimensions
before calculating the Manhattan distance.

What is Minkowski Distance?


A generalisation of Manhattan and Euclidean distances

Minkowski distance allows for a great deal of flexibility


It is also advisable to explore the simpler metrics first and develop a feeling
for how the choice of p might affect the performance of the distance metric
that you choose
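A minimal sketch of the four metrics using scipy.spatial.distance; the two example vectors are arbitrary:

```python
from scipy.spatial import distance

x = [1.0, 2.0, 3.0]
y = [2.0, 0.0, 5.0]

print("Euclidean:", distance.euclidean(x, y))
print("Cosine:   ", distance.cosine(x, y))          # 1 - cos(angle); ignores magnitude
print("Manhattan:", distance.cityblock(x, y))       # right-angle ("city block") moves only
print("Minkowski:", distance.minkowski(x, y, p=3))  # p=1 -> Manhattan, p=2 -> Euclidean
```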

Scaling vectors
In several cases above we mentioned normalizing degrees of freedom
This is generally VERY important in machine learning
We should ensure that all dimensions are of a similar scale
Otherwise arbitrary choices of unit could show up as important trends in data
We can rescale, standardize or normalize

Rescaling - MinMaxScaler
Subtracts the minimum value in the feature and then divides by the range
MinMaxScaler preserves the shape of the original distribution. It doesn’t
meaningfully change the information embedded in the original data.
Note that MinMaxScaler doesn’t reduce the importance of outliers.

Rescale data - RobustScaler


Transforms the feature vector by subtracting the median and then dividing by
the interquartile range (75% value — 25% value)
Note that the range for each feature after RobustScaler is applied is larger
than it was for MinMaxScaler.
Use RobustScaler if you want to reduce the effects of outliers, relative to
MinMaxScaler.

Rescaling – StandardScaler
Standardizes a feature by subtracting the mean and then scaling to unit
variance
Unit variance means dividing all the values by the standard deviation
StandardScaler is the industry’s go-to algorithm
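A minimal scikit-learn sketch contrasting the three scalers on the same illustrative feature (the single column with one large outlier is an assumption made for demonstration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an outlier, as a column vector
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print("MinMax:  ", MinMaxScaler().fit_transform(X).ravel())    # squashes into [0, 1]; outlier dominates
print("Robust:  ", RobustScaler().fit_transform(X).ravel())    # median/IQR; less affected by the outlier
print("Standard:", StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
```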

Clustering algorithms – Gaussian Mixture Model

A Gaussian mixture is a function that is comprised of several Gaussians, each
identified by k ∈ {1,…, K}, where K is the number of clusters of our dataset.
Each Gaussian k in the mixture is comprised of the following parameters:
A mean μ that defines its centre.
A covariance Σ that defines its width. This would be equivalent to the
dimensions of an ellipsoid in a multivariate scenario.
A mixing probability π that defines how big or small the Gaussian function
will be.

GMM expectation maximisation

Start with an initial set of parameters


Repeat until convergence:
Use parameters to calculate latent variables of the model
Use the latent variables to obtain new optimal parameters

GMM - difference from k-means

GMM assigns a probability of belonging to a cluster; k-means assigns only 1
or zero
GMM can handle different shapes of cluster, depending on how free we allow
the covariance matrix to be
GMM is generally more expensive but more nuanced.
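A minimal scikit-learn sketch; the data is illustrative, and covariance_type="full" is chosen so each cluster can take its own elliptical shape:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fitted by expectation maximisation; each component has a mean, covariance and mixing weight
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

print(gmm.means_)                 # cluster centres (mu)
print(gmm.weights_)               # mixing probabilities (pi)
print(gmm.predict_proba(X[:3]))   # soft assignments, unlike k-means' hard labels
```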

Curse of dimensionality
As the dimension of the data increases, the growth in volume causes data to become sparse
This causes problems with statistical significance
Sparsity also makes points look uniformly dissimilar, so distance-based methods suffer

Reducing dimensionality

Dimensionality reduction generally involves finding a smaller set of dimensions
that preserve as much information about the data as possible

Principal component analysis


Takes all the factors in the original data and uses them to form new factors
which are:
uncorrelated with one another
ranked in order of importance
The steps of PCA:

Step 1: Standardise the data, transforming all dimensions to zero mean and
unit variance

z = (value − mean) / standard deviation

Step 2: Set up the covariance matrix

Step 3:
Obtain the eigenvectors of the covariance matrix
The eigenvectors provide us with a new set of vectors, and each eigenvalue
tells us how much of the original data that eigenvector explains

Step 4: Choose how many components to keep; the retained eigenvectors
form the feature vector

Step 5: Recast the original data

The feature vector formed from the eigenvectors of the covariance matrix is
used to reorient the data from the original axes to the ones represented by the
principal components
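A minimal NumPy sketch of these five steps on randomly generated illustrative data; in practice sklearn.decomposition.PCA performs an equivalent decomposition for you:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])  # correlated toy data

# Step 1: standardise to zero mean and unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix
cov = np.cov(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]           # rank by importance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the top components -> the "feature vector"
n_components = 2
feature_vector = eigvecs[:, :n_components]

# Step 5: recast the original data onto the principal components
X_pca = Z @ feature_vector
print(eigvals / eigvals.sum())  # fraction of variance explained by each component
print(X_pca[:3])
```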

Other ways to reduce dimensionality:

Non-negative matrix factorization


Stochastic methods

Eigenvectors and eigenvalues


Eigenvectors are a special set of vectors associated with a linear system of
equations (i.e., a matrix equation)
In our case the linear equations are nothing but the original data features
Eigenvalues are the weights associated with each eigenvector
There are numerous methods for obtaining the eigenvectors and eigenvalues
of a given matrix
Linear Regression & Naive Bayes Classification
Linear regression

Establish if there is a relationship between two variables


Forecast new observations
The least squares method
How to find the linear equation that best fits the data

Within this, there are two kinds of variables:

the dependent variable - its values depend on another variable
the independent variable - independent
y = mx + c is the usual form, but the model never fits the data exactly, so
errors need to be taken into account via a residual term. Errors can also
arise because of noise

Minimising the errors

We focus on the squared errors.
This is usually done by calculating the coefficients:
getting the mean values of the data and calculating the slope and intercept

Measuring the quality of the regression:


1. Sum of squared errors (SSE)

Pro – works to compare models for the same data
Cons – not easy to interpret; it scales with the number of data points and
is in squared units
2. Root mean squared error (RMSE)

Divide by the number of samples so it no longer depends on the data size
Take the square root so it is in the same units as the underlying data
If it is small compared with the typical size of the data values, the fit is good, and vice versa
3. R squared value

The denominator is the SSE that results if we predict every y value to be the
average of the y in the data
Captures the value added by using a model:
r2 = 0 – no value
r2 = 1 – perfect model
Allows comparison across models and data
It is unitless
Beware – for hard problems even a good model can have r2 close to zero
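A minimal sketch of the three metrics on an illustrative straight-line fit; the use of scikit-learn's LinearRegression and metrics here is an assumption about tooling:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0 + rng.normal(scale=1.0, size=100)  # y = mx + c + noise

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

sse = np.sum((y - y_pred) ** 2)                # scales with the number of points
rmse = np.sqrt(mean_squared_error(y, y_pred))  # same units as y
r2 = r2_score(y, y_pred)                       # 0 = no value added, 1 = perfect
print(sse, rmse, r2)
```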

Multiple linear regression


can have more than one independent variable
same procedure as the linear regression above
use r2 values to decide if the model has improved with the addition of new
independent variables

Classification with Naive Bayes


Example:

we have some base distribution whose labels we know, but we want to be
able to predict the labels for some data which is not in this set

Terminology:
1. Posterior - the probability that a point belongs to class y given the observed data

2. Prior - the probability that the data belongs to y without any other info

3. Likelihood - the probability of the observed data given the class (the conditioning is reversed relative to the posterior)

Now, with all of this, we can compare the classes


Probabilities are multiplicative

When introducing a new data point, we can project it onto the one-dimensional
marginal distributions first, and from there determine what we will be
calculating

Setting the priors

This is what we believe about the data before making any measurement
So here since it is a new data point, we know nothing about x1 and x2 of a
point - how likely is it going to belong to y0 and y1
If we look at the training data ½ belongs to y0 and ½ to y1, so it is reasonable
to say that a new point is 50:50 in the absence of any information
Therefore p(y0)/p(y1) = 0.5/0.5 = 1

We can calculate the log likelihoods for the various terms in the previous
equation.
It turns out that the log likelihood of belonging to the red distribution is ~ -11
and that for the blue distribution is ~ -4. So the blue distribution is more likely
It is interesting to note that the biggest difference was for the x_2 value
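A minimal scikit-learn sketch of Gaussian naive Bayes on two features and two classes, mirroring the x1/x2, y0/y1 example above; the generated data and the new point are illustrative assumptions:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two classes (y0 = 0, y1 = 1) with a 50:50 prior, described by features x1 and x2
X_y0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X_y1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))
X = np.vstack([X_y0, X_y1])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)  # fits 1-D Gaussian marginals per feature, per class

new_point = np.array([[2.5, 3.5]])
print(clf.predict_proba(new_point))  # posterior probabilities for y0 and y1
print(clf.predict(new_point))        # most likely class
```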
Modelling Data
Machine learning
computer systems which are able to learn and adapt without following explicit
instructions
A numerical simulation always follows the same algorithm and always gives
the same outcome (ML updates the model on the basis of the data observed)
ML starts with a core algorithm and some data and then updates parameters
within the core algorithm to best represent the data observed
it is essentially representation + evaluation + optimisation

Supervised ML

Data plus labels


Learning a function that maps an input to an output based on example
input-output pairs.

Unsupervised ML

Data do not have labels


Identifying trends in unlabelled datasets
E.g. cluster analysis, is used for exploratory data analysis to find hidden
patterns or grouping in data

Classification

Identifying to which of a set of categories a new sample belongs, on the basis
of a training set
E.g. a spam filter, or which crystal structure gives a certain pattern

Regression

Models a target prediction value based on the independent variables
Linear regression is a classical method, whereas neural-network-type models
are deep methods
Evaluation/model selection
Parameters and hyperparameters

Parameters - properties of the model which are modified during training

Hyperparameters - values that define the model and how it trains, but they
are not updated during training

Features

When machine learning approaches the data, it will consist of several or more
features, which are simply input variables for the model
A feature is a measurable quantity of something that is observable
Models need features to learn
examples would include the identification and differentiation of a cat and a
car
Feature engineering includes transforming raw data into features that better
represent the underlying problem. It turns inputs into things that an
algorithm can understand. But sometimes some inputs are not algorithm
ready, so we need to convert them into something useful. This is where one-hot
encoding comes into play

One-hot encoding

It is used for classification problems, where the vector's length is the same as the
number of categories.
Each element is the probability that the data represents a given class
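A minimal sketch of one-hot encoding a categorical feature; the hair-colour categories are illustrative:

```python
from sklearn.preprocessing import OneHotEncoder

# Each category becomes one element of a vector of length n_categories
colours = [["brown"], ["blonde"], ["black"], ["brown"]]
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(colours).toarray()  # dense array, one column per category

print(encoder.categories_)  # alphabetical: black, blonde, brown
print(one_hot)              # e.g. "brown" -> [0. 0. 1.]
```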

Optimisation/Evaluation
1. Evaluation
An objective function or scoring function distinguishes good models from
bad ones. It must always represent the goodness of a model in a single
number

Evaluation metrics

1. Mean squared error


used in regression since the square ensures a single minimum, avoids trapping
in local minima, and is easy to calculate
2. Mean absolute error

this is similar to mean squared error but there is no quadratic term


it is more robust to outliers, but MSE penalises large differences more
than MAE
3. Huber loss

quadratic close to the minimum and linear far from the minimum
it is more expensive to calculate, but it overcomes the problems of MSE and MAE
4. Cross entropy

it is used for classification problems and tells us how similar our model
distribution is to the true distribution
penalises all errors, but particularly those which are most inaccurate
5. Hinge loss

used for classification


it does not seek to reproduce the distribution of the data

Test and validation sets


The model must always be validated on data that is not used for training.
Most of the time, only 20% of the data is used for validation.
Need to make sure that the validation and training distributions are the same
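A minimal scikit-learn sketch of an 80/20 hold-out split; the dataset is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 20% of the data for validation; stratify so the class
# distributions of the two splits stay the same
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_train.shape, X_val.shape)  # (800, 20) (200, 20)
```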
Neural Networks & Deep Learning
History of deep neural nets (DNN)
originally a device (the perceptron) which was intended for binary classification
it produces a single output from a matrix of inputs using weights and biases
early neural networks had a single layer
what was key to NNs:
the back-propagation algorithm, which was put in place to reduce the error
the gradients could be used to minimise the error
modifications back-propagate through the network using the chain rule

Layers of a DNN
it is a multi-layer perceptron

the layers are also called fully connected layers

There are two ways to program a NN (TensorFlow/Keras), as sketched below:

sequential: quick and easy

functional: more complex but flexible
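A minimal sketch of the same small model written both ways; the layer sizes and the binary-classification head are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sequential API: quick and easy, a plain stack of layers
sequential_model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Functional API: more verbose but flexible (branches, multiple inputs/outputs)
inputs = keras.Input(shape=(10,))
x = layers.Dense(32, activation="relu")(inputs)
outputs = layers.Dense(1, activation="sigmoid")(x)
functional_model = keras.Model(inputs=inputs, outputs=outputs)

sequential_model.summary()
functional_model.summary()
```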

Activation functions
1. Linear: simplest form of activation function

2. Sigmoid

It suffers from the vanishing gradient problem.
Secondly, its output isn't zero-centred: 0 < output < 1, which makes the
gradient updates go too far in different directions and makes optimisation
harder.
Sigmoids saturate and kill gradients.
Sigmoids have slow convergence.

3. Tanh

Output is zero centered


Usually preferred to sigmoid as it converges better
Still it suffers from vanishing gradient problem

4. ReLU

Roughly a 6× improvement in convergence over the tanh function


Should only be used within the hidden layers of a neural network model
the most used activation function

5. LeakyReLU
Some ReLU gradients can be fragile during training and can die:
a weight update can make a neuron never activate on any data point again,
so ReLU can result in dead neurons.
LeakyReLU gives a small slope for negative inputs to avoid this.

Backpropagation

Optimisation
1. First order

Use the gradient of the loss function with respect to the parameters to minimise the loss
Relatively quick, but ignores curvature

2. Second order

Calculate the second derivative of the loss function with respect to the
parameters
Slower per step, but includes curvature so can be quicker overall

3. Stochastic gradient descent

Gradient descent – calculate the gradient of the loss of the entire set with
respect to the parameters
SGD – calculated per sample rather than on the entire batch. This is much
quicker to calculate, but can lead to high variance
Mini-batch SGD – calculate the loss gradient on batches of a set size, which is
essentially the best of both worlds

4. Momentum

High-variance oscillations in SGD make it hard to converge


Momentum softens the oscillations in irrelevant directions

5. Nesterov

Momentum still has problems:


the momentum is high even close to the minimum, so we often overshoot
Nesterov accelerated gradient jumps ahead in the momentum direction, then
estimates a correction to update the parameters

6. Adaptive Methods

Some parameters update much more often than others


Therefore different learning rates can be appropriate for different parameters
Adagrad modifies the learning rate η at each time step for every parameter
based on the past gradients computed for that parameter

7. AdaDelta
Adagrad suffers because the gradients from all previous steps are
accumulated, so the learning rate continuously decays
AdaDelta circumvents this by storing gradients only from n previous steps

8. Adam

Similar to AdaDelta
Add in information about the mean of the momentum of previous steps too
Works very well in most situations

Regularisation
