Data Science Notes
Machines
Support Vector Machine
A supervised, non-linear machine learning method (it can capture
more complex data) which, given the training data, finds the best hyperplane
to categorise new examples
Hyperplanes
linearly separable
non-linearly separable
Choosing a hyperplane:
2. The constraint:
need to ensure that no point is classified on the wrong side of the line
3. The optimisation
We can find the hyperplane that maximises the distance x2 – x1, which is
the distance between the support vectors.
However, even with this, not all data will fall on the correct side of any line, so
there is a need to relax the constraint; this is called soft margin SVM
Types of kernel
Pros:
It works really well with a clear margin of separation
It is effective in high dimensional spaces.
It is effective in cases where the number of dimensions is greater than
the number of samples.
It uses a subset of training points in the decision function (called support
vectors), so it is also memory efficient.
Cons:
It doesn’t perform well when we have a large data set because the
required training time is higher
It also doesn’t perform very well when the data set has more noise, i.e.
the target classes are overlapping
SVM doesn’t directly provide probability estimates; these are calculated
using an expensive five-fold cross-validation (available via the SVC class
of the Python scikit-learn library)
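A minimal scikit-learn sketch of a soft-margin SVM, assuming a made-up toy dataset and illustrative parameter values (not from these notes):

```python
# Hedged sketch: fit a soft-margin SVM with an RBF kernel on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C controls the softness of the margin: smaller C relaxes the constraint more.
# probability=True triggers the expensive internal cross-validation mentioned above.
clf = SVC(kernel="rbf", C=1.0, probability=True)
clf.fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)
```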
Decision trees
it is an intuitive algorithm
Information gain is:
Simply the difference between the entropy before and after splitting
If the information gain is high, the entropy after the split is almost zero; if the
information gain is low, the entropy after the split stays almost at its maximum
of 1 (for two balanced classes)
So we want to find the combination of splits and thresholds that maximises
the information gain over the tree
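A rough NumPy sketch of entropy and information gain for a single binary split, using a made-up toy label array:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:3], parent[3:]          # a perfect split
print(information_gain(parent, left, right))  # 1.0 bit: high gain, child entropy ~0
```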
As a whole:
Pros:
Computationally cheap to use, easy for humans to understand the results, and it
can also deal with irrelevant features
Cons:
Prone to overfitting (the model is trained so closely to the training data that
noise in the test data can have a negative impact on the model’s performance)
Boosting + Bagging
To overcome the limitations of a weak learner we can use boosting or bagging.
Both methods use an ensemble of weak learners to build a strong learner
Boosting – choose next learner based on the errors of the last learner
(gradient boosted decision trees)
Bagging – stochastically choose next learners (random forests)
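A short scikit-learn sketch contrasting the two ensemble styles; the toy data and estimator counts are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Boosting: trees are built sequentially, each fitted to the errors of the previous ones.
boosted = GradientBoostingClassifier(n_estimators=100, random_state=0)

# Bagging: trees are built independently on random bootstrap samples and averaged.
bagged = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("gradient boosting", boosted), ("random forest", bagged)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```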
Exploratory Data Analysis
method of looking at data that does not include formal statistical modelling
and inference
Classes of EDA:
Univariate graphical
Univariate non-graphical
Multivariate graphical
Multivariate non-graphical
Univariate graphical
looking at a single variable from an experiment and getting an idea about the
distribution of its values
2. Central tendency
Mean - the common and useful measures are the arithmetic mean, median
and mode. There are other means such as the geometric, harmonic, truncated
or Winsorized means
Median - the middle value after all values are placed in an ordered list. For
symmetric distributions, the mean and median coincide
3. Spread - how far away from the centre we are still likely to find data values.
The standard deviation is the square root of the variance
Variance and Standard deviation
An important property of variances is that they are additive across any number
of different independent sources of variation
Standard deviation has the same units as the original data
Inter-quartile range
The IQR is a more robust measure of spread than the variance or
standard deviation.
The IQR is not affected by extreme outliers as strongly (if at all).
Percentiles
a more flexible version of quartiles (see the NumPy sketch after this list)
4. Skew - measure of asymmetry
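The NumPy/SciPy sketch referenced above, computing these summary measures on a made-up sample:

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 2.3, 2.9, 3.1, 3.4, 3.8, 4.0, 9.5])  # toy sample with one large value

print("mean:", x.mean())
print("median:", np.median(x))
print("variance:", x.var(ddof=1))    # sample variance
print("std dev:", x.std(ddof=1))     # same units as the data
q1, q3 = np.percentile(x, [25, 75])
print("IQR:", q3 - q1)               # robust: barely affected by the outlier at 9.5
print("skew:", stats.skew(x))        # positive => right-skewed
```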
2. Boxplots
Boxplots are very good at presenting information about the central tendency,
symmetry and skew, as well as outliers, although they can be misleading
about aspects such as multimodality
3. Outliers
The term “outlier” is not well defined in statistics, and the definition varies
depending on the purpose and situation. The “outliers” identified by a boxplot,
which could be called “boxplot outliers” are defined as any points more than
1.5 IQRs above Q3 or more than 1.5 IQRs below Q1.
4. Violin plots
A violin plot is like a box plot, which shows peaks in the data. It is used to
visualize the distribution of numerical data. Unlike a box plot that can only
show summary statistics, violin plots depict summary statistics and the
density of each variable.
2. Correlation of variables
Graphical multivariate
1. Univariate plots by category
univariate plots (e.g. boxplots or histograms) drawn separately for each category
or group of subjects
2. Scatterplots
For two quantitative variables, the basic graphical EDA technique is the
scatterplot which has one variable on the x-axis, one on the y-axis and a point
for each case in your dataset. If one variable is explanatory and the other is
outcome, it is a very, very strong convention to put the outcome on the y
(vertical) axis.
In a scatterplot we can increase the dimensionality with things like marker
size, colour, shape etc…but don’t go too far or you will simply overload the
viewer
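A small matplotlib sketch of a scatterplot with extra dimensions encoded as colour and marker size; all variables here are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
explanatory = rng.uniform(0, 10, 100)
outcome = 2 * explanatory + rng.normal(0, 2, 100)  # outcome goes on the y-axis
group = rng.integers(0, 3, 100)                    # a third variable -> colour
weight = rng.uniform(10, 100, 100)                 # a fourth variable -> marker size

plt.scatter(explanatory, outcome, c=group, s=weight, alpha=0.6)
plt.xlabel("explanatory variable")
plt.ylabel("outcome variable")
plt.colorbar(label="group")
plt.show()
```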
Dimensionality reduction
Transforming a high dimensional space to a low dimensional one.
For visualisation
For feature selection
Can help combat the curse of dimensionality
K-means clustering
One of the simplest clustering approaches; its limitations include:
Assumes spherical distributions
Assumes equal cluster sizes
Requires an estimate of k
A hard assignment method – each point belongs to one and only one cluster
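A brief scikit-learn sketch of k-means; the toy blobs and the choice k = 3 are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_clusters is the estimate of k that the algorithm requires up front.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)  # hard assignment: each point gets exactly one cluster
print(km.cluster_centers_)
print(km.inertia_)          # sum of squared Euclidean distances to the nearest centre
```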
Distances
K-means relies on the Euclidean distance
But there are many other ways that we can measure distance
We need to take a quick look at some of these
Where are they useful and where are they problematic?
Scaling vectors
In several cases above we mentioned normalizing degrees of freedom
This is generally VERY important in machine learning
We should ensure that all dimensions are of a similar scale
Otherwise arbitrary choices of unit could show up as important trends in data
We can rescale, standardize or normalize
Rescaling - MinMaxScaler
Subtracts the minimum value in the feature and then divides by the range
MinMaxScaler preserves the shape of the original distribution. It doesn’t
meaningfully change the information embedded in the original data.
Note that MinMaxScaler doesn’t reduce the importance of outliers.
Rescaling – StandardScaler
Standardizes a feature by subtracting the mean and then scaling to unit
variance
Unit variance means dividing all the values by the standard deviation
StandardScaler is the industry’s go-to algorithm
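A quick scikit-learn sketch of both scalers on a made-up feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 10000.0]])  # second column has a very different scale (and an outlier)

# MinMaxScaler: subtract the minimum, divide by the range -> values in [0, 1].
print(MinMaxScaler().fit_transform(X))

# StandardScaler: subtract the mean, divide by the standard deviation -> zero mean, unit variance.
print(StandardScaler().fit_transform(X))
```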
Curse of dimensionality
As the dimensionality of the data increases, the volume of the space grows so
quickly that the data becomes sparse
Problems with statistical significance
Sparsity makes points appear dissimilar (far apart from each other), which is a
problem for any method that relies on distances
Reducing dimensionality
Step 1: Standardise the data and transform all dimensions to zero mean and
unit variance
Step 2: Compute the covariance matrix of the standardised data
Step 3:
Obtain the eigenvectors of the covariance matrix
The eigenvectors of a given covariance matrix provide us with a new set of basis
vectors, and their eigenvalues tell us how much of the variance in the original
data each eigenvector explains
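A NumPy sketch of these steps on random made-up data (scikit-learn's PCA wraps the same idea):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])

# Step 1: standardise to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: compute the covariance matrix of the standardised data.
cov = np.cov(X_std, rowvar=False)

# Step 3: eigen-decompose; eigenvalues say how much variance each eigenvector explains.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
print("explained variance ratio:", eigenvalues[order] / eigenvalues.sum())

# Project onto the top 2 principal components (dimensionality reduction 3 -> 2).
X_reduced = X_std @ eigenvectors[:, order[:2]]
```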
Linear Regression & Naive Bayes
Classification
Linear regression
we have some base distribution of data for which we know the labels, but we want
to be able to predict the labels for some data which is not in this set
Terminology:
1. Posterior - the probability that the data belongs to y given the observed values (what we want to calculate)
2. Prior - what is the probability the data belongs to y without any other info
When introducing a new data point, we can look at the one-dimensional
marginal distributions first and then, from there, determine what we will be
calculating
The prior is what we believe about the data before making any measurement
So here, since it is a new data point, we know nothing about its x1 and x2
values - how likely is it to belong to y0 or y1?
If we look at the training data, ½ belongs to y0 and ½ to y1, so it is reasonable
to say that a new point is 50:50 in the absence of any other information
Therefore p(y0)/p(y1) = 0.5/0.5 = 1
We can calculate the log likelihoods for the various terms in the previous
equation.
It turns out that the log likelihood of belonging to the red distribution is ~ -11
and that for the blue distribution is ~ -4. So the blue distribution is more likely
It is interesting to note that the biggest difference was for the x_2 value
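A rough sketch of the same reasoning with assumed Gaussian marginals; the means, standard deviations and the new point are made up, not the numbers behind the ~ -11 and ~ -4 above:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D class-conditional distributions for the two classes y0 and y1.
params = {"y0": {"x1": (0.0, 1.0), "x2": (0.0, 1.0)},
          "y1": {"x1": (2.0, 1.0), "x2": (3.0, 1.0)}}
prior = {"y0": 0.5, "y1": 0.5}   # 50:50 in the absence of any information

x_new = {"x1": 1.5, "x2": 2.5}

for label in ("y0", "y1"):
    # Naive Bayes assumption: features are independent given the class,
    # so the log likelihoods of the marginals simply add up.
    log_like = sum(norm.logpdf(x_new[f], *params[label][f]) for f in x_new)
    log_post = np.log(prior[label]) + log_like   # unnormalised log posterior
    print(label, "log likelihood:", round(log_like, 2), "log posterior:", round(log_post, 2))
```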
Modelling Data
Machine learning
computer systems which are able to learn and adapt without following explicit
instructions
The numerical simulation always follows the same algorithm and always gives
the same outcome (ML updates the model on the basis of the data observed)
ML starts with a core algorithm and some data and then updates parameters
within the core algorithm to best represent the data observed
it is essentially representation + evaluation + optimisation
Supervised ML
Unsupervised ML
Classification
Regression
Features
When machine learning approaches the data, the data will consist of several
features, which are simply input variables for the model
A feature is a measurable quantity of something that is observable
Models need features to learn
examples would include the identification and differentiation of a cat and a
car
Feature engineering involves transforming raw data into features that better
represent the underlying problem. It turns inputs into things that an
algorithm can understand. But sometimes some inputs are not algorithm-ready,
so we need to convert them into something useful. This is where one-hot
encoding comes into play
One-hot encoding is used for classification problems, where the vector length is
the same as the number of categories.
Each element is the probability that the data represents a given class
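A small scikit-learn sketch of one-hot encoding; the cat/car/dog labels are just illustrative:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array([["cat"], ["car"], ["cat"], ["dog"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(labels).toarray()

print(encoder.categories_)  # the category order used for the columns
print(one_hot)              # one vector per sample, length = number of categories
```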
Optimisation/Evaluation
1. Evaluation
an objective function or scoring function which distinguishes good models from
bad ones. It must always represent the goodness of a model in a single
number
Evaluation metrics
Huber loss is quadratic close to the minimum and linear far from the minimum
it is more expensive to calculate, but it overcomes the problems of MSE and MAE
4. Cross entropy
it is used for classification problems and tells us how similar our model
distribution is to the true distribution
penalises all errors, but particularly those predictions which are most inaccurate
5. Hinge loss
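A NumPy sketch of two of these metrics on made-up predictions (binary cross entropy and hinge loss):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])            # true binary labels
p_pred = np.array([0.9, 0.2, 0.6, 0.05])   # predicted probabilities of class 1

# Cross entropy: heavily penalises confident but wrong predictions (e.g. the 0.05 above).
cross_entropy = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Hinge loss: uses labels in {-1, +1} and raw scores/margins rather than probabilities.
y_signed = 2 * y_true - 1
scores = np.array([2.0, -1.5, 0.3, -0.4])  # hypothetical decision-function outputs
hinge = np.mean(np.maximum(0.0, 1.0 - y_signed * scores))

print(cross_entropy, hinge)
```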
Layers of a DNN
it is a multi layer perceptron
Activation functions
1. Linear: simplest form of activation function
2. Sigmoid
3. Tanh
4. ReLU
5. LeakyReLU
Some ReLU gradients can be fragile during training and can die: a weight update
can make the neuron never activate on any data point again
ReLU can therefore result in dead neurons
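A quick NumPy sketch of the listed activation functions; the 0.01 leak coefficient for LeakyReLU is a common but assumed default:

```python
import numpy as np

def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)             # gradient is exactly 0 for x < 0 -> "dead" neurons

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small slope for x < 0 keeps the gradient alive

x = np.linspace(-3, 3, 7)
for f in (linear, sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(x), 3))
```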
Backpropagation
Optimisation
1. First order
Use the gradient of the loss function with respect to the parameters to minimise the loss
Relatively quick, but ignores curvature
2. Second order
Also use curvature (second-derivative) information, which is more accurate but more expensive to compute
Gradient descent – calculate the gradient of the loss of the entire set with
respect to parameters
SGD – the gradient is calculated per sample rather than on the entire batch. This is much
quicker to calculate, but can lead to high variance
Mini-batch SGD – calculate loss gradient on batches of set size which is
essentially best of both worlds
4. Momentum
5. Nesterov
6. Adaptive Methods
7. AdaDelta
Adagrad suffers because the gradients from all previous steps are
accumulated, so the learning rate continuously decays
AdaDelta circumvents this by storing gradients only from n previous steps
8. Adam
Similar to AdaDelta
Add in information about the mean of the momentum of previous steps too
Works very well in most situations
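A small NumPy sketch contrasting plain gradient descent, momentum and an Adam-style update on a toy quadratic loss; the step sizes and decay rates are assumed typical values:

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss L(w) = w**2 (minimum at w = 0)."""
    return 2.0 * w

w_sgd, w_mom, w_adam = 5.0, 5.0, 5.0
velocity, m, v = 0.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 101):
    # Plain gradient descent step.
    w_sgd -= lr * grad(w_sgd)

    # Momentum: accumulate a velocity from past gradients.
    velocity = 0.9 * velocity + grad(w_mom)
    w_mom -= lr * velocity

    # Adam-style: running means of the gradient and its square, with bias correction.
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
    w_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w_sgd, w_mom, w_adam)  # all should end up close to the minimum at 0
```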
Regularisation