Applied Data Analysis
Bence Bogdandy
August 5, 2020
Contents
1 Data Science
1.1 What is Data Science
1.2 Tasks and Challenges
1.3 Application Areas
1.3.1 Engineering
1.3.2 Finance
1.3.3 Research
1.3.4 Other Fields
1.4 Participants
1.4.1 Mathematicians
1.4.2 Computer Scientists
1.4.3 Domain Experts
1.5 Technical Report
3 Descriptive Statistics
3.1 Randomness
3.2 Statistical Data Types
3.2.1 Categorical
3.2.2 Numeric
3.3 Statistical Measurements
3.3.1 Quantile
3.3.2 Variance
3.3.3 Standard Deviation
3.3.4 Covariance
3.3.5 Correlation
3.3.6 Skewness
3.3.7 Kurtosis
3.4 Distributions
3.4.1 Bernoulli Distribution
3.4.2 Uniform Distribution
3.4.3 Binomial Distribution
3.4.4 Gaussian Distribution
3.4.5 Poisson Distribution
3.4.6 Exponential Distribution
4 Data Mining
4.1 Data Structures
4.1.1 Data set
4.2 Data Mining
4.3 Data Mining Process
4.3.1 Problem definition
4.3.2 Data Selection
4.3.3 Data Loading
4.3.4 Principal Component Analysis
4.3.5 Singular Value Decomposition
4.3.6 Feature Engineering
4.3.7 Feature Crosses
5 Machine Learning
5.1 Supervised Learning
5.2 Training and Loss
5.2.1 Train-Test split
5.2.2 Linear Regression
5.2.3 Logistic Regression
5.3 Training Process
5.3.1 Perceptron
5.3.2 Neural Networks
5.4 Model Optimization Methods
5.4.1 Optimization Process
5.4.2 Gradient Descent
5.4.3 Prediction
5.4.4 Overfitting
5.5 Unsupervised Learning
Chapter 1
Data Science
The acquisition of data and information has always been an important part of human civilization. The collection, transformation and manipulation of accumulated data enable the brain to make complex decisions. Although these processes are acquired naturally, the brain is only capable of processing small amounts of data at a time.
Machines have always been used to help humans and increase productivity. Computers were invented in order to solve calculations effectively. Data science is a modern extension of the human thinking process. The latent potential of massive-scale data collection was quickly realized by companies and researchers. Since the beginning of the information age, enormous amounts of data have been accumulated.
With the help of data science, hidden truths and information can be uncovered among the vast array of data.
The process of gathering information is only partly contained in the field of data science. These processes change as technology evolves, and are mostly a by-product of various modern technologies. After the information has been obtained, it is usually formatted and visualized. New knowledge and information can be derived from the newly formatted data set, which can then be used to analyze the data further or to build machine learning models.
The abstract processes described above are made up of descriptions, mathematical formulae and algorithms.
1.3.1 Engineering
Modern production generates large amounts of data besides the products themselves. This surplus of data can be processed in order to extract useful information related to the production. This knowledge is used to optimize the manufacturing process.
Certain attributes of a specific production system can influence the resulting product. These attributes can be used to create recommendations for engineers. Engineers use these recommendations to fine-tune the manufacturing process, in order to create specific variations in the outcome.
A new field of machine learning called generative modelling has recently gained ground as one of the most promising fields of machine learning engineering. Generative models take an input and try to produce a slightly different output. For example, the 3-dimensional data of a given tool or part could be optimized by removing unnecessary structural elements.
1.3.2 Finance
The finance industry was one of the first to start collecting large-scale data sets, which form the backbone of most modern-day financial decisions. Of course, in the finance industry the end goal is always the biggest monetary gain. Therefore, finance has always been one of the biggest supporters, and job providers, of data science. Data is valuable in the financial world; it provides an opportunity to improve one's financial position on the market. Early data scientists were mostly statisticians, or financial mathematicians, who specialized in analyzing and visualizing data or predicting certain outcomes.
Modern banking systems employ a large number of data scientists who are capable of creating complex models for outlier detection and intelligent predictions. Recently, data science applications have become so adept at managing huge amounts of data and providing reliable information that they have taken over a huge number of human responsibilities. This is, of course, a huge boon for the financial world, as models are capable of providing recommendations and information based on data that could not have been analyzed before.
1.3.3 Research
The process of observation is a data collection method in itself. The process of capturing this information has always been an important aspect of science. The methods of saving these data have differed throughout history. A few decades ago, writing down the results of certain experiments was the only adequate way to capture results and natural phenomena.
In modern times, technology is much more adept at capturing large amounts of data seamlessly than human senses and memory. Many modern scientific fields have evolved in the past 10 years to incorporate data processing tools more closely. Modern research often relies on the modelling of huge data sets. In research, the application of the captured data can differ from field to field, but the method of processing hardly changes. In most cases, data is used to either prove or disprove given hypotheses using statistical methods or machine learning. Medical science has particularly benefited from data science applications. Modern machine learning applications can surpass humans in disease detection, and can provide recommendations for medical doctors. These kinds of technologies increase public welfare, and decrease the stress on, and potential for error by, medical practitioners.
1.3.4 Other Fields
Modern consumer technology is capable of feats that researchers of the last century would not have even dreamed of. It is capable of measuring numerous input data in order to streamline the user's experience. These applications can be trained to remember and suggest actions based on past behaviors, mirroring intelligence.
1.4 Participants
As data science is an interdisciplinary field, a solution to a data science problem is rarely produced by a single contributor. Multiple different roles are defined in order to effectively find the solution to a specific problem.
In an industrial environment, there is usually a dedicated research and development team who
performs these tasks. The particular roles and participants of a data science group are discussed
in this section. In this lecture note we will focus on the job of the computer scientist, or data
scientist.
1.4.1 Mathematicians
Mathematicians are utilized in order to analyze the data. This analysis reveals underlying attributes
and details. This newly revealed knowledge can be used in order to create mathematical models
which fit the particular problem.
Determining the correct model can also be the mathematician's role. The model to be used usually requires specific modifications to the data set. Determining these transformations effectively creates processable data, as well as usable models.
1.4.3 Domain Experts
The validity of a model or computer algorithm cannot be checked easily without knowledge of the field it is applied to. Domain experts have the knowledge required to build consistent and intelligent systems in their own fields.
As the created models and systems often require knowledge of certain fields, domain experts often evaluate the inner workings of the algorithms. Without a way to check whether or not an algorithm or system is working correctly, the application cannot be proven to be correct.
Question 1.
What fields of science contributed to the creation of modern data science?
Question 2.
When was modern data science created?
Question 3.
What kind of tasks does a data scientist have?
Question 4.
In which industries is modern data science incorporated into the development process?
Question 5.
How does data science increase the productivity of engineering systems?
Question 6.
Why are mathematicians an essential part of the data science process?
Question 7.
What essential participants and roles does data science have?
Question 8.
Which field of applied science helped a specific part of computer science evolve into data science?
Question 9.
Who is the domain expert in the data science process, and what responsibilities does the job
have?
Question 10.
Why is it important to create a technical report of the data science process?
Question 11.
What do you think are some types of technical reports used in industry, economics or science?
Chapter 2
Data Processing Tools
Most software for analyzing and processing data was already in development before the data science boom of the 2010s. Although these types of software are still relevant today, a great number of different tools have been developed since [10, 22]. These newer tools are designed to be used on very large data sets that older software might not be able to handle.
Nevertheless, there are a number of different software packages which can be used to effectively analyze and process large amounts of data and to build intelligent models.
2.1.2 Limitations
While Excel is a perfectly viable option for tackling easy-to-handle, well-formatted and generally accessible data sets, the limitations of the tool start to show as the complexity increases.
Excel does support the import of CSV data sets, which is usually the format of choice in data science. However, depending on the hardware it runs on, Excel might be unable to load larger data sets, crashing in the process.
Complex data analysis and pre-processing algorithms are supported through third-party libraries, or by programming them in Visual Basic. However, as Excel is unable to visually represent complex and large data sets, these algorithms might prove to be ineffective, slow and confusing.
In conclusion, Excel is a perfect tool for calculating statistical values on smaller data sets; it provides easy access to simple data transformations, and is capable of converting data to different formats.
However, Excel might be ineffective on more complex data sets and data transformation tasks.
2.2.1 Matlab
Matlab was developed as a combination of a proprietary programming language and a development environment. The language contains high-level representations of numerical data types, such as arrays and matrices. Operations between these data types are integrated into the language, letting the user focus on the creation of new applications without reinventing the wheel.
Matlab supports 2- and 3-dimensional data visualization with plotting, charts and heat maps, among other tools. Matlab can be easily extended with optional toolboxes, which contain implementations for a general or specific field of computing. The programming language is weakly typed, and supports multiple programming paradigms such as structured and object-oriented programming.
The most problematic aspect of Matlab is that its programming environment and language don’t
have a free option, severely limiting availability.
2.2.2 R
Much like Matlab, R is a programming language and tool set developed for mathematical analysis and computing first. The difference is that R is open source and is part of the GNU Project. R is specifically aimed at statistical computing.
R contains high-level statistical computing capabilities, including the Data Frame type, which has spread as the data type of choice since its inception. R contains tools for descriptive statistics, linear and non-linear modelling, and basic machine learning capabilities. One of the focuses during R's development was its well-realised plotting capability, which can be used to create figures for scientific use.
2.3 Python
The Python programming language, developed by Guido van Rossum, first appeared in 1991. The language began gaining attention after Python 2.0 was released in 2000.
Python is an interpreted scripting language, which is also capable of running compiled programs, or even compiling them at runtime with extensions. The language was developed with the mentality of being as easy to understand and straightforward as a programming language can possibly be. It supports different programming paradigms, and is therefore called a multi-paradigm programming language.
Structured programming is supported, as well as the creation of classes and other elements of the object-oriented programming paradigm. Other paradigms are partly supported, namely functional, aspect-oriented, and logic programming.
Python programs are designed to be run by the Python virtual machine, which provides garbage collection and dynamic typing, among other features.
The Python language philosophy is called the Zen of Python.
Example 2.3.1. The following code snippet shows the initialization of variables of various types:
[1]: variable1=12
variable2="Hello!"
variable3=False
Python is dynamically typed; therefore no explicit declaration of the variable’s type is needed. In the
following snippet, all three variables are printed onto the standard output:
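The printing cell itself is missing from the extraction; a minimal reconstruction, assuming the variables defined above, could be:

print(variable1, variable2, variable3)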
12 Hello! False
2.4.1 pip
Pip is a general package and dependency management tool for the Python programming language. Packages can be installed into interpreters by using the pip install command in the terminal.
If the required package is found, it is downloaded and installed into the currently active Python interpreter.
It should be noted that multiple different interpreters can be created using the python virtual
environment. When creating these environments, a separate instance of the interpreter is cloned
using a default version of python.
The code snippet found in Listing 2.1 will create a Python 3 virtual environment called env.
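Listing 2.1 itself is not reproduced in this extraction. Assuming the standard built-in venv module, the command would look like the following sketch:

1 python3 -m venv env

Listing 2.1: Creating a Virtual Environment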
A script named activate is usually created in the bin directory of the environment. The interpreter can be
activated by using source activate on this file.
1 source env/bin/activate
2 pip install numpy pandas
On Listing 2.2, the activation of the newly created virtual environment, and the installation of the
Numpy and Pandas packages can be seen.
2.4.2 Anaconda
Anaconda is a combination of a Python interpreter, a package manager, a handful of select packages, and other useful tools. The Anaconda distribution is intended to be used for data science and machine learning.
The package manager in this distribution is called conda. This package manager essentially fills the same role as pip, with the added feature of environment management. The two package managers do not use the same repositories to install packages. Therefore, some packages might be unavailable, or have a different version, in one of the two package managers.
1 conda create --name myenv
2 source myenv/bin/activate
3 conda install numpy pandas
Listing 2.3: Conda Package Manager
The operations in Listing 2.3 accomplish the same goal as running the code snippets in Listing 2.1 and Listing 2.2, using the conda package manager.
The development packages and tools contained in Anaconda will be explored in the following sections.
2.5.1 PyCharm
PyCharm is one of the most recognized development platforms for Python. It is recommended by Anaconda as its platform of choice. PyCharm is a professional platform, capable of state-of-the-art code completion, code generation and other features you might expect from a modern IDE. However, as it contains a lot of different tools and components, it also takes a large overhead in memory and storage space.
PyCharm can provide almost any tool a developer might require for developing comprehensive Python scripts. It provides interactivity through its emulated terminal, and is able to display plots interactively. It is also capable of handling different technologies, such as database tools and web frameworks. Therefore, PyCharm is an excellent tool for developing scripts that are designed to be run on a server.
2.5.2 Jupyter Notebook
Jupyter can be run locally, or hosted on a server which can be accessed from remote locations. Notebook interfaces have been implemented by most major technology companies, such as Google and Microsoft. These companies provide their own servers on which developers can create notebooks and run their applications. For the aforementioned reasons, notebooks have become the preferred development interfaces for creating data science applications.
However, Python scripts are still an important part of the ecosystem, as they are easy to run on remote servers without supervision.
2.6.1 numpy
Numpy [14] is the most essential scientific computing package for Python. Numpy contains multiple implementations of multi-dimensional arrays and matrices. The implementation of these data structures is highly optimized for efficient computation. Numpy also contains high-level mathematical algorithms. The library contains functions for linear algebra and random number generation, among other subjects. Most of the modern data science-focused packages use Numpy as a base for numerical calculation. Numpy provides the basis on which scientific computing exists in Python, and therefore it is essential to learn to use it effectively.
Example 2.6.1. In the following blocks, functionalities of Numpy will be represented through basic
exercises. In the following block, numpy is imported as np. A numpy array is stored in the
variable arr, which is converted from an ordinary python list.
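The import and array-creation cells did not survive the extraction; a minimal reconstruction (the concrete list values are assumptions) could be:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])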
[3]: arr
The following cells present built-in functions of the numpy array data structure. The available functions are much more numerous than the examples shown.
The following blocks show basic matrix creation and basic built-in functionality.
[7]: matrix=np.array([
[1,2,3],
[5,4,6],
[9,8,7]
])
[8]: matrix
[10]: matrix.T
[10]: array([[1, 5, 9],
             [2, 4, 8],
             [3, 6, 7]])
[12]: [Link]()
2.6.2 matplotlib
MatPlotLib [19] is the essential plotting library for Python. Much like Numpy, other packages build on the functionality this library provides.
MatPlotLib is easy to use and configure, and it can be parametrized for highly customized plots. It is capable of producing animated and interactive plots as well.
MatPlotLib was designed to be used for publications, and therefore contains implementations of regularly used plotting and visualization methods.
A very important part of the matplotlib package is the collection of methods called pyplot. Pyplot enables easy function plotting, mirroring the functionality found in Matlab.
Example 2.6.2. In these examples, we’ll take a quick look at some basic functionality of the
matplotlib library.
We'll start by importing both numpy and matplotlib. As mentioned in the previous section, numpy provides much of the basic functionality of higher-level scientific frameworks and libraries.
The command x=np.linspace(1, 10, num=20) generates a linear vector between 1 and 10, containing 20 elements. The 20 elements are evenly spaced in the linear space.
The following commands create a line plot, where the x axis represents the indexes and the y axis represents the values of the contained elements. The library automatically connects the elements and creates an easy-to-read line plot.
[3]: plt.plot(x)
plt.xlabel('number indexes')
plt.ylabel('number values')
plt.show()
The following two plots achieve the same result, except that the plotted functions are logarithmic and exponential. np.log(x) and np.exp(x) are used to calculate the logarithmic and exponential value of each element, which are then plotted.
[4]: plt.plot(np.log(x))
plt.xlabel('number indexes')
plt.ylabel('logarithmic values')
plt.show()
[5]: plt.plot(np.exp(x))
plt.xlabel('number indexes')
plt.ylabel('exponential values')
plt.show()
2.6.3 pandas
While Numpy focuses on algebraic calculations, Pandas sets its focus on statistical data structures and operations.
Pandas [12] uses Numpy as a base to create the Series and DataFrame data structures. A Series corresponds to a series of data in statistical analysis. The elements contain values of a given type and are indexed along an axis. Numerous statistical operations can be called on a series easily, such as filtering, mathematical operations, and sorting.
A DataFrame represents multiple Series indexed along the same axis. This data structure represents individual measurements that correspond to each other in some way.
The resulting data structure is a two-dimensional, tabular data set on which complex analysis can be run. DataFrame comes with intuitive and easy-to-use operations which produce descriptive statistics, aggregate data or transform the data in some way.
Tabular data, such as Excel or CSV files, can be imported easily using built-in methods for DataFrame conversion. Data can also be exported back to Excel, CSV and other formats such as SQL or JSON.
Example 2.6.3. The following examples show the basic functionality and strengths of the pandas package.
After importing the required packages, a Series is created using the pd.Series() constructor, with a list as a parameter. As you can see, the values are indexed starting from 0. The numpy data type of the series can be seen at the bottom of the output.
[2]: 0 1
1 7
2 4
3 3
4 2
dtype: int64
In the following code blocks, a DataFrame is constructed by using the same matrix from the
numpy examples. Pandas constructs the DataFrame with three columns and three rows. The
names of the columns can be set by passing a list of column names onto the columns parameter.
[3]: df=pd.DataFrame(np.array([
[1,2,3],
[5,4,6],
[9,8,7]
]), columns=["First", "Second", "Third"])
df
[4]: df.dtypes
The DataFrame contains three Series called First, Second, and Third, which are all of type int32. A DataFrame can contain Series of different types, as expected from tabular data.
The following commands are used most often to examine a given data frame. The df.head() function returns the first few rows of data, which prevents pandas from filling up the screen with data. df.columns and df.index can be used to examine the columns and the row indexes of the data frame.
[5]: df.head()
[5]:    First  Second  Third
     0      1       2      3
     1      5       4      6
     2      9       8      7
[6]: df.columns
[7]: df.index
DataFrames and Series can be converted to numpy data easily, by using the built-in to_numpy()
function.
[8]: df_np=df.to_numpy()
print(df_np)
[[1 2 3]
[5 4 6]
[9 8 7]]
df.T returns the transpose of the tabular data, which swaps the rows for columns in the data
frame.
[9]: df.T
[9]: 0 1 2
First 1 5 9
Second 2 4 8
Third 3 6 7
[10]: df.sort_values(by='Third')
Data manipulation operations can be used easily in Pandas. You can either refer to columns by name, or to rows and columns by position using the df.iloc[] method. Regular Python slicing is also present in Pandas, therefore df[0:2] will return the first two rows of the DataFrame. df["Third"] will return the Series by the name of Third.
[11]: df["Third"]
[11]: 0 3
1 6
2 7
Name: Third, dtype: int32
[12]: df[0:2]
Data can be filtered using simple logical conditions. For example, df[df['First'] > 2] will only return the rows where the value of the First Series is higher than 2.
Data frame rows and columns can be extended easily with their respective functions. The append() function adds a new row to the existing data frame. The expression df['Fourth'] = [5, 8, 2] will add a new column called Fourth. A short sketch of both operations is shown below.
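Neither operation is shown as a code cell in the extraction; a minimal sketch with assumed row values (note that DataFrame.append has since been deprecated in newer pandas releases):

df['Fourth'] = [5, 8, 2]                                          # add a new column called Fourth
new_row = {'First': 0, 'Second': 0, 'Third': 0, 'Fourth': 0}
df = df.append(new_row, ignore_index=True)                        # append a new row; returns a new data frame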
Different mathematical operations can be executed easily on data frames by using simple arithmetic. df - 4 will subtract 4 from every element of the data frame.
[15]: df - 4
The following block transforms the data frame by calculating the sine value of each respective element.
[16]: np.sin(df)
As mentioned before, Pandas is used mainly for its implementation of the tabular data frame and its statistical capabilities.
The following code blocks contain examples of easy to use methods for statistical descriptions
of the data.
[17]: [Link]()
[18]: [Link]()
[19]: [Link]()
The pd.read_csv('input') function can be used to read and convert CSV data into a pandas DataFrame. The input can be a local file, or a link to raw CSV data on the web.
[21]: iris_df=pd.read_csv('[Link]6f9306ad21398ea43cba4f7d537619d0e07d5ae3/[Link]')
[22]: iris_df
2.6.4 seaborn
Seaborn uses both Pandas and MatPlotLib in order to visualize statistical metrics of data. It focuses on providing nice-looking and highly informative statistical visualizations. Seaborn provides error plots, categorical plots, and many other data set visualization methods. The plotting functions can be called on data frames, with high customisability. The library is capable of automatically inferring statistical information onto plots in order to produce informative visualizations.
Example 2.6.4. In this example, we'll demonstrate a simple use-case for the seaborn package. As mentioned above, the package contains functions specifically developed for statistical visualization. We'll use a pair plot to demonstrate the different varieties of parameters for each class.
[3]: iris_df=pd.read_csv('[Link]6f9306ad21398ea43cba4f7d537619d0e07d5ae3/[Link]')
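The pair plot call itself is not visible in the extraction; a minimal sketch, assuming the class column of the data set is named variety (as in the later scikit-learn example) and that seaborn is imported as sns:

import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(iris_df, hue='variety')   # one scatter plot per pair of features, colored by class
plt.show()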
2.6.5 scikit-learn
Scikit-Learn [16] provides a wide range of tools for predictive data analysis and machine learning
tasks. Scikit was built on the foundations of Numpy, MatPlotLib, and Pandas.
Scikit-Learn is a collection of algorithms that are essential in data science. These algorithms can be
categorized into 6 groups:
1. Pre-processing
2. Dimensionality Reduction
3. Model Selection
4. Classification
5. Regression
6. Clustering
Each category contains numerous implementations to their respective problems.
Classification and Regression algorithms contain numerous solutions to supervised learning tasks. These include an easy-to-use and understandable implementation of an Artificial Neural Network. While this model can be parametrized with numerous different options, there is currently no option to construct deep neural networks.
Clustering algorithms can be used to determine, and produce different classes for non-categorized
datasets. This is an unsupervised learning task, and therefore there is no explicit target output.
Model selection can be used to determine which model and parameters to use for a given task. Model selection algorithms include heuristic search algorithms which aim to find well-performing models in a hyper-parameter space.
Pre-processing and dimensionality reduction algorithms might be the most widely used categories of algorithms in the library. They provide easy-to-use functions which can transform data in order to be more easily processed by machine learning models. These algorithms include various scaling methods, among others, which can be essential for machine learning algorithms. Although deep learning algorithms require these functionalities, deep learning libraries mostly do not provide them themselves. Therefore, a lot of different deep learning frameworks use scikit-learn for its pre-processing and dimensionality reduction tools.
Example 2.6.5. In the examples below, the previously read IRIS data set csv file will be processed.
Multiple different data transformation and machine learning techniques will be used in this small
example, which shows the compactness and ease of use of the library.
[2]: iris_df=pd.read_csv('[Link]6f9306ad21398ea43cba4f7d537619d0e07d5ae3/[Link]')
[3]: iris_df.head()
The following blocks separate the training data from the target data, and display their heads, as described in the previous pandas examples.
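The separation cells themselves are not reproduced in the extraction; a minimal sketch, assuming the label column is named variety (as the output below shows) and using the variable names x and y that appear in the later cells:

x = iris_df.drop('variety', axis=1)   # training data: the measured features
y = iris_df[['variety']]              # target data: the class labels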
[5]: x.head()
[6]: y.head()
[6]: variety
0 Setosa
1 Setosa
2 Setosa
3 Setosa
4 Setosa
The following code block creates a correlation matrix of the training data. The pandas corr() function can be used to calculate the correlation between elements. Seaborn is used to create the representative figure of the correlation matrix.
[7]: plt.figure(figsize=(12,10))
cor = x.corr()
sns.heatmap(cor, annot=True, cmap=[Link])
plt.show()
The following block uses Scikit's StandardScaler() class to scale the data by removing the mean and scaling to unit variance. This is used in order to create a unified scale which better represents individual values.
[8]: x_scaled=StandardScaler().fit_transform(x)
Principal Component Analysis is used to project the data onto a two-dimensional plane. pc1 and pc2 represent the two principal components of the original data set.
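The PCA cell itself is not reproduced; a minimal sketch using scikit-learn's PCA class (the variable names used here are assumptions):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)                     # keep the two strongest components
components = pca.fit_transform(x_scaled)      # project the scaled data
pca_df = pd.DataFrame(components, columns=['pc1', 'pc2'])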
[10]: [Link]()
2 -2.364229 -0.341908
3 -2.299384 -0.597395
4 -2.389842 0.646835
A LabelEncoder is used in order to encode the original labels of the dataset. The LabelEncoder
assigns a unique integer to every class of the original data. The y_num will contain integer
values of the classes in a list.
[11]: le = LabelEncoder()
le.fit(y)
le.classes_
y_num=le.transform(y)
The commands in the following block are used to draw the results of the PCA values on a two
dimensional plane. The points are colored by their respective classes.
A decision tree is constructed using Scikit's DecisionTreeClassifier() class. The scaled values and the numeric labels are used for training in order to create the structure within the decision tree. The plot_tree() function is then used to automatically create a visual representation of the trained decision tree. A sketch of these steps is shown below.
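The corresponding cells are not reproduced here; a minimal sketch of the described steps using scikit-learn's tree module:

from sklearn.tree import DecisionTreeClassifier, plot_tree

clf = DecisionTreeClassifier()
clf.fit(x_scaled, y_num)        # train on the scaled features and the numeric labels
plt.figure(figsize=(12, 10))
plot_tree(clf)                  # draw the trained decision tree
plt.show()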
2.6.6 tensorflow
Tensorflow [1] is a state-of-the-art framework used in modern machine learning and deep learning applications. Tensorflow contains a low-level implementation of the tensor data type, which is a high-dimensional algebraic data type. A tensor denotes algebraic relationships between objects in a vector space. Tensors are general descriptions of the everyday data types used by the computer.
1. A 0 dimensional tensor is a scalar, holding only a single value.
2. A 1 dimensional tensor is called a vector, which holds an array of values on a 1 dimen
sional axis.
3. A 2 dimensional tensor is called a matrix, which holds values on a 2 dimensional axes.
Any tensor beyond the second dimension is usually called an "N-dimensional tensor", where N is the number of dimensions. Tensors are used heavily in graphics computing and data science. Numpy is an apt framework for linear computation, but lacks the nuances of higher-dimensional structure computing. For this reason, tensorflow was designed to be used in places where algebraic computing of high-dimensional structures is required.
Specifically, tensorflow was created in order to advance the state of neural network computing in Python, and other languages.
Example 2.6.6. In the following blocks, examples of tensors with different dimensions, and of different operations on them, will be shown.
In order to create constants of various types, tensorflow requires the tf.constant() constructor to be used. Unlike in the previous frameworks, it is essential to explicitly state what the data type is.
Variables can also be created, using tf.Variable(). The difference between variables and constants is that constants cannot be assigned a new value after instantiation, while variables can be given a new value using assign().
In the following blocks, tensors of different shapes can be examined.
tf.Tensor(
[[1. 2. 3.]
[5. 4. 6.]
[9. 8. 7.]], shape=(3, 3), dtype=float16)
print(dim_3)
tf.Tensor(
[[[ 0 1 2 3 4]
[ 5 6 7 8 9]]
[[10 11 12 13 14]
[15 16 17 18 19]]
[[20 21 22 23 24]
[25 26 27 28 29]]], shape=(3, 2, 5), dtype=int32)
To cast a tensor into a numpy array, you can use the .numpy() function of the tensor.
[6]: dim_2.numpy()
Tensors can be added easily with the built-in operators of tensorflow. These implementations provide highly optimized, parallel versions of common mathematical operations and algorithms.
Mathematical operations can also be accessed with common mathematical notation.
[7]: a = tf.constant([[1,2,3],
[5,4,6],
[9,8,7]],
dtype=tf.float16)
b = tf.ones([3,3], dtype=tf.float16)
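The print statements that produced the outputs below are not visible in the extraction; a sketch of operations that match them, first with the TensorFlow functions and then with operator notation:

print(tf.add(a, b))        # element-wise addition
print(tf.multiply(a, b))   # element-wise multiplication
print(tf.matmul(a, b))     # matrix multiplication

print(a + b)   # same as tf.add
print(a * b)   # same as tf.multiply
print(a @ b)   # same as tf.matmul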
tf.Tensor(
[[ 2. 3. 4.]
[ 6. 5. 7.]
[10. 9. 8.]], shape=(3, 3), dtype=float16)
tf.Tensor(
[[1. 2. 3.]
[5. 4. 6.]
[9. 8. 7.]], shape=(3, 3), dtype=float16)
tf.Tensor(
[[ 6. 6. 6.]
[15. 15. 15.]
[24. 24. 24.]], shape=(3, 3), dtype=float16)
tf.Tensor(
[[ 2. 3. 4.]
[ 6. 5. 7.]
[10. 9. 8.]], shape=(3, 3), dtype=float16)
tf.Tensor(
[[1. 2. 3.]
[5. 4. 6.]
[9. 8. 7.]], shape=(3, 3), dtype=float16)
tf.Tensor(
[[ 6. 6. 6.]
[15. 15. 15.]
[24. 24. 24.]], shape=(3, 3), dtype=float16)
The following block contains various operations executed on a matrix. The reduce functions can be used to calculate their respective mathematical values, which is the maximum of the matrix in this example. reduce_sum() can be used to add elements of the matrix along different axes. The tf.argmax() function returns the index of the largest element in each column. Different arg functions can be used to search for elements and view the data along different axes.
Tensorflow's tf.nn module contains functions and algorithms specifically used for neural networks. The softmax() function calculates the softmax of the input data. This function is used regularly in deep learning.
[9]: print(tf.reduce_max(a))
print(tf.argmax(a))
print(tf.nn.softmax(a))
2.6.7 Keras
Keras [5] contains tools for implementing state-of-the-art deep learning systems. Keras is a general
implementation of modern deep learning tools, which can use multiple back-end frameworks.
Keras can use the following backends:
1. Tensorflow,
2. Theano, a symbolic tensor framework,
3. CNTK, an open-source deep learning toolkit developed by Microsoft.
Keras remains consistent after changing the backend of the framework.
Keras provides an easy-to-use, but highly flexible implementation of modern deep learning
algorithms. However, it does not contain general machine learning or artificial intelligence
algorithms that might be required during the building of a deep learning model. Therefore, it is
advised to use the previously mentioned frameworks in conjunction in order to create a robust,
and flexible system.
Keras contains different implementations, and adheres to different coding styles for different levels
of complexity.
In this lecture note, we will review the Keras Sequential, and Layer API.
Example 2.6.7. The following example will run through a basic example of a simple neural
network learning the MNIST handwritten digit data.
Tensorflow contains toy datasets to test and measure the performance of machine learning models. The MNIST dataset can be loaded with tf.keras.datasets.mnist.load_data().
The data is split into training and testing data, with their respective target values (digits). Elements are divided by 255 because the data originally consists of monochrome pixel values. This transformation brings the data into a range which is easier for neural networks to learn.
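The loading cell is not reproduced in the extraction; a minimal sketch of the described steps:

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# scale the monochrome pixel values from [0, 255] into [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0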
One of the easiest ways to construct neural network models is to use the Sequential API. Different layers can be concatenated by passing a list of layers into the sequential model, as sketched after the following list.
1. The Flatten layer creates a 1 dimensional vector from an input tensor of any higher dimension by appending its rows into a 1D tensor.
2. The Dense layer is the most basic representation of a fully connected feed-forward neural network layer. The first integer parameter denotes the number of neurons that particular layer consists of.
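A minimal sketch of such a model for the MNIST digits (the layer sizes are assumptions; the original cell is not shown):

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # 28x28 image -> 784-element vector
    tf.keras.layers.Dense(128, activation='relu'),   # fully connected hidden layer
    tf.keras.layers.Dense(10)                        # one output per digit class
])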
The following code block calculates the probabilities of what the neural network thinks the first number in the data set is. A higher number represents a higher probability. As can be clearly seen, the neural network cannot make informed decisions before training.
After compiling the model with a neural network optimizer and a loss function, the neural network can be trained using the fit() function. The number of epochs can be set in order to determine how many iterations the model trains for. After training, history data is returned, which can be used to visualize the training results.
[5]: model.compile(optimizer='sgd',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # the exact loss is not recoverable from the extraction; a sparse categorical cross-entropy is assumed
metrics=['accuracy'])
Epoch 1/5
1875/1875 [==============================] - 1s 771us/step - loss: 1.3349 -
accuracy: 0.6718
Epoch 2/5
1875/1875 [==============================] - 2s 821us/step - loss: 0.5247 -
accuracy: 0.8602
Epoch 3/5
1875/1875 [==============================] - 1s 769us/step - loss: 0.3999 -
accuracy: 0.8860
Epoch 4/5
1875/1875 [==============================] - 1s 790us/step - loss: 0.3561 -
accuracy: 0.8976
Epoch 5/5
1875/1875 [==============================] - 1s 766us/step - loss: 0.3318 -
accuracy: 0.9040
The evaluate function can be used to calculate the loss and accuracy on the testing data. Separate sets of data are required in order to test the generalization ability of the neural network. If the trained model is capable of performing well on unknown data, then it will yield a higher accuracy.
The following code blocks will visualize the achieved accuracy and loss through the training after
every epoch.
[9]: plt.plot(history.history['accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()
[10]: plt.plot(history.history['loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
The methods, and algorithms in the examples will be partly discussed in the following chapters.
Question 12.
Why is the choice of tools important in the data science process?
Question 13.
What are the strengths of Microsoft Excel, and what kind of problems would you solve with it?
Question 14.
What are the weaknesses of Microsoft Excel?
Question 15.
Which programming language can be used to create complex data processing routines for Excel?
Question 16.
Which field of science was Matlab and R designed for?
Question 17.
What are the main benefits of using Matlab instead of Excel?
Question 18.
Which software is more fitting for complex data analysis tasks?
Question 19.
What does Matlab provide in order to make complex data analysis tasks easier?
Question 20.
What are the main differences between Matlab and R?
Question 21.
What is the main focus of the R programming language?
Question 22.
What are the benefits of R in terms of accessibility?
Question 23.
What was one of the focuses of R's development that can provide a benefit for scientific developers?
Question 24.
Is Python a compiled or an interpreted language?
Question 25.
What kind of programming paradigms are supported by python?
Question 26.
What is the name of the environment that runs python programs?
Question 27.
What kind of typing does Python use?
Question 28.
What are the Python package managers called?
Question 29.
What are the main differences between the pip and the anaconda environments?
Question 30.
In what task does the PyCharm development platform perform better than Jupyter Notebook?
Question 31.
What is the main benefit of writing scripts instead of notebooks?
Question 32.
What development style does the Jupyter Notebook implement?
Question 33.
How does Jupyter support readability in its notebooks?
Question 34.
What are the benefits of hosting jupyter notebook servers on computing clusters?
Question 35.
Why do you think most data science developers choose python instead of the competition?
Question 36.
Why are packages important for the python data science workflow?
Question 37.
What are the two packages that can be used to replace Matlab and R languages?
Question 38.
Which Python package provides the mathematical basis for other packages and scientific computing in general?
Question 39.
Which package can be used for professional, scientific figure generation in python?
Question 40.
Which package serves as a general machine learning framework that can be used by other packages as well?
Question 41.
What is the difference between Scikit-Learn and the Keras package? Which one would be better for a deep learning task?
Chapter 3
Descriptive Statistics
Mathematical statistics is one of the most prominent and heavily used applied fields of mathematics. Statistics applies probability theory in order to process and filter data and to predict outcomes.
3.1 Randomness
Descriptive statistics is one of the sub-fields of statistics. This sub-field is dedicated to describing and summarizing the properties and attributes of data sets. Descriptive statistics is one of the fields of science that data science heavily incorporates. Most methods of descriptive statistics are implemented as parts of computer software or framework methods. In this chapter, various Python scientific frameworks will be used to implement and showcase these algorithms.
Statistical Models
A statistical model can be described as a random experiment that is contained within the sample space Ω. The sample space Ω contains all possible outcomes of the experiment.
A random variable can be explained as a mapping, or a function, which does not have an exact outcome. Instead, random variables map the input onto a set of possible occurrences that might happen with that input. When used in conjunction with a probability P(), they describe what the actual probability of that outcome is. In short, the actual outcome of the random variable depends on randomness.
The random variable X can also be described with the random vector of occurrences X = (X1, X2, ..., XN). The random variable X maps an outcome ω of the sample space Ω onto a numeric space. The randomness of the variable is actually the random nature of the experiment. At the moment the experiment is performed, the variable's value is decided. The real value of this random experiment's outcome is the mapping itself.
The following formula describes the nature of random variables:
X : Ω → E, where Ω is a set of possible outcomes, and E is a measurable space
The formula describes that the random variable takes all possible outcomes in the set Ω, and maps them onto a measurable scale.
The probability of X taking a value in a measurable set S ⊆ E can be written as

P(X ∈ S) = P({ω ∈ Ω | X(ω) ∈ S})

A special case is the cumulative distribution function F_X(x) = P(X ≤ x), where the right side represents the probability that the random variable takes on a value that is less than or equal to x. The probability of X falling into a semi-closed interval [i, j), where i < j, can then be expressed as a difference of such values.
Example 3.1.1. For example, take a random variable X that takes on discrete values between 0 and 5 with different probabilities. The cumulative distribution function of X could be written as the following:
F_X(x) =
  0     if x ≤ 0
  1/5   if 0 < x ≤ 1
  2/5   if 1 < x ≤ 2
  4/5   if 2 < x ≤ 3
  9/10  if 3 < x ≤ 4
  1     if 4 < x ≤ 5
This cumulative distribution describes how the input might appear in certain parts of our random variable's distribution. Do not forget that the probability associated with an input represents the probability that our random variable will be less than or equal to that input. For example, P(1 < X ≤ 3) = F_X(3) − F_X(1) = 4/5 − 2/5 = 2/5.
3.2 Statistical Data Types
Statistical data types are constructed with the intent to create statistical measurements of certain attributes of the data. For this reason, not every data type fits the prerequisites constructed by this definition. Each data type is constructed with specific measurements and specialised attributes in mind. Therefore, it is essential that the reader understands the meaning and background of each thoroughly.
Four different levels of measurement will be defined. Levels of measurement are incremental, which means that a particular level contains all the attributes of the previous groups.
3.2.1 Categorical
The first two levels of measurement of statistical variables form a group named categorical variables. Categorical variables contain data values which do not have any measurable difference between values. Each value contains a distinct piece of information, without any meaningful way of comparing them.
Nominal
Nominal variables contain nothing but a label, which describes a class of values. Although nominal variables can be used for classification tasks, they hold no numerical value.
Numerical calculations cannot be done on nominal variables, and therefore using them for classification is their only significance.
Example 3.2.1. The following example shows possible answers to a survey question (for instance, about a favorite color) that belong in the nominal level of measurement:
◦ White
◦ Black
◦ Blue
◦ Green
◦ Something else
Whatever the answer may be, it holds only relative and subjective information. The only possible calculation one might do with these answers is to count them up in order to determine the number of answers given to a specific option. This can be used to calculate the popularity of certain colors, and can be used for visualization in bar or pie charts.
Ordinal
As stated before, the next level of measurement, called Ordinal, holds all the qualities of the previous level. Ordinal variables are nominal, meaning they hold named values, but each value can also be clearly ordered. The ordering does not capture the difference between values, only the order in which they are represented.
The true zero of a set of ordinal elements is unknown, and therefore scale is non-existent. As the difference between elements is unknown, most numerical calculations are void as well.
Example 3.2.2. The following answers, describing the level of satisfaction after reading a book, belong in the ordinal level of measurement:
1. Very Satisfied
2. Satisfied
3. Neutral
4. Unsatisfied
5. Very Unsatisfied
Ordinal variables, as stated before, contain a clear ordering in their set of values. The order in this case can be interpreted as the level of satisfaction the reader feels after reading a book, ranging from Very Unsatisfied to Very Satisfied. The answers are incremental, but the difference between each level of satisfaction is unknown.
3.2.2 Numeric
The two remaining levels of measurement are categorised as numeric. They contain all the attributes of the previous levels, and they hold quantitative information which can be used during calculations.
The most common mathematical data types are the ones describing a number. A number is, by definition, a mathematical object which can be used to count, measure or label. Any written symbol that describes a number is called a numeral. Over the history of mathematics, different classifications of numbers have been created. Integers and real numbers are two of the most important number classes for data science.
Discrete (Integer)
An integer is described as the combined set of the natural numbers and their negative counterparts. 0 is the only natural number which does not have a negative counterpart, and therefore it can be viewed as the middle element of the integer set. The resulting class of numbers is called the integers, and is denoted by Z.
In data science, integers are used to describe data with distinct values. This consequently means that the values of a discrete data column can only contain easily distinguishable values. Discrete numbers are often used for classification, as the values create a distinct set of possible values.
Continuous (Real)
Continuous variables are described as containing real numbers, and often represent measurements. The set of real numbers is made up of the rational numbers (which include the integers) and the irrational numbers. A real number represents a value, or point, on a number line. Real numbers are denoted by R.
Real numbers often represent real-life measurements which do not take discrete values, such as stock prices or data on drug effectiveness.
Regression problems consist of discovering correlation and connection between attributes in
order to predict outcomes with unknown attribute values.
Interval
Interval data represents values which have an equal difference between each ordered element. The difference between two consecutive elements is equal and standard. The problem with interval data is that the zero element of an interval data set is not trivial.
For this reason, only 2 of the 4 basic mathematical operations are allowed, namely addition and subtraction. Multiplication and division cannot be used in the case of interval data, as shown in the example below. Interval data can be measured on a scale, which can describe the quantitative difference between two elements in the interval value set.
Example 3.2.3. The classic example of interval data is temperatures.
What is the current temperature?
1. -20
2. -10
3. 0
4. 10
5. 20
6. 40
In this case, the question itself is ambiguous. The unit of measurement is not clear, which creates an interesting problem: the measurements could be either in Fahrenheit or in Celsius.
Using different units of measurement yields drastically different temperatures. For example, 0 °C equals 32 °F. Doubling 32 °F gives 64 °F, but doubling the same temperature expressed in Celsius still gives 0 °C, which clearly does not describe the same physical temperature. This shows the nature of interval variables, and the non-existence of an absolute zero value.
Ratio
The difference between ratio data and interval data is that, unlike interval data, ratio data always has a true zero value. Ratio values have the properties of interval values which were mentioned above. Distances between values are also equal in this case. All four basic operations can be used without problems during statistical analysis or transformation.
In ratio data, zero is treated as an absolute zero value, therefore it can’t contain negative values.
Example 3.2.4. Following the last example, one unit of measurement that is considered a ratio is the
Kelvin temperature scale.
What is the current temperature?
1. 253K
2. 263K
3. 273K
4. 283K
5. 293K
6. 313K
In this example 0 K represents the absolute zero value of temperature. At 0 K, molecules stop moving and reach a state of minimal entropy. 2 × 253 K yields a temperature that is twice as hot, which is an attribute of ratio variables. To multiply a certain temperature given in Fahrenheit or Celsius, it has to be converted to Kelvin, multiplied, and converted back to its respective unit.
Example 3.2.5. Another example of ratio measurements are height, or weight measurements.
What is your height?
1. 150 cm
2. 160 cm
3. 170 cm
4. 180 cm
5. 190 cm
6. 200 cm
In this example, 0 cm represents a size of absolute zero. Height can be represented by other measurement units, but the attributes of ratio variables are not violated. Height can be multiplied or divided in order to get multiples of some height: 150 cm × 3 = 450 cm.
3.3 Statistical Measurements
The Mode of a data set is the value at which the distribution has a peak in its probabilities. Bimodal and multimodal distributions also exist, which have 2 or more maximum probabilities at different parameter values. The mode is unique among statistical measures, as it can be used even if the values are not numerical. The mode of categorical values can be a valuable statistic.
The Median of the data set describes the element which is in the value-wise middle of the data set. If the data set is not ordered, it has to be ordered before calculating the median. The median is a very useful statistical value for representing centrality, as it represents the value at the middle of the data set after ordering. The median remains largely unaffected by outliers, and therefore can be a good indicator when searching for such values.
3.3.1 Quantile
The median can be described as the 50% quantile of the data set. Quantiles describe data points at a given position after ordering. The minimum is the 0% quantile, as it is the data point right at the start of the data set after ordering, which always contains the smallest value. The maximum can also be called the 100% quantile, for the same reason. The 25% and 75% quantiles, which together with the 50% quantile are called quartiles, are usually calculated along with the others during data analysis. Quantiles are a good indicator of how values span across the data set. If the data is largely consistent while one value stands out from the rest, that can be a good indication of an outlier or faulty data.
The range between the 25% and 75% quantiles is often called the interquartile range. This range measures where the majority of the values in the data set lie.
3.3.2 Variance
Variance is a statistical measure which can be used to tell how spread out the values are in the data
set. Variance uses the mean and the data values during calculation. The following formula contains
the calculation for the variance:
σ² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²
where x_i is the value of the i-th data point, x̄ is the mean of the data points, and n is the number of data points.
Variance measures the variability of the data set. Variability describes how far the data points spread from the mean. The variance is expressed in squared units, as the difference between each data point and the mean is squared. This results in a different scale than the original data set.
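As a quick illustration, the variance of a small data set can be computed directly with numpy (a sketch; the data values are arbitrary):

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mean = data.mean()
variance = ((data - mean) ** 2).sum() / len(data)   # population variance, as in the formula above
print(variance)        # 4.0
print(np.var(data))    # the same value computed by numpy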
3.3.4 Covariance
The previous statistical measurements have only covered univariate measurements. The number of parameters describes the type of statistical measurement or distribution. A univariate measurement only takes one variable and outputs some statistical information, while bivariate measurements take two variables. Two of the most important bivariate measurements are covariance and correlation.
In order to understand what correlation is, it is easier to start with covariance, as the former depends on the latter. Covariance can be used to describe what the relationship between two variables is. It is either negative or positive, signifying a positive or negative dependency between the variables.
The covariance of two variables can be calculated with the following formula:
COV(X, Y) = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)
where X and Y are the random variables, X̄ and Ȳ are the means of the variables, and n is the number of elements in the data set. The naming similarity between variance and covariance is not a coincidence. The formula for the two values is practically identical, with two variables as opposed to one. Variance calculates the variability of the data points, while covariance calculates the variability between two random variables. Note, however, that the value of the covariance is not a good indicator of how strong the connection is, only of whether it is positive or negative.
3.3.5 Correlation
Correlation describes the relationship between two random variables, or how alike their distributions are. It can be used to describe association between elements, and can be used to express a variable's behaviour in contrast with another.
Understanding the correlation between variables can lead to a better understanding of the data. If two variables have a high correlation, it means that they depend on each other, while a low correlation can mean that the two variables are independent. Correlation can be used to uncover unforeseen, or seemingly unseen, relationships between different variables.
Correlation takes the formula of covariance, and divides it by the two variables' standard deviations.
CORR(X, Y) = COV(X, Y) / (σ_X σ_Y)
where σ_X and σ_Y are the standard deviations of the random variables X and Y.
The value of the correlation lies in the range [−1, 1]. A correlation value of 0 means that there is no correlation between the variables; the two variables are completely independent from one another. The value of 1 describes two random variables that are completely dependent on one another, while −1 describes a relationship that is inversely dependent. An inverse relationship means that the two variables are correlated and dependent on each other, but as the value of one variable increases, the other decreases. Therefore, the absolute value of the correlation can be thought of as the magnitude of dependency, while the sign is the direction.
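Both measures are easy to compute with numpy (a sketch with arbitrary values):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # perfectly, positively dependent on x
print(np.cov(x, y)[0, 1])                  # sample covariance between x and y
print(np.corrcoef(x, y)[0, 1])             # correlation, 1.0 for this pair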
3.3.6 Skewness
Skewness is an indicator of how much a given distribution differs from a symmetric distribution. The skewness of a distribution can be measured in order to determine how far the distribution is from being symmetric. This measurement can range from −∞ to +∞. A negative skewness means that the distribution has a longer tail towards the negative (left) side, while a positive skewness means that the tail extends towards the positive (right) side.
3.3.7 Kurtosis
Kurtosis is another important measurement of distributions. The kurtosis of a data set
describes how heavily the values are concentrated around the mean versus in the tails. High kurtosis
means that the tails of the distribution contain more data compared to a standard distribution, and
can be an indicator of outlier data.
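Both measurements are available in SciPy; the following is a small sketch on a randomly generated sample (the numbers themselves are arbitrary):
import numpy as np
from scipy import stats

data = np.random.default_rng(0).normal(size=1000)

print(stats.skew(data))       # close to 0 for a roughly symmetric sample
print(stats.kurtosis(data))   # excess kurtosis, also close to 0 for a normal sample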
3.4 Distributions
A probability distribution can be described as a function which assigns a probability to each
possible subset of outcomes of a random experiment. The inputs do not need to be numerical, as
random variables can be used which map the input onto a numeric space, as described before. In this
lecture, we will only talk about discrete probability distributions.
Distributions are an essential part of the everyday work of a data scientist. Data analysis consists mostly
of applying statistical knowledge to the data in order to closely inspect its attributes. Exploratory
Data Analysis can uncover hidden characteristics of the data set that can be of great help when creating
machine learning models.
Data can have a wide range of statistical attributes that can be wildly different from one another.
However, the set of data we are working with, or plan to work with, is always a sample of the real
world. The distributions in the data represent certain attributes of the features of the data. By
learning about statistical distributions, we enable a better understanding of the data and the
ability to predict it well.
Data sets are usually composed of numerical and categorical data columns. The numerical
columns can be either continuous or discrete. In this lecture note we will discuss discrete
distributions. Discrete random variables can be used to calculate probability mass functions,
while continuous random variables are used to calculate probability density functions. In the
following sections, several frequently used distributions will be explored.
The probability mass function of a Bernoulli random variable X with parameter p assigns:
P_X(x) = p if x = 0; 1 − p if x = 1; 0 otherwise
D_X(x) = p if x = 0; 1 − p if x = 1
Example 3.4.1. The easiest example for the bernoulli distribution is a simple coin toss. In case the coin
is fair the probability distribution of the coin toss is the following:
P_X(x) = 1/2 if x = 0; 1/2 if x = 1
Where 0 is the numeric representation of heads and 1 is tails. Any other probability value in this
distribution would mean that the coin is not fair.
In the following example, a basic Bernoulli distribution plotting method is created in order to
visualize the distribution for different probability pairs.
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
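The plotting method itself is not reproduced above; a minimal sketch of such a helper (the function name and layout are assumptions) could be:
def plot_bernoulli(p):
    # bar plot of the two outcomes: P(X = 0) = p, P(X = 1) = 1 - p
    plt.figure()
    plt.bar([0, 1], [p, 1 - p])
    plt.xticks([0, 1])
    plt.ylabel('probability')
    plt.show()

plot_bernoulli(0.5)   # a fair coin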
The following example shows a loaded coin, which has different probabilities for each of its
faces.
The probability mass function of a fair six-sided die, the classic example of a discrete uniform distribution, is:
P_X(x) = 1/6 for x = 1, 2, 3, 4, 5, 6
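The plot_uniform helper called below is not shown; a sketch consistent with the call (names assumed) might be:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6]
y = [1/6] * 6              # equal probability for every face of the die

def plot_uniform(x, y):
    plt.figure()
    plt.bar(x, y)
    plt.xlabel('outcome')
    plt.ylabel('probability')
    plt.show()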
[4]: plot_uniform(x,y)
The random variable of the binomial distribution describes the number of cases where the individual
experiments yield a true value. The distribution has two independent parameters:
1. Probability of success
2. Number of experiments
If the random variable X follows the binomial distribution with parameters n and p, we write
X ∼ B(n, p). The probability of seeing exactly k successes out of n independent Bernoulli experiments
can be calculated using the probability mass function of the distribution.
f(k, n, p) = P(k; n, p) = P(X = k) = (n choose k) ∗ p^k ∗ (1 − p)^(n−k)
Where
(n choose k) = n! / (k! ∗ (n − k)!)
Example 3.4.3. As with the bernoulli distribution example, we will use coin tosses to model the
binomial distribution. We’ll look at the distribution of a series of fair coin tosses, and a series of unfair
coin tosses.
If you hold 7 trials and want to calculate the probability of exactly 3 of them coming out true, with
the probability of a single trial being true equal to 0.3, the result is:
P(3; 7, 0.3) = (7 choose 3) ∗ 0.3^3 ∗ (1 − 0.3)^(7−3) = 35 ∗ 0.0064827 = 0.2268945
In the following code blocks, the binomial distribution for a certain number of elements and their
probability is plotted. The x axis contains the number of elements k. On the y axis, the associated
probability of k elements actually coming out as true can be read.
The Scipy package is used to calculate the probabilities for a given number of elements and
probability. Scipy provides numerous algorithms in the field of mathematics.
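A hedged sketch of such a plot, using scipy.stats.binom with the numbers from the example above:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

n, p = 7, 0.3
k = np.arange(0, n + 1)

plt.bar(k, binom.pmf(k, n, p))    # probability of exactly k successes out of n trials
plt.xlabel('k')
plt.ylabel('probability')
plt.show()

print(binom.pmf(3, n, p))         # ~0.2269, matching the hand calculation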
f(x) = (1 / (σ√(2π))) ∗ e^(−(1/2) ∗ ((x − µ)/σ)²)
In the formula, µ is the mean of the distribution, and σ is its standard deviation.
The distribution appears so often in nature that it is often used for continuous random variables whose
distributions are unknown. The importance of this distribution can be credited to the central limit
theorem. This theorem states that the mean of many samples of a random variable with finite mean
and variance is a random variable whose distribution converges to a normal distribution as the
number of samples increases. In the real world, the sum of independent processes often follows a
distribution that is a small variation of the normal distribution. The normal distribution is unimodal,
meaning that it only contains a single peak.
If a random variable follows a normal distribution, it is written as X ∼ N(µ, σ²). A normal distribution
is also often called a bell curve, as its shape over a large number of elements resembles a bell.
Examples of Gaussian distributions include school grades, financial distributions, and average
heights, among many more.
An interesting property of the Gaussian distribution is that the mean, mode and median are all
exactly the same value. The distribution is also symmetric, which can be derived from the
previous property. The area under the curve always adds up to one.
Data sets that are not normally distributed can be transformed towards a normal distribution using
transformations such as the logarithmic function or the square root.
Example 3.4.4. In the following code blocks, a function to display normal distributions of various
sizes is created. The mean µ and the standard deviation σ can be passed as parameters in order to
display normal distributions with different parameters.
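A minimal sketch of such a helper, assuming scipy.stats is used for the density:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def plot_normal(mu, sigma):
    x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)
    plt.plot(x, norm.pdf(x, loc=mu, scale=sigma))   # bell curve centred on mu
    plt.xlabel('x')
    plt.ylabel('density')
    plt.show()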
[3]: plot_normal(20, 8)
f(k; λ) = P(X = k) = (λ^k ∗ e^(−λ)) / k!
X is a random variable following a Poisson distribution, which is written as X ∼ Pois(λ). The
λ parameter is the rate of events happening, that is, the average number of events per interval.
λ is a positive real number which is equal to the expected value of the random variable X: λ = E(X).
Example 3.4.5. The probability that a given event happens in n experiments can be
calculated using the previously discussed probability mass function. One famous example for
the Poisson distribution is the number of meteors that hit Earth throughout a given time frame
of years. Meteors can be modelled with this distribution because they are independent, and
the average number of meteors that get close to Earth's atmosphere is relatively constant. One
condition we have to make is that meteor strikes cannot happen simultaneously.
We will say that meteors actually hit Earth once every one hundred years. What is the probability that
a meteorite hits Earth in the next hundred years? Since meteorite hits happen every 100 years, the
rate value λ is 1.
P(k = 1 in the next 100 years) = (1¹ ∗ e^(−1)) / 1! = 0.36787944117
Therefore, the probability that one meteor will crash into Earth is roughly 37%.
The following code blocks contain the use of Scipy's stats module to generate the Poisson distribution.
The distribution is plotted with matplotlib.
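A hedged sketch of what such a cell might contain, using the meteor example's rate of λ = 1:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

lam = 1                        # one meteor strike per 100-year interval on average
k = np.arange(0, 6)

plt.bar(k, poisson.pmf(k, lam))
plt.xlabel('number of events')
plt.ylabel('probability')
plt.show()

print(poisson.pmf(1, lam))     # ~0.3679, matching the calculation above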
The difference between the two is that the Poisson distribution is used when dealing with the
number of occurrences in a given time frame, while the exponential distribution deals with the
probability of the waiting time until an event occurs.
The exponential distribution is a special case of the Gamma distribution. The gamma distribution is
used to calculate the probability of the kth event happening within a time frame. The exponential
distribution is the gamma distribution with the k parameter set to one.
Example 3.4.6. The classic example for the exponential distribution is survival analysis. Survival
analysis deals with the probability that a given mechanical part or device survives a given time
frame, usually given in years. For example, the probability that a car tire has to be changed
within 3 years, given the average time between car tire changes. It is called exponential, as the
chance of survival is high in newer states, but it decreases exponentially over time.
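A hedged sketch of the survival example using scipy.stats.expon; the 3-year mean time between tire changes is an assumed value:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

scale = 3                                # assumed mean time between tire changes, in years
t = np.linspace(0, 12, 200)

plt.plot(t, expon.sf(t, scale=scale))    # survival function: probability of no change before time t
plt.xlabel('years')
plt.ylabel('probability of survival')
plt.show()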
Question 42.
Why is randomness so hard to discuss both in a mathematical sense, and in computer science?
What are the main obstacles of implementing random algorithms on a computer?
Question 43.
What does Ω represent in probability theory?
Question 44.
What values are represented in a cumulative distribution function?
Question 45.
What are the statistical data types of which no difference can be measured?
Question 46.
Which categorical value can be ordered, but no distance can be measured between elements?
Question 47.
How are non-categorical statistical data categorized?
Question 48.
Which domains of numbers are usually represented as statistical data types?
Question 49.
What are the differences between numeric interval, and numeric ratio data types?
Question 50.
What are the three measurements which can be classified as "central tendencies"?
Question 51.
Which measurement can be used on categorical data?
Question 52.
Which quantiles correspond to the minimum, the maximum, and the median of the data?
Question 53.
What is the interquartile range of the data set?
Question 54.
What does variance measure in a data set?
Question 55.
What is the relationship between the variance and standard deviation of the random variable
X?
Question 56.
How many random variables are needed in order to calculate the covariance?
Question 57.
What does correlation measure in a data set? How can you use it to tell which sets of data are
similar?
Question 58.
What are the two measurements that can be used to measure how the distribution of a random
variable differs from a standard distribution?
Question 59.
What distribution would you use to model whether or not a die roll lands on 4 or a larger number?
Question 60.
What are the unique properties of the uniform distribution?
Question 61.
Which distribution is a special case of the binomial distribution? What is the parameter that
creates this special case?
Question 62.
Which distribution is related heavily to the central limit theorem?
Question 63.
What central tendencies does the normal distribution have?
Question 64.
Which distribution can be used to estimate the number of customers in a restaurant in a given
time frame?
Question 65.
Which distribution can be used to estimate the probability of a customer arriving in a given
time frame?
Chapter 4
Data Mining
It is said that every two years the amount of recorded data doubles. This means that every
two years more data is generated than ever existed before. This notion alone gives the
science community, and the industry, the motivation to develop new ways to handle data. Acquiring
data is rarely the problem in modern times: a huge number of industries and other sectors accumulate
gigantic data sets every day. The problems of measurement, data handling, and data storage are also
part of the data processing process, but they are usually handled by data engineers. This knowledge
of data engineering is entirely different from what the reader can read about in this lecture note.
Nevertheless, the subject is closely related to applied data analysis, as the responsibility of data
engineering is providing the data for later processing by statistical and machine learning methods.
Data science encompasses many different sub fields, and applications. The foundation to these
processes is data analysis. Mathematicians have long developed methods and algorithms which are
useful when dealing with large sets of data. The field of statistics was specialised in making sense
of large data sets. Formal methods of statistics can be used in order to extract useful knowledge
about data.
Data analysis is the process of using methods and techniques to extract previously unknown
properties and information from organized data. Of course, there are a lot of methods borrowed
from the field of statistics.
The process of analyzing data can be as simple as looking at the data and intuitively figuring out
inferences in it. In truth, many of those whom we now call data scientists started their careers as
statisticians. Recently, data science has drifted away from the theoretical world of mathematics
towards computer science.
The main goal of data analysis is to understand what kind of values, data types and other
quirks are included in the data set. The extraction of this information is essential, as methods and
applications heavily rely on this knowledge.
word datum. The fact that it is rarely used suggests that cases where a single datum is considered on
its own are nearly non-existent.
For this reason, creating structures out of individual data points, which represent complex data, is
necessary. One of the simplest ways to organize data is to put it into an array of a given data type.
Mathematical data structures such as vectors and matrices were described in algebra long before
computers existed [13, 20].
Arrays
Example 4.1.1. Let us imagine that you are interested in predicting the weather 1 hour from your
current time. In this example, all you have is a thermometer, paper and a pen to work with.
If you start writing down the temperature every hour, you will get an array of temperatures.
The logical way to record temperatures is with real numbers. In order to digitize this data, you
have to save it into a float or double array.
After you have created this array, you have successfully put your data into a structure that is easy
to use and understand.
Please note that the data in this array is not indexed by time. The only indexing it contains is the
sequence in which the data was written down.
15,5 16,4 17,6 18,1 19,2 19,5 19,3 19,4 19,5 18,9
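In Python, such a recording could be stored, for example, as a NumPy float array (a minimal sketch):
import numpy as np

temperatures = np.array([15.5, 16.4, 17.6, 18.1, 19.2, 19.5, 19.3, 19.4, 19.5, 18.9])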
While arrays of data can be processed easily, they cannot be intuitively understood by looking at the
data alone. The values of the temperatures could be represented in different units of measurement,
or could be recorded at different times. Therefore, external information must be given in order for
arrays of data to be understood.
Matrices
Adding another dimension to an array results in the creation of a matrix. Matrices contain rows of
data, where each row can represent an array of data.
Just as in arrays, the data in a matrix is not labelled. Therefore, the data structure itself
does not provide important information about the data.
Example 4.1.2. The data from the last example can be structured further. After you have
recorded the temperatures of a day, you can begin recording temperatures of the next days. After
you have recorded the temperatures of a few days, you can arrange them into a matrix. The data
can be sorted so that each row can represent a different day, while each column represents a different
point in time.
15,5 16,4 17,6 18,1 19,2 19,5 19,3 19,4 19,5 18,9
15,3 15,6 16 16,2 17 17,2 17,3 15,7 15,3 15,1
15,6 15,7 16,2 16,9 18,1 19,2 19,9 20,8 19,6 18,3
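The same data arranged as a matrix could be stored as a two dimensional NumPy array (a minimal sketch):
import numpy as np

# each row is one day of measurements, each column one point in time
temperatures = np.array([
    [15.5, 16.4, 17.6, 18.1, 19.2, 19.5, 19.3, 19.4, 19.5, 18.9],
    [15.3, 15.6, 16.0, 16.2, 17.0, 17.2, 17.3, 15.7, 15.3, 15.1],
    [15.6, 15.7, 16.2, 16.9, 18.1, 19.2, 19.9, 20.8, 19.6, 18.3],
])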
If a spreadsheet software is lacking features for a certain problem, it is advisable to use the more
advanced tools which were described in the previous chapter, Tools. Advanced tools most often do
not have their own proprietary file extensions, but some can parse spreadsheet encodings. However, it
is advisable to use CSV as the format for data processing tasks. All of the mentioned software is
capable of exporting data as a CSV.
Features
After parsing data into the spreadsheet software or data processing framework of choice, it is
essential to separate data based on its description. Data sets usually come with descriptions of what
each feature is and what the purpose of the data set is. A feature is a column representing a
measurement in an abstract data set. Features can be of any previously mentioned data type.
In order to parse data of certain types, for example strings or characters, special encoding techniques
can be used [11]. For example, standardization of numeric data can increase the effectiveness of
methods. Encoding techniques can be used to transform otherwise unusable data into a processable
format. Encoding techniques include:
1. One Hot Encoding
2. Label Encoding
3. Embedding
Some of those techniques will be explored in the following chapters.
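As a brief illustration ahead of those chapters, a hedged sketch with a hypothetical categorical column shows the first two techniques in pandas:
import pandas as pd

df = pd.DataFrame({'VehicleType': ['car', 'boat', 'plane', 'car']})

# one hot encoding: one binary column per category
one_hot = pd.get_dummies(df['VehicleType'])

# label encoding: one integer code per category
df['VehicleTypeLabel'] = df['VehicleType'].astype('category').cat.codes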
Target Values
If your goal is to find inferences in the data in order to predict a feature, you are either classifying an
object, or solving a regression problem. In both of these cases, there are a number of features which
you use to predict a different feature. The feature you are trying to predict is usually called the
target feature.
Data set descriptions usually come with predefined target features, but you could choose any
feature as a target, as long as it is processable with statistical methods, or machine learning.
Example 4.1.3. In this example, a CSV will be imported from the local file system into Pandas. Some
components have already been explored in subsection 2.6.3.
[2]: df=pd.read_csv('hotel_bookings.csv')
[3]: df.head()
arrival_date_week_number arrival_date_day_of_month \
0 27 1
1 27 1
2 27 1
3 27 1
4 27 1
reservation_status_date
0 2015-07-01
1 2015-07-01
2 2015-07-02
3 2015-07-02
4 2015-07-03
[5 rows x 32 columns]
[4]: df.describe()
In the block above, general statistics are displayed with the describe() function. It is always a
good idea to check the data's statistical parameters before transforming it. Sanity checking helps
the developer avoid anomalies and better understand the data set.
arrival_date_week_number arrival_date_day_of_month \
count 119390.000000 119390.000000
mean 27.165173 15.798241
std 13.605138 8.780829
min 1.000000 1.000000
25% 16.000000 8.000000
50% 28.000000 16.000000
75% 38.000000 23.000000
max 53.000000 31.000000
previous_cancellations previous_bookings_not_canceled \
count 119390.000000 119390.000000
mean 0.087118 0.137097
std 0.844336 1.497437
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 26.000000 72.000000
Machine learning is another subset of data science that is closely related to data mining. Machine
learning models are usually made up of black box algorithms. Machine learning uses artificial
intelligence models to learn the inferences of a specific data set. The models are trained to recognize
attributes without explicit instructions.
Machine learning usually utilizes the processed data set created by the data mining process.
The data set usually contains hidden knowledge and inferences, which are utilized to create
applications which can seemingly make intelligent choices.
Smaller data sets can be loaded into Microsoft Excel or other spreadsheet managing programs. These
tools can be used to create simple statistics and diagrams, and to extract information from smaller data sets.
On the other hand, complex data sets that contain a high number of features require more sophisticated
tools to manage. These tools often require deeper knowledge in the field of computer science.
One of the most widely used tools for managing enormous data sets is the Pandas package
of the Python programming language.
Feature Selection
Data sets often contain a lot of different features. These features describe different values of
specific elements. These instances often contain data that is redundant or not relevant to the problem
at hand.
Therefore, collecting information about the features of the data set is essential for making informed
decisions about which features to omit.
After the features have been selected, they can be extracted into a data set of reduced size. The
key to selecting features is to not lose any information related to the solution of the problem.
The most important tool while selecting data is common sense. Complex systems can detect
certain attributes, or compare elements in order to calculate a ratio, but they cannot deduce which
scenarios are impossible.
Example 4.3.1. For example, let’s look at a data set, which contains information about vehicles. The
following features can be found in the data set:
If your task was to analyse boats in this data set, it would be sensible not to include the "# of
Wheels" and "Has Wing" features. Of course, if you simply omit these features, vehicles with wheels
or wings would be indistinguishable from boats. Therefore, the data would have to be filtered first in
order to remove the rows containing other vehicles.
Example 4.3.2. Let us look at a data set with the following features:
Name, Country, Age, Sex, AVG. Work Hours, AVG. Salary / Month, Height, Weight, Birthday Date,
Marital Status, # of Pets, Education
Question 66.
Imagine working at a company, where your job is to analyse worker efficiency, and satisfaction.
Worker satisfaction is rated from 1 (Worst) to 10 (Best).
Which feature would you omit, in order to increase efficiency of your machine learning system?
After you have thought about the question for a moment, you may have realised that choosing
features based on these criteria is not an easy task. You could guess what kind of data types a
feature might hold, but without this knowledge, it is hard to make an informed decision.
Methods have already been created that are capable of detecting features that are redundant for
the solution of the problem. While these methods are important, they cannot replace common
sense.
For this reason, the first step should always be the manual analysis of the data set. This
increases the effectiveness of the aforementioned automatic feature selection methods, because the
algorithms have a reduced number of features to work with.
The following two algorithms are often used in order to reduce the number of redundant features
in data sets.
Example 4.3.3. Let us once again load the data with the pandas package. The Iris data set will be
loaded, and different selection methods will be presented.
[2]: df=pd.read_csv('[Link]')
[3]: df.head()
The column names of the data set can be examined by viewing the columns attribute of the data
frame.
[4]: df.columns
[6]: df_selected.head()
3 4.6 3.1
4 5.0 3.6
The second example uses selection based on indexes. Rows after the 100th and columns from the
third onward are selected.
[7]: df_selected_id=df.iloc[100:, 3:]
[8]: df_selected_id.head()
are orthogonal to each other. Therefore, principal components are linearly independent of each
other. Lastly, N of these principal components are selected in order to project the data into an
N-dimensional space.
Example 4.3.4. Luckily, the algorithms and methods described in this lecture note have already been
implemented, and can be readily used by the reader. The PCA algorithm has already been
shown in chapter 2.
Nevertheless, we will proceed with the newfound knowledge and implement it again with
three principal components.
Once again, the training data and targets are separated. The data is scaled, while the targets are
encoded using the LabelEncoder class.
[2]: iris_df=pd.read_csv('[Link]')
[4]: x_scaled=StandardScaler().fit_transform(x)
[5]: le = LabelEncoder()
y_num=le.fit_transform(y)
The following block contains the PCA transformation of the original data set. The PCA is set to
extract 2 principal components with the n_components parameter, resulting in a two dimensional data
set. The principal component data frame is created with the columns of pc1, pc2.
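The cell itself is not reproduced above; a minimal sketch of the described transformation, assuming the x_scaled array created earlier, might be:
from sklearn.decomposition import PCA
import pandas as pd

pca = PCA(n_components=2)
principalDf = pd.DataFrame(data=pca.fit_transform(x_scaled), columns=['pc1', 'pc2'])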
The following code block creates a three dimensional principal component data frame with
n_components set to 3. After the components have been extracted, the three dimensional data frame
is plotted using matplotlib’s three dimensional plotting capabilities.
pca = PCA(n_components=3)
principalDf = pd.DataFrame(data=pca.fit_transform(x_scaled), columns=['pc1', 'pc2', 'pc3'])
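A hedged sketch of the three dimensional plot described above, using the three principal components and the encoded targets y_num for the colours:
import matplotlib.pyplot as plt

fig = plt.figure(dpi=150)
ax = fig.add_subplot(projection='3d')
ax.scatter(principalDf['pc1'], principalDf['pc2'], principalDf['pc3'],
           c=y_num, edgecolor='k')
ax.set_xlabel('pc1')
ax.set_ylabel('pc2')
ax.set_zlabel('pc3')
plt.show()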
SVD can be applied to sparse matrices, while PCA cannot! However, SVD also suffers from sign
indeterminacy, which means that the output largely depends on the starting random state. Truncated
SVD is a variant of SVD which is used to truncate the resulting matrix of singular values to a certain
dimension. This is the variant used to reduce the dimensionality of data, as it can project the data into
a lower dimension by truncating the singular value vectors.
Example 4.3.5. In this example, the use of Scikit-Learn’s TruncatedSVD class is demonstrated.
The class performs the truncated SVD algorithm in order to reduce the dimensionality of the
data.
[2]: iris_df=pd.read_csv('[Link]')
x, y=iris_df.iloc[:,1:-1], iris_df.iloc[:,-1:]
x_scaled=StandardScaler().fit_transform(x)
le = LabelEncoder()
y_num=le.fit_transform(y)
The following code block contains the creation of the singular value vectors. The parameter passed
to the TruncatedSVD class is the target dimension.
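A minimal sketch of such a cell, reducing the scaled Iris features to two dimensions:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
x_svd = svd.fit_transform(x_scaled)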
These statistics also help in uncovering irregularities in our data set. The minimum, 25% quartile,
median, 75% quartile and the maximum are particularly useful in finding outliers in our data.
Example 4.3.7. If all of your data is increasing linearly up until the 75% quartile but jumps
exponentially in the last quartile, you might want to inspect it. A few outlier data points can
potentially alter the outcomes of statistical methods and machine learning models in a way that the
outcomes might not represent the original data well.
There are a few solutions to this problem: transforming the feature data using a logarithmic
function, or clipping. Using the logarithmic value of the feature values can minimize the
effect of extreme outliers. Clipping is the process of deleting rows of data based on a logical
criterion. If you have intuition or knowledge about a subject, and after you check the data
you find outliers that could not exist in the real world, it can be considered a measurement
error. The best action to take in this case is to clip the data above or below the threshold.
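A hedged sketch of both techniques on a hypothetical Value column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [1.2, 1.5, 2.0, 2.3, 250.0]})

# logarithmic transformation dampens the effect of the extreme outlier
df['LogValue'] = np.log(df['Value'])

# clipping: delete the rows above a chosen threshold
df_clipped = df[df['Value'] < 100]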
[2]: iris_df=pd.read_csv('[Link]')
[3]: iris_df
[6]: x
SepalPetalLenghtCm
0 6.5
1 6.3
2 6.0
3 6.1
4 6.4
145 11.9
146 11.3
147 11.7
148 11.6
149 11.0
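The column printed above appears to be the sum of the sepal length and petal length columns; a hedged sketch of how such a feature could have been created, assuming the usual Iris column names:
iris_df['SepalPetalLenghtCm'] = iris_df['SepalLengthCm'] + iris_df['PetalLengthCm']
x = iris_df['SepalPetalLenghtCm']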
Question 67.
What is the data structure which can be used to store and organize a set of measurements?
Question 68.
What data structure would you use to store several sets of measurements?
Question 69.
What is the difference between a data set and a matrix?
Question 70.
What data types can a data set hold?
Question 71.
What software can be used to create and analyze small data sets?
Question 72.
What is the relationship between the features and the target of the data set?
Question 73.
What are features? What is the difference between a raw data array and a feature vector?
Question 74.
What encoding techniques can be used to encode categorical data?
Question 75.
What sort of data types can target values be?
Question 76.
How can you display a descriptive analysis of the data set with pandas?
Question 77.
What is the name of the subfield of data science which is used to analyze and extract information
from data?
Question 78.
What are the steps of the data mining process?
Question 79.
What is the purpose of defining the problem at the start?
Question 80.
Which sources can be used for data sets?
Question 81.
What tools are recommended when working with big and complex data sets?
Question 82.
What is the point of selectively choosing features from the original data set?
Question 83.
What category of algorithm does the principal component analysis fall under?
Question 84.
What does the PCA algorithm achieve?
Question 85.
What are the essential steps of the PCA algorithm?
Question 86.
How can the PCA algorithm be used in visualization?
Question 87.
What are the similarities between the PCA and the SVD algorithm?
Question 88.
What is the name of the variation of the SVD which can be used to reduce dimensionality?
Question 89.
What is the problem with using raw data during data processing?
Question 90.
What are the types of data that have to be transformed in order to be used as features?
Question 91.
What is the benefit of using feature crosses?
Question 92.
Go to [Link] and download a dataset of your choice. Try to import it into Excel, and create
bar plots and pie plots of certain columns. Try to import the data set into a pandas data frame
using pandas.read_csv(csvname). After importing the data, look into what functionalities can
be used on DataFrames using the pandas DataFrame documentation.
Chapter 5
Machine Learning
In this chapter, we will talk briefly about the recent developments of machine learning technologies
and methodologies. Artificial intelligence has fascinated the human mind throughout history.
Even ancient Greek myths tell of moving, acting, and thinking machines. While the idea has
remained science fiction for the majority of human history, it has recently gained ground as a truly
viable and achievable goal.
Modern artificial intelligence systems [16, 5] will most probably never be able to freely think or feel
for themselves, but they are able to solve problems beyond human capabilities. Therefore, machine
learning models are not considered "intelligent", but rather an excellent tool to solve specific
problems.
Programmers have been able to solve complex problems for more than half a century. While
computer science is not an ancient science, it has achieved much over its short life span. The
difference between programming a solution and machine learning is the explicit programming.
Most problems could be solved by a machine by explicitly programming it to solve the abstracted
problem.
The machine uses its memory and its central processing unit in order to calculate its next action,
although it is not acting on its own whims. The machine is just automatically performing the actions
that it is programmed to do.
Machine learning, however, takes a different approach to problem solving. This approach takes
a methodical and formal examination of the problem and the data in order to create learning
models. The biggest requirement of these types of systems is data. In the last 30 years the
amount of recorded data has been growing exponentially. The other reason machine learning
models can thrive in the 21st century is the growing power of computers. Machine learning
models could not have been built any earlier, as they usually require huge amounts of computing
power.
Therefore, the two biggest requirements for machine learning are computing power and data. If one
possesses both of these requirements, modern machine learning systems can be integrated into
applications, or used independently to predict complex outcomes.
Machine learning processes can be categorized into two distinct classes: supervised and
unsupervised learning processes. We will briefly discuss the particularities of both categories in the
following sections.
f : v^N → t, where v, t ∈ R
The f function is the mathematical function equivalent of the original system. If we do not
know how the system processes the input in order to create the result, the system is essentially
a black box.
The goal is to mirror the system, by creating a function g with the same mapping as the original.
Mathematically, we can create the following basic formulation for supervised learning:
g : v^N → t
Where g is the learning function that we use to learn the original function f, v^N is the
N-long feature vector, and t is the target we are trying to predict. The "learning" is trying to
approximate the function f with different methods, without explicitly knowing the original
v^N → t mapping.
Of course, we can also formulate a number which describes the difference between f and g:
L=l(f(v)−g(v))
The L is usually called the loss of the model. It is called a loss because it measures the difference
between the actual output, and the predicted output from the function g. The function l is
called a loss function. The loss function is some mapping of the value that will result in a good
measurement of the loss value. This function can be different for different tasks. For example, if you
would like to classify cats and dogs, you would use a loss function that counts the number of bad
answers made by the function g.
If the loss is 0, the functions f and g are identical, and we have successfully created a machine
learning model which outputs the same exact result as our black box f function. Therefore our
goal should be to minimize L by changing the learning function g to better learn the mapping
of f.
Of course, this does not happen, and should not happen, because approximating a function will
always have errors. The original f function will work for whatever input combination it receives,
and will always output the correct result. As the function g learns from examples, its knowledge
goes only as far as the experienced input and output combinations. So if g receives an
unknown combination of feature values, it might output slightly different values.
Train-Test-Validation split
The data set can be split even further into three separate data sets, which increases the testing
accuracy, and generalization ability of the model. The training, and the testing set remain, but
a new data set is created called the validation set. The validation set is used for calculating the
loss after the training phase in order to determine model accuracy. Performing modifications on
the model could result in an increase of accuracy on the validation set.
This is required if the particular model has parameters which are not tied to the training itself, but
rather serve as inner parameters of the model. The process of finding such high-accuracy
parameters is called hyperparameter search. Finally, after achieving the desired accuracy on the
validation set, the testing set is used to evaluate the final trained model.
Split size
The problem with splitting the data set is that the model will have less data to train on. However,
if the testing data is not used during a separate testing phase, the model's generalization ability
might suffer. Therefore, it is a balancing act to determine the right ratio of training and testing
data. Decreasing the amount of training data will result in a less accurate model, but disregarding
the testing data creates a situation where there is no way to properly test the model's accuracy.
The commonly advised split ratio is 67% training data and 33% testing data. However, it is always important to
keep in mind that there is no golden rule. Smaller data sets might require smaller testing sets, as models
might not be able to properly learn from a small sample of data.
Example 5.2.1. In this example, we will use scikit-learn's train_test_split() method to easily split the data
set into 2 parts:
1. Training feature vectors and target features
2. Testing feature vectors and target features
The dataset will be used in regression problems in a later example.
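The cell itself is not shown; a minimal sketch, assuming the scikit-learn Diabetes data set and the single feature used in the classic scikit-learn example:
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X = X[:, np.newaxis, 2]          # a single feature, assumed as in the classic example

# 67% of the rows are used for training, 33% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)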
In regression tasks, the outputs are not discrete values. Instead, the output values are the product
of a continuous function, therefore they are real numbers. Examples of regression tasks are housing
price prediction and forecasting. Linear regression is among the simplest forms of regression.
Regardless, linear regression is often used to this day to approximate simple functions.
In linear regression tasks, a combination of input features are used in order to calculate an
output value in an N dimensional plane. With one predictor feature, the linear regression model can
be formulated as:
f : x → ŷ
Where x is the input feature, y is the target value and ŷ is the predicted outcome. The goal is to
determine the linear function f in which the loss between ŷ and y is minimal.
The linear function with 1 feature can be formulated as:
f(x) = w0 + w1 ∗ x
This formulation is the same as the equation of a line, y = m ∗ x + b, where m is the slope and b is the
y-intercept. In this equation, y and x are given: y is the actual output and x is the input variable.
In order to change the model's predictions, w0 and w1 can be changed. Changing these parameters
changes the outcome of the function, and therefore the loss. Originally, the function was optimized
using the least squares method, but a number of different loss measurements have been developed
over the years:
1. Mean Squared Error, or MSE.
Loss = (1/n) ∗ ∑ (target value − predicted value)²  (summing over all n data points)
MSE calculates the squared difference between the actual output and the predicted output.
This error is used in the least squares method, and provides a good measure of estimator
quality. An MSE of 0 means that the estimator predicts the same output as the original outputs for
every input. The problem with MSE is that outlier values can generate a large amount of error,
which can be problematic during training.
2. Mean Absolute Error, or MAE.
Loss = (1/n) ∗ ∑ |target value − predicted value|  (summing over all n data points)
MAE calculates the average absolute error of the given predictions. The error does not
account for direction. MAE handles errors in an absolute manner. If the errors can
be represented along a linear scale, MAE can be a useful measurement of error: for example,
when an error of 10 is half as bad as an error of 20.
3. Root Mean Square Error, or RMSE
Loss = √( (1/n) ∗ ∑ (target value − predicted value)² )  (summing over all n data points)
RMSE takes the root of the MSE error and is indifferent to the direction of the error. RMSE
represents the quadratic mean of the differences between the actual and the predicted
output. The problem of outlier data is still present in RMSE, although in a less problematic
way. RMSE is useful for calculating the loss when large errors caused by outliers are
particularly undesirable. If errors are on a quadratic scale, RMSE can be a particularly useful
measurement of error.
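A short sketch of the three measurements implemented directly with NumPy (the numbers are arbitrary):
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

print(mse(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))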
Linear regression can be easily visualized in two dimensional plots, but can be performed on
higher dimensional data sets. One of the most important attributes of linear regression models is
that they are only capable of finding linear relationships between the features and the target. In
short, the algorithm is trying to find a straight line across the data points which best represents
the linear relationships between the feature vectors, and the target feature. If the data points
are present in the given N dimensional environment in such way, that a straight line cannot
represent them well, then linear regression is not an appropriate model for the task.
Example 5.2.2. In this example, a simple linear regression model will be created using the previously
split Diabetes data set. Scikit-learn's LinearRegression() model is used to fit the line onto the
data set. Calling the fit() method trains the model with the training data Xtrain and the target
features ytrain. Finally, the predicted targets are calculated using the predict() function with
the test feature values.
The slope and the y-intercept are inspected using the regressor's coef_ and intercept_ attributes.
The previously discussed loss measurements are calculated using scikit-learn's mean_squared_error() and mean_absolute_error() functions.
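The code cell is not reproduced above; a minimal sketch of the described steps, assuming the X_train, X_test, y_train, y_test splits created earlier, might be:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

regressor = LinearRegression()
regressor.fit(X_train, y_train)          # train on the training split
y_pred = regressor.predict(X_test)       # predict the targets of the test split

print('Coefficients:')
print(regressor.coef_)
print('Intercept point:')
print(regressor.intercept_)
print('Mean squared error:')
print(mean_squared_error(y_test, y_pred))
print('Mean Absolute Error:')
print(mean_absolute_error(y_test, y_pred))
print('Root Mean Squared Error:')
print(np.sqrt(mean_squared_error(y_test, y_pred)))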
Coefficients:
[972.87627375]
Intercept point:
150.2626749624518
Mean squared error:
3934.0672763273196
Mean Absolute Error:
50.968422060002595
Root Mean Squared Error:
62.722143428994194
A bar plot is generated in order to show the difference between the first 25 predicted and actual output
values. Errors between the two outputs can be seen in the plot.
A line plot is generated in order to show the predicted line across the actual outputs.
The example shows a data set for which it is hard to find a line that fits the data well. All
measurements of error are relatively high, and therefore a linear regression model might not be the
best solution to our problem.
The odds of an event with probability P(x) are defined as:
odds = P(x) / (1 − P(x))
Let us say that out of 100 tries, a person will have 0.25, or 25%, probability to guess the right
answer to a question with just one hint. Therefore, the odds are 0.25 / (1 − 0.25) = 0.33333333333. If for
some reason they would be able to guess 80% of the time with one hint, the calculation would
be 0.8 / (1 − 0.8) = 4. As you can see, the difference between probability and odds is that odds are not
bounded between 0 and 1, but between 0 and ∞.
The trick is to take the logarithm of the odds in order to transform it onto a range of ( −∞, ∞). This
transformed log-odds value is linear across all probabilities. The goal of logistic regression is to use
an estimator model to approximate the log-odds mapping of the original data. For this, we use
regression analysis.
The problem becomes:
w0 + w1 ∗ x = log(odds)
Where x is the feature vector (in this example, the hints). The line parametrized by w0 and w1 is
called the decision boundary of the logistic regression. The line creates two distinct classes in the
data, and depending on which side of the boundary the odds of an input falls, the data can be
categorized into one of these classes. Luckily, the probability of the outcomes can be calculated
from the log-odds (also called logits) using the logistic function.
f(x) = 1 / (1 + e^(−x))
The model calculates the odds of the input values using the two variables w0 and w1 and the logistic
function. A logarithm is used during the error calculation. The error function in the case of logistic
regression is called the logistic loss.
The individual loss of a given data point xi can be calculated as:
L(xi) = −log(logistic(xi)) if y = 1; −log(1 − logistic(xi)) if y = 0
Where y is the actual class of the given input row xi. In order to calculate the loss of the logistic
regression, the mean of all individual errors has to be calculated:
L = (1/m) ∗ ∑ L(xi), summing over all m data points.
If there are more than two possible categories of outcomes, multinomial logistic regression models are
used.
Example 5.2.3. The following example will contain the use of logistic regression on a three class
classification task. The Iris data set is used as the data, where the selected features are the sepal
length, and width, and the target feature is the variety of iris flower encoded onto an integer
array.
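The fitting code itself is not shown above; a hedged sketch of it, with the column names of the Iris data set assumed:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X = iris_df[['SepalLengthCm', 'SepalWidthCm']].values      # sepal length and width
y = LabelEncoder().fit_transform(iris_df['Species'])       # flower variety as integers

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

regressor = LogisticRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)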
[4]: y_pred
The predicted values for the flower variety can be inspected as the return value of the regressor’s
predict() function.
[4]: array([1, 0, 2, 1, 2, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 2, 2, 1, 1, 2, 0, 2,
0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 1, 2, 0, 0, 0, 1, 2, 2, 0,
0, 1, 2, 2, 2, 2])
As the predicted values are discrete, the values of the bar plot show which values were predicted
incorrectly by the logistic regressor.
The following figure shows the data points with x axis being the continuous sepal width, and the y
axis being the flower type. The color blue is assigned to the actual target values, while the red dots
represent the predicted values. It can be clearly seen, that the regressor has misclassified some of the
data.
[6]: plt.figure(dpi=150)
plt.scatter(X_test[:,1], y_test, color='blue')
plt.scatter(X_test[:,1], y_pred, color='red')
plt.xticks(())   # axis ticks are hidden
plt.yticks(())
plt.show()
The last figure shows the data points in a two dimensional space, along the sepal length, and the
sepal width axes. The colour of the dots represents the actual target values, and the background
colour represent the predicted decision boundaries.
[7]: plt.figure(dpi=150)
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02  # step size of the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = regressor.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)   # colormap assumed from the standard scikit-learn iris example
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
The calculated error is then used to iteratively modify the model’s weights and biases in order to
perform better at the given task. The modification of these parameters can be different for different
models. We are going to present the method of machine learning training on a Feed Forward
Artificial Neural Network Model (Feed Forward ANN).
5.3.1 Perceptron
Artificial neural network models have been one of the most important technological tools of the past
10 years. However, artificial neural networks have existed long before that.
The idea of a network of artificial neurons originates from the 1950s, when researchers
formulated an artificial agent. The agent was called a Perceptron [15], and it was capable of processing
incoming data through its incoming connections, and then outputting the processed outcome through an
outgoing connection to the environment. The basic structure of the artificial perceptron can be seen
on Figure 5.1.
The idea behind the perceptron is to transform the list of input values in a way that outputs a
given target output. The similarities to the previously discussed regression models in subsection
5.2.2 and subsection 5.2.3 are not a coincidence. Perceptrons essentially perform the same tasks as
the regression models.
For a given input vector X, the perceptron outputs an output ŷ. In linear regression, a linear function was
sought for the given data points such that the difference between the output of the linear
function and the original data set was minimal. Perceptrons perform the same task by weighting
certain input values. The input connections of the perceptron represent the different features of the
data set.
A perceptron constructed to learn a data set with two features and one output target would have
two input connections, and one output connection. The different input connections would transmit
the different feature values to the perceptron. After receiving input, the perceptron calculates a
linear combination of the inputs. Each input is weighted in the linear combination according to a
weight parameter of the connection.
For two input values, the calculation uses the following formula:
(x1 ∗ w1 + x2 ∗ w2) + b
Where x1 and x2 are the feature values, w1 and w2 are the connection weights, and b is a bias.
The bias is a unique weight of the neuron, representing the y-intercept value of the linear function.
Finally, the result of the linear combination of inputs and weights is passed through an activation function
f.
o = f((X ∗ W) + b),  W = [w1, w2],  X = [x1, x2]
X ∗ W is the dot product of the input vector and the weight vector.
Activation Functions
Activation functions can be any function, but specific functions have become standards over the
years. Originally, no activation function was used, which can be represented with the identity
function.
f(x) = x
The identity function can be useful when trying to solve regression problems with a perceptron. The
problem with the identity function is that the linear combination of values is only capable of
producing linear outputs. Therefore, if the data set contains non-linear relationships between
features, the perceptron will not be able to predict them effectively. This was such a big problem that
it originally caused one of the AI winters.
Nobody thought that the perceptron was of any use, and the idea was abandoned altogether.
The research surrounding perceptrons and neural networks continued after researchers introduced new,
non-linear activation functions. Non-linear activation functions allow the modelling of non-
linearity of the data set. The first non-linear activation function was the hyperbolic tangent (tanh())
function.
tanh(x) = (1 − e^(−2x)) / (1 + e^(−2x))
As mentioned before, the tanh function is a non-linear function. It is bounded in the range of
(−1, 1). This boundary gives the perceptron an output transformation that can also be negative.
After the success of the tanh function, many more activation functions have been identified to have
unique and useful properties for the output transformations.
The logistic function has been discussed before in subsection 5.2.3. The logistic function itself
is a non-linear function, that transforms any value into the range of (0,1). This can be useful,
as it transforms values into the boundaries of probability. In modern machine learning the
logistic function has been called the sigmoid function, and both of these terms refer to the same
function.
σ(x) = 1 / (1 + e^(−x))
Logistic regression has shown that the logistic function can be used for modelling probability, and
therefore classification. However, the logistic function is mostly effective when the classes are binary.
The principles are the same with the perceptron.
Another useful activation function for classification tasks is the softmax function. The softmax
function can be used for multi-class classification tasks. The output of this activation function is a
vector, containing probabilities of each class. These classes of course make up all possible choices,
therefore the sum of all probabilities in the vector is 1. Therefore, the softmax function generates a
probability distribution from the input features.
softmax(xi) = e^(xi) / ∑j e^(xj)
A newer variant of the identity function that has been adopted for neural network training is
the relu function. relu stands for Rectified Linear Unit, as it is the identity function with a cut-off
at y = 0.
relu(x) = 0 if x < 0; x if x >= 0
Relu is useful, as it is non-linear and easy to compute, despite being called a linear unit. Because
of the easy computation, it is also fast. Relu retains important information in its original form,
instead of transforming it onto different boundaries. The downside of this property is that the
relu function does not scale big values down, which can cause problems during calculation.
Many of deep learning’s successes are attributed at least partly to the relu activation function.
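A short sketch of the discussed activation functions implemented with NumPy:
import numpy as np

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))     # shifting the input keeps the exponentials numerically stable
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(tanh(z), sigmoid(z), relu(z), softmax(z))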
Example 5.3.1. Let us go through an example of what a perceptron calculates with a given input.
To summarize how perceptrons work, we will use the individual parts in the following example.
The perceptron itself processes the input using the following formula: o = f((x1 ∗ w1 + x2 ∗ w2) + b).
1. x1, x2 ∈ X is the input vector
2. w1, w2 ∈ W is the weight vector
3. b is the bias of the perceptron
4. f() is the activation function
Let us say that W = [2, 1] and b = 2. W is the vector form of (w1 = 2, w2 = 1).
Let us assume that our input X is [1, 3].
The calculation should now look like this:
f((1 ∗ 2 + 3 ∗ 1) + 2) = f(5 + 2) = f(7)
f(7) ≈ 0.9991
[1]: import numpy as np
def sigmoid(x):
    return 1/(1 + np.exp(-x))
[2]: sigmoid(7)
[2]: 0.9990889488055994
So, for the input values of X = [1, 3] to a perceptron with weights W = [2, 1] and b = 2, the
output is roughly 0.9991.
Let us create a basic implementation of this structure in python.
The first block contains a Perceptron class which implements the basic functionalities of the
perceptron. The __init__ method is the constructor of the class, which is called when instantiating
the class with Perceptron(). The weights and bias are parameter values, while activ_function represents
the activation function, which is also passed as a parameter.
class Perceptron:
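    # The body of the class is a hedged sketch based on the description above;
    # the exact attribute names are assumptions.
    def __init__(self, weights, bias, activ_function):
        self.weights = weights
        self.bias = bias
        self.activ_function = activ_function

    def execute(self, inputs):
        # linear combination of the inputs and weights plus the bias,
        # passed through the activation function
        weighted_sum = sum(w * x for w, x in zip(self.weights, inputs)) + self.bias
        return self.activ_function(weighted_sum)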
The sigmoid function is used for the activation function, which can be passed as a parameter.
[3]: W=[2,1]
b=2
perceptron=Perceptron(W, b, sigmoid)
The execute() function can be called with a set of input values in order to calculate the outcome in
the same way as the calculation done by hand.
[4]: perceptron.execute([1,3])
[4]: 0.9998766054240137
There are three distinguished layers in feed forward networks. The input layer represents the
incoming data features; the number of its nodes is the number of features in the data set. The
output layer contains the output neuron nodes, whose outputs are given by the neuron
calculations.
If there are any layers in between the input and the output layers, they are called hidden layers. The
hidden layers provide complexity to the artificial neural network. A single layer of neurons in the
hidden layer provides a way to further process the data before it reaches the output neurons.
Increasing the number of hidden layers creates an inherent complexity of calculations that is very
hard to follow and calculate by hand.
Each neuron connection contains separate weight vectors and each neuron has separate biases.
This structure is sufficiently complex to approximate complicated functions. The problem with
increasing the number of hidden layers is that the model essentially becomes a black box, with
little to no way to explain the reason behind outcomes.
Example 5.3.2. In the next example, the output of a simple neural network with 3 neurons will be
calculated.
Let us work with our previous example. Assume that the connections of all neurons have the
same weight vector W = [2, 1] and bias b = 2. The network consists of an output neuron and
two hidden neurons.
With the input X = [3, 1], the calculation of the hidden neurons will be the following:
1. f(W ∗ X + b)
2. h1 = sigm((2 ∗ 3 + 1 ∗ 1) + 2)
3. h1 = sigm(9) ≈ 0.9999 = h2
Since all weights and biases are the same, the output of both hidden neurons is going to be the same.
Since the hidden neurons are connected to the output neuron, their outputs are sent to the output
neuron.
1. o = sigm((2 ∗ 0.9999 + 1 ∗ 0.9999) + 2)
2. o = sigm(4.9996)
3. o ≈ 0.9933
Therefore, the final output of the artificial neural network will be roughly 0.9933.
In the following code blocks, a simple 3-neuron artificial neural network is implemented.
Separate weights, biases, and activation functions could be used, but all neurons use the same
weights for the sake of keeping the implementation simple. The perceptron model from the last
example is renamed to Neuron. The ANN class uses global weights and biases with two hidden
neurons and one output neuron. The neural network calculation is executed by calling the output
neuron's execute function with the return values of the hidden neurons' outputs.
class Neuron:
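    # Hedged sketch of the class bodies: Neuron mirrors the Perceptron from the
    # previous example, and ANN wires two hidden neurons into one output neuron.
    def __init__(self, weights, bias, activ_function):
        self.weights = weights
        self.bias = bias
        self.activ_function = activ_function

    def execute(self, inputs):
        weighted_sum = sum(w * x for w, x in zip(self.weights, inputs)) + self.bias
        return self.activ_function(weighted_sum)


class ANN:
    def __init__(self, weights, bias, activ_function):
        # all three neurons share the same weights, bias and activation function
        self.h1 = Neuron(weights, bias, activ_function)
        self.h2 = Neuron(weights, bias, activ_function)
        self.output = Neuron(weights, bias, activ_function)

    def execute(self, inputs):
        # the output neuron receives the outputs of the two hidden neurons
        return self.output.execute([self.h1.execute(inputs), self.h2.execute(inputs)])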
[4]: W=[2,1]
b=2
ann=ANN(W, b, sigmoid)
[5]: ann.execute([3,1])
[5]: 0.993304687623845
You might wonder how these models are capable of predicting the outcomes of complex processes.
The network is composed of only weights and data transformations. How is this going to
classify what kind of flower is being described, or whether or not a person has diabetes, based
on previous data?
Of course, a neural network can be thought of as a form of function approximation. There is
a hidden function, which takes a number of inputs and outputs the right answer all the time.
As discussed before, machine learning models are trying to learn this hidden function through
learning. But what is the process of learning exactly? In the case of linear and logistic regression,
it was the modification of two parameters, and calculating the loss. The training process is
exactly like that for all supervised machine learning models.
Training the neural networks involves modifying independent parameters in order to slightly
modify the outcome.
1. First, the output is calculated for a set of the input data.
2. The loss is calculated between the actual outcomes and the predicted outputs.
3. Based on the loss, an optimizer algorithm is used to slightly modify the parameters of the
model.
The optimizer must be able to decide which parameters to change and in what way. A slight change
of the model parameters can change the output drastically. Therefore, changing the parameters once
is not enough. In machine learning, an iteration of training is usually called an "epoch".
The model is trained until the loss of the model is less than a minimal amount of difference, or the
predefined amount of epochs have passed. In the case of the neural network, the trainable
parameters are the weights and biases of each connection between the neurons.
We’ll use our neural network, and we’ll get the following table of outputs:
Index y ŷ
0 1 1
1 0 1
2 1 1
3 0 1
Let us calculate the mean squared error of the actual and the predicted outputs.
MSE = (1/4) ∗ ((1 − 1)² + (0 − 1)² + (1 − 1)² + (0 − 1)²) = (1/4) ∗ (0 + 1 + 0 + 1) = 2/4 = 0.5
So, despite our model only producing 1s, it is still right 50% of the time! Some might call that a win,
but you might quickly realise that being right 50% of the time on a binary task actually means that
the model is completely unreliable: it is no better than guessing. Of course, looking at the outputs
we can see that the model has not really learned anything, as it predicts 1 for every input.
Nevertheless, we are going to improve this probability by training our model.
In the following codeblocks, a simple implementation of the mean squared error loss can be seen.
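The implementation cell is not reproduced; a minimal sketch of the function used below could be:
import numpy as np

def mean_squared_error(y, y_predict):
    # average of the squared differences between the actual and predicted outputs
    return np.mean((y - y_predict) ** 2)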
[2]: y=np.array([1,0,1,0])
y_predict=np.array([1,1,1,1])
mean_squared_error(y, y_predict)
[2]: 0.5
As we discussed before, the optimizer’s task is to find the optimal model state in order for it to
function as an appropriate approximator of the original function. In this case, the measurement
we’re using is the MSE. Therefore, the job of the optimizer is to somehow iteratively reduce the
value of the MSE.
By looking at the table of data, we know that the model has actually made two mistakes: the rows with index 1 and 3 were predicted as 1 instead of 0. So the goal of the optimizer is to change the connection weights and biases in some way that makes the neural network output 0 for these inputs. Of course, changing the weights could also change the outputs of the correctly predicted rows.
In order to optimize this process, we will have to look at this problem from a different angle.
On Figure ??, the neural network structure can be seen with individual weights and biases.
The output is determined by the input and by the transformation performed by the neural network. This transformation can only be changed by changing the weights and biases. Therefore, the loss of the whole neural network can be expressed as a function of these parameters:
L = L(w1, w2, w3, w4, w5, w6, b1, b2, b3)
In order to calculate how much changing the w2 weight would change the loss, the partial derivative ∂L/∂w2 has to be calculated.
At this point, the calculation becomes problematic, because computing this value directly, by hand or by a computer, takes a lot of resources. Therefore, we are going to use a very useful rule of calculus, the chain rule, which decomposes the derivative of composite differentiable functions.
Using the chain rule on this partial derivative creates the following decomposition:

∂L/∂w2 = (∂L/∂ŷ) ∗ (∂ŷ/∂w2)

The first term follows from the squared error of a single sample; here the true label is y = 1, so L = (1 − ŷ)² and

∂L/∂ŷ = −2 ∗ (1 − ŷ)
We have successfully calculated one part of the equation, but ∂ŷ/∂w2 remains to be calculated.
In order to calculate this component of the derivative, we have to know how the output of the network is produced. As discussed before, it is calculated as ŷ = f(w5 ∗ h1 + w6 ∗ h2 + b3). The w2 weight only affects the hidden neuron h1 (in the ANN class below, h1 uses the weights w1 and w2). Therefore, we can decompose ∂ŷ/∂w2 in the following way:

∂ŷ/∂w2 = (∂ŷ/∂h1) ∗ (∂h1/∂w2)

Where the first term can be calculated by differentiating ŷ with respect to h1:

∂ŷ/∂h1 = w5 ∗ f′(w5 ∗ h1 + w6 ∗ h2 + b3)
Where the derivative function f′() is the derivative of the sigmoid function. One important attribute of activation functions is that they are easy to differentiate. The reason is that during training, the function has to be differentiated a huge number of times, so to save time and computing resources it is only logical to choose functions whose derivatives are cheap to evaluate.
The derivative of the sigmoid function sigm(x) = 1 / (1 + e^(−x)) is sigm′(x) = sigm(x) ∗ (1 − sigm(x)). We can calculate the second term of the decomposed derivative using h1 = f(w1 ∗ x1 + w2 ∗ x2 + b1):

∂h1/∂w2 = x2 ∗ f′(w1 ∗ x1 + w2 ∗ x2 + b1)
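As a quick sanity check of the identity above, the sigmoid function and its derivative can be written in a few lines of Python; the helper names here are illustrative and are not part of the original listing.

import numpy as np

def sigmoid(x):
    # sigm(x) = 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # sigm'(x) = sigm(x) * (1 - sigm(x))
    s = sigmoid(x)
    return s * (1 - s)

For example, sigmoid_derivative(0) evaluates to 0.25, matching sigm(0) ∗ (1 − sigm(0)) = 0.5 ∗ 0.5.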
Backpropagation
Backpropagation [9] is used by gradient-based optimization methods. The idea behind this method is to use the loss at the output to propagate the importance of each parameter backwards through the network. The contribution of each parameter is propagated back by calculating the partial derivatives backwards from the output layer towards the input layer.
The stochastic gradient descent (SGD) algorithm uses the same ideas and formulas as the gradient descent algorithm. The difference is that while gradient descent calculates the gradient over all training input values, the stochastic version randomly chooses one data row at a time to optimize the whole network on. Therefore, a single update step of stochastic gradient descent is significantly faster. However, since only one input value is included in each update, it can also take significantly more epochs to train the network.
Training with the stochastic optimizer is also volatile, because the choice of input can significantly change the efficiency of the optimization process. As a side effect of this volatility, multiple runs can produce different accuracies. The random nature of this method can also prove useful in certain scenarios, as the volatility can help the optimizer escape local optima and find a global optimum more effectively. This effect is attributed to the random jumps in loss that occur when optimizing with values in the data set that do not follow its general shape.
The mini-batch method was born as a combination of the gradient descent and stochastic variants. The method picks a random batch of data rows from the input. The number of elements in the batch is not fixed and can usually be set. Setting this parameter to 1 results in stochastic gradient descent, while setting it to the number of rows results in the gradient descent optimization algorithm.
Mini-batch gradient descent is one of the most widely used optimizer algorithms in machine learning. It provides a good middle ground between the volatile SGD and the slow gradient descent algorithm. The gradient descent algorithm is still more reliable, but numerous experiments have shown that the added stability does not improve learning tremendously. The benefit of using mini-batching instead of the stochastic variant is the reduced random movement of the loss during training. Mini-batching therefore provides a relatively fast and stable optimization process.
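The practical difference between the three variants is simply how many rows contribute to each gradient step. The sketch below (not from the original listing; the data shape and batch size are illustrative) shows how the rows could be selected for each variant with NumPy:

import numpy as np

data = np.random.rand(100, 2)      # placeholder data set: 100 rows, 2 features
n_rows = data.shape[0]
batch_size = 16                    # 1 would give SGD, n_rows the full gradient descent

full_batch = data                                                 # gradient descent: all rows
sgd_row = data[np.random.randint(n_rows)]                         # SGD: one random row
batch_idx = np.random.choice(n_rows, batch_size, replace=False)
mini_batch = data[batch_idx]                                      # mini-batch: a random subset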
Learning rate
The learning rate is a variable used during the learning process. It is either fixed at the start of the learning process and not changed throughout the optimization, or changed dynamically to fit the needs of the optimization algorithm. It is a measure of how fast the algorithm should train the network, and it controls the size of the change applied to the weights after each update.
In order to train the model, the optimizer has to change the weights. The learning rate is a multiplier applied to the partial derivative; the resulting value is then subtracted from the original weight value in order to obtain the new weight.
wi = wi − η ∗ ∂L/∂wi
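For example, with a learning rate of η = 0.1, a current weight wi = 0.5, and a partial derivative ∂L/∂wi = 0.2 (illustrative values), the updated weight would be 0.5 − 0.1 ∗ 0.2 = 0.48.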
The ANN class is extended so that each connection has its own weight and bias. A train method is introduced which iteratively optimizes the weights of the network using the SGD optimizer algorithm. The loss values are printed during training, which enables the user to analyze the training results mid-run.
A history is returned from the training, which contains the accumulated epoch indexes and loss values.
[5]: class ANN:
         def __init__(self):
             # every weight and bias gets its own random starting value
             self.w1 = np.random.normal()
             self.w2 = np.random.normal()
             self.w3 = np.random.normal()
             self.w4 = np.random.normal()
             self.w5 = np.random.normal()
             self.w6 = np.random.normal()
             self.b1 = np.random.normal()
             self.b2 = np.random.normal()
             self.b3 = np.random.normal()
             self.h1 = Neuron([self.w1, self.w2], self.b1, sigmoid)
             self.h2 = Neuron([self.w3, self.w4], self.b2, sigmoid)
             self.o = Neuron([self.w5, self.w6], self.b3, sigmoid)

         def execute(self, x):
             # forward pass through the two hidden neurons and the output neuron
             return self.o.execute([self.h1.execute(x), self.h2.execute(x)])

         def train(self, data, y):
             learn_rate = 0.1
             epochs = 1000
             hist = []
             for epoch in range(epochs):
                 for x, y_true in zip(data, y):
                     # forward pass, keeping the intermediate sums for the derivatives
                     sum_h1 = self.h1.weights[0] * x[0] + self.h1.weights[1] * x[1] + self.h1.bias
                     h1 = sigmoid(sum_h1)
                     sum_h2 = self.h2.weights[0] * x[0] + self.h2.weights[1] * x[1] + self.h2.bias
                     h2 = sigmoid(sum_h2)
                     sum_o1 = self.o.weights[0] * h1 + self.o.weights[1] * h2 + self.o.bias
                     o1 = sigmoid(sum_o1)
                     y_pred = o1
                     # partial derivatives, following the chain rule derivation above
                     # (sigm'(x) = sigm(x) * (1 - sigm(x)))
                     d_L_d_ypred = -2 * (y_true - y_pred)
                     d_ypred_d_w5 = h1 * o1 * (1 - o1)
                     d_ypred_d_w6 = h2 * o1 * (1 - o1)
                     d_ypred_d_b3 = o1 * (1 - o1)
                     d_ypred_d_h1 = self.o.weights[0] * o1 * (1 - o1)
                     d_ypred_d_h2 = self.o.weights[1] * o1 * (1 - o1)
                     d_h1_d_w1 = x[0] * h1 * (1 - h1)
                     d_h1_d_w2 = x[1] * h1 * (1 - h1)
                     d_h1_d_b1 = h1 * (1 - h1)
                     d_h2_d_w3 = x[0] * h2 * (1 - h2)
                     d_h2_d_w4 = x[1] * h2 * (1 - h2)
                     d_h2_d_b2 = h2 * (1 - h2)
                     # SGD updates: w = w - learn_rate * dL/dw
                     self.h1.weights[0] -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1
                     self.h1.weights[1] -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w2
                     self.h1.bias -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_b1
                     self.h2.weights[0] -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w3
                     self.h2.weights[1] -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w4
                     self.h2.bias -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_b2
                     self.o.weights[0] -= learn_rate * d_L_d_ypred * d_ypred_d_w5
                     self.o.weights[1] -= learn_rate * d_L_d_ypred * d_ypred_d_w6
                     self.o.bias -= learn_rate * d_L_d_ypred * d_ypred_d_b3
                 if epoch % 10 == 0:
                     y_preds = np.apply_along_axis(self.execute, 1, data)
                     loss = mean_squared_error(y, y_preds)
                     print("Epoch %d loss: %.3f" % (epoch, loss))
                     hist.append([epoch, loss])
             return np.array(hist)
y = np.array([1, 0, 1, 0])
[8]: hist = pd.DataFrame(hist)  # assumption: the training history is wrapped in a DataFrame for inspection
To create and run such a neural network structure in a modern Python library, only a few lines of code are required. The code block below shows the creation, training, and evaluation of a neural network consisting of a hidden layer with 50 neurons, followed by one with 100.
There are two feedforward neural network implementations in Scikit-Learn: MLPRegressor and MLPClassifier. The MLPRegressor can be used to learn regression problems, while MLPClassifier is used to learn classification tasks. It is always a good idea to scale the original data, as the gradient descent algorithm does not work well with features of different scales.
[2]: X, y = datasets.load_diabetes(return_X_y=True)
X_scaled=StandardScaler().fit_transform(X=X)
The model can be changed easily by passing different parameters to the MLPRegressor class. There are a number of different optimizers and activation functions available to experiment with. The Scikit-Learn package does not support setting activation functions per layer, or mixing different layer types. Keras is recommended for implementing advanced neural network structures.
[4]: model = MLPRegressor(max_iter=1000, hidden_layer_sizes=(50, 100)).fit(X_train, y_train)
5.4.3 Prediction
After training, the artificial neural network can be used to make predictions on new data. This can be achieved by propagating the input features through the network. The output of this propagation is the predicted value.
Example 5.4.4. The scoring mechanism of both neural network implementations will be presented
in this example.
0.9217925143269068
0.3532494722460966
0.338036614190695
Following up on the previous MLPRegressor example, we'll inspect the prediction method of the model. In order to predict with the created model, the model's predict() function has to be used. The function works with a matrix of input values by default. In order to predict only one row of data, the row has to be reshaped into a matrix with a single row.
[6]: array([165.90817188])
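A call of roughly the following shape produces such a single-row prediction; the specific row index and the X_test name are illustrative assumptions, not taken from the original cell.

single_row = X_test[0].reshape(1, -1)   # reshape the 1-D feature vector into a (1, n_features) matrix
model.predict(single_row)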
The score() function of the MLPRegressor returns the coefficient of determination (R²) of the model. It can be seen easily that the model did not learn the data very well: an R² of around 0.5 means that only about half of the variance in the target is explained, while a score of 0 would correspond to simply predicting the mean. You can experiment with the model by setting different parameters on the MLPRegressor. The list of available parameters can be found in the MLPRegressor documentation.
5.4.4 Overfitting
Increasing training time and the amount of training data might sound like an intuitive way of increasing model performance. Although increasing the number of epochs can decrease the loss over time, the model might not get better at actually predicting the true outcomes of unseen data. This can be explained by model overfitting. As we have discussed before, when a model is trained with data,
it is actually trying to approximate some hidden function which connects the data points in the N
dimensional space. If we try to learn this data with linear regression, we are trying to fit a straight
line in the N dimensional space that best fits the data points.
This process is the same with all supervised learning models, but instead of a straight line, we are trying to fit a higher-dimensional function onto the data points in the same space. The data points usually describe a general shape along which they were recorded and labelled. It is easy to see that data points outside this general shape can occur. Overfitting appears when the model is trained so rigorously that it replicates the exact shape of the input data set. The approximated function becomes very complex, and it can describe input values that lie within the shape of the original data set. However, if an overfit model is given a data point that is an outlier with respect to the training data set, the model will most probably fail to predict the output with small error.
Overfitting is a serious issue during model training, as it decreases the model's generalization ability in the prediction phase. There are methods which reduce overfitting, but it is a problem that is hard to get rid of completely. In short, training the model to follow the original data set exactly decreases its ability to form a general function approximation that is capable of correctly predicting unseen data.
Dropout
One of the simplest and most effective ways to avoid overfitting is to use a special technique called a dropout layer. A dropout layer consists of a probability p, and it is inserted before a given fully connected layer. Each neuron of the following layer has probability p of dropping out of the next training epoch. Overfitting can often be attributed to over-reliance on certain neurons in the network. Some neurons can accumulate very large weights, while the others slowly decrease in importance. This can lead to a situation where an input that is not strongly connected to the dominant neuron generates a high amount of loss.
Dropping neurons in the layer randomly decreases the chance of such over-reliant neurons appearing during training. It is important to only use dropout during training, as predicting with a thinned-out network would lead to bad prediction values.
To compensate for the neurons that were dropped during training, the outputs of the connected feedforward layer are scaled by the keep probability (1 − p) in the prediction phase, so that the expected magnitude of the activations matches what was seen during training.
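Since Scikit-Learn does not offer dropout, Keras (recommended earlier for advanced structures) can be used instead. The sketch below is illustrative only: the layer sizes, the input dimension of 10, and the dropout rate of 0.2 are assumed values, not taken from the original material.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(50, activation="relu", input_shape=(10,)),  # hidden layer (assumed size)
    Dropout(0.2),   # during training, each output of the previous layer is dropped with probability 0.2
    Dense(100, activation="relu"),
    Dense(1),       # single regression output
])
model.compile(optimizer="adam", loss="mse")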
Regularization
Another useful technique is called regularization. Regularization uses penalty functions that are used regularly in statistical analysis. The idea behind regularization is to discourage the model from learning overly complex relationships by adding a penalty term to the loss during optimization.
Lasso Regularization
There are two types we'll discuss: L1 (Lasso) and L2 (Ridge) regularization. L1 regularization takes the loss function and adds a regularization term λ||x|| to it, where ||x|| is the sum of the absolute values of the weights (the L1 norm) and λ is the regularization parameter. This parameter is a static value, set at the start of the training. L1 regularization often serves as a way to decrease the weights of less important features. L1 is capable of reducing weights exactly to 0, which effectively deactivates certain neurons.
Ridge Regularization
L2 regularization effectively uses the same idea as L1, but penalizes the squares of the weight values. The regularization term for L2 regularization is λ||x||², where ||x||² is the squared L2 (Euclidean) norm of the weights, multiplied with the regularization parameter λ. L2 regularization achieves a similar shrinking effect to L1 regularization, but it cannot reduce neuron connections exactly to 0 weights. For this reason, L2 regularization is used more often, as the deactivation of neurons is not a goal of regularization in most cases.
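Written out next to the mean squared error loss, the two regularized loss functions are (using the same x notation for the weight vector as above):

L_L1 = MSE + λ ∗ ||x|| = MSE + λ ∗ (|x1| + |x2| + … + |xn|)
L_L2 = MSE + λ ∗ ||x||² = MSE + λ ∗ (x1² + x2² + … + xn²)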
Example 5.4.5. In Scikit-Learn's neural network implementation, only the L2 regularization term can be used during training. Dropout and other regularization techniques are not supported as of the writing of these lecture notes.
The regularization term can be changed by passing different values to the alpha parameter of the
MLPRegressor.
[2]: X, y = datasets.load_diabetes(return_X_y=True)
X_scaled=StandardScaler().fit_transform(X=X)
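The fitting cell itself does not appear here; a call along the following lines, with an illustrative alpha value, is what the example describes (the alpha value and the reuse of X_train and y_train from the earlier split are assumptions).

model = MLPRegressor(max_iter=1000, hidden_layer_sizes=(50, 100), alpha=0.01).fit(X_train, y_train)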
E:\Anaconda\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:571:
ConvergenceWarning: Stochastic Optimizer: Maximum iterations (1000) reached and the optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)
[5]: 0.4552609095081024
[6]: array([151.56398853])
5.5.1 Clustering
Clustering is the process of grouping elements of the data set and assigning classes to them based on a clustering function. The classes are not assigned to the elements in advance. The requirement on clustering algorithms is that, for a given cluster, all elements must be more similar to each other in some way than to the elements of other clusters. Clustering is not a single algorithm, but rather a collection of methods which are capable of calculating similarity and assigning classes to similar elements.
Clustering algorithms often use the density of the features in order to determine the different output classes. Another way of calculating similarity is using distance functions. Distance in this context can be any measurement that quantifies how similar two elements are.
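As an illustration (not part of the original material), the distance-based k-means method from Scikit-Learn assigns each row of a data set to one of a chosen number of clusters; the iris data set and the cluster count of three are assumptions.

from sklearn import datasets
from sklearn.cluster import KMeans

X, _ = datasets.load_iris(return_X_y=True)
labels = KMeans(n_clusters=3).fit_predict(X)   # cluster index (0, 1 or 2) for every row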
Dendrogram
Dendrograms are tree representations which are often used in hierarchical clustering. A dendrogram can visualize how the classes are formed during hierarchical clustering.
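A dendrogram can be drawn with SciPy's hierarchical clustering utilities; the sketch below is illustrative, and the Ward linkage method and the reuse of the features X from the previous example are assumptions.

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

Z = linkage(X, method="ward")   # build the hierarchical clustering linkage matrix
dendrogram(Z)                   # draw the tree of cluster merges
plt.show()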
Question 93.
What are the two main categories of machine learning algorithms?
Question 94.
Can machine learning algorithms make intelligent decisions?
Question 96.
What are the two main requirements of modern machine learning algorithms?
Question 97.
What is the goal of supervised learning?
Question 98.
What is the difference between supervised learning and unsupervised learning algorithms?
Question 99.
What is the biological equivalent of supervised learning?
Question 100.
What are the requirements of data for a supervised learning problem?
Question 101.
How does a machine learning algorithm quantify error?
Question 102.
What is the process of learning in a machine learning algorithm?
Question 103.
What is the loss value if the original function, and the approximated function are identical?
Question 104.
Can you easily explain the processes behind a complex machine learning system?
Question 105.
Why do you need to split the data before learning with machine learning algorithms?
Question 106.
What is the purpose of the training, testing and validation sets?
Question 107.
Why does the data have to be shuffled before splitting?
Question 109.
Why is the splitting size important for the machine learning algorithm?
Question 110.
What is linear in the linear regression algorithm?
Question 111.
What kinds of problems can the linear regression problem solve?
Question 112.
What are the most common loss functions used for regression problems?
Question 113.
How can you tell that the linear regression algorithm is not a good model for the data?
Question 114.
What category of problems can the logistic regression solve?
Question 115.
What is the logistic function?
Question 116.
What variables are trained during logistic regression training?
Question 117.
What type of logistic regression has two types of target values?
Question 118.
What is the difference between a perceptron and an artificial neuron?
Question 119.
How does the perceptron calculate its output?
Question 120.
What kinds of activation functions can be used in neurons?
Question 121.
How are artificial neural networks built?
Question 122.
What is a neural network layer?
Question 123.
What types of layers are there in multilayer neural networks?
Question 124.
How many neurons can a layer contain?
Question 125.
How many hidden layers can a neural network contain?
Question 126.
How are weights calculated from neural network prediction loss?
Question 127.
What types of gradient descent are there?
Question 128.
What is the learning rate of the gradient descent algorithm?
Question 129.
What types of learning rate are there?
Question 130.
Which class can be used to create a multilayer artificial neural network in Scikit-Learn's library?
Question 131.
What are the dangers of overfitting?
Question 132.
What techniques can you use to avoid overfitting?
Bibliography
[1] Martín Abadi et al. “Tensorflow: A system for large-scale machine learning”. In: 12th
{USENIX} symposium on operating systems design and implementation ({OSDI} 16). 2016, pp. 265-
283.
[2] Hervé Abdi and Lynne J Williams. "Principal component analysis". In: Wiley Interdisciplinary Reviews: Computational Statistics 2.4 (2010), pp. 433-459.
[3] Horace B Barlow. “Unsupervised learning”. In: Neural computation 1.3 (1989), pp. 295-311.
[4] Rich Caruana and Alexandru Niculescu-Mizil. “An empirical comparison of supervised
learning algorithms”. In: Proceedings of the 23rd international conference on Machine learning.
2006, pp. 161-168.
[5] Francois Chollet. Deep Learning with Python. 2017.
[6] Gene H Golub and Christian Reinsch. “Singular value decomposition and least squares
solutions”. In: Linear Algebra. Springer, 1971, pp. 134-151.
[7] Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier,
2011.
[8] David J Hand and Niall M Adams. “Data Mining”. In: Wiley StatsRef: Statistics Reference
Online (2014), pp. 1-7.
[9] Robert Hecht-Nielsen. "Theory of the backpropagation neural network". In: Neural Networks for Perception. Elsevier, 1992, pp. 65-93.
[10] Surya R Kalidindi et al. “Application of data science tools to quantify and distinguish
between structures and models in molecular dynamics datasets”. In: Nanotechnology 26.34 (2015),
p. 344006.
[11] Huan Liu and Hiroshi Motoda. Feature extraction, construction and selection: A data mining
perspective. Vol. 453. Springer Science & Business Media, 1998.
[12] Wes McKinney et al. "pandas: a foundational Python library for data analysis and statistics". In: Python for High Performance and Scientific Computing 14.9 (2011).
[13] Ben Noble, James W Daniel, et al. Applied linear algebra. Vol. 3. Prentice-Hall Englewood
Cliffs, NJ, 1977.
[14] Travis E Oliphant. A guide to NumPy. Vol. 1. Trelgol Publishing USA, 2006.
[15] Sankar K Pal and Sushmita Mitra. “Multilayer perceptron, fuzzy sets, classification”. In:
(1992).
[16] Fabian Pedregosa et al. "Scikit-learn: Machine learning in Python". In: The Journal of Machine Learning Research 12 (2011), pp. 2825-2830.
[17] George AF Seber and Alan J Lee. Linear regression analysis. Vol. 329. John Wiley & Sons,
2012.
[18] Donald F Specht et al. “A general regression neural network”. In: IEEE transactions on
neural networks 2.6 (1991), pp. 568-576.
[19] Sandro Tosi. Matplotlib for Python developers. Packt Publishing Ltd, 2009.
[20] Lloyd N Trefethen and David Bau III. Numerical linear algebra. Vol. 50. Siam, 1997.
[21] Wil Van Der Aalst. "Data science in action". In: Process Mining. Springer, 2016, pp. 3-23.
[22] Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media, Inc., 2016.
[23] Svante Wold, Kim Esbensen, and Paul Geladi. "Principal component analysis". In: Chemometrics and Intelligent Laboratory Systems 2.1-3 (1987), pp. 37-52.
[24] Raymond E Wright. “Logistic regression.” In: (1995).