Applied Data Analysis

Bence Bogdandy
August 5, 2020
Contents

1 Data Science
1.1 What is Data Science
1.2 Tasks and Challenges
1.3 Application Areas
1.3.1 Engineering
1.3.2 Finance
1.3.3 Research
1.3.4 Other Fields
1.4 Participants
1.4.1 Mathematicians
1.4.2 Computer Scientists
1.4.3 Domain Experts
1.5 Technical Report

2 Data Processing Tools
2.1 Microsoft Excel
2.1.1 Myths and Misconceptions
2.1.2 Limitations
2.2 Matlab and R
2.2.1 Matlab
2.2.2 R
2.3 Python
2.4 Package Management
2.4.1 pip
2.4.2 Anaconda
2.5 Development Platforms for Python
2.5.1 PyCharm
2.5.2 Jupyter Notebook
2.6 Scientific Packages
2.6.1 numpy
2.6.2 matplotlib
2.6.3 pandas
2.6.4 seaborn
2.6.5 scikit-learn
2.6.6 tensorflow
2.6.7 Keras

3 Descriptive Statistics
3.1 Randomness
3.2 Statistical Data Types
3.2.1 Categorical
3.2.2 Numeric
3.3 Statistical Measurements
3.3.1 Quantile
3.3.2 Variance
3.3.3 Standard Deviation
3.3.4 Covariance
3.3.5 Correlation
3.3.6 Skewness
3.3.7 Kurtosis
3.4 Distributions
3.4.1 Bernoulli Distribution
3.4.2 Uniform Distribution
3.4.3 Binomial Distribution
3.4.4 Gaussian Distribution
3.4.5 Poisson Distribution
3.4.6 Exponential Distribution

4 Data Mining
4.1 Data Structures
4.1.1 Data set
4.2 Data Mining
4.3 Data Mining Process
4.3.1 Problem definition
4.3.2 Data Selection
4.3.3 Data Loading
4.3.4 Principal Component Analysis
4.3.5 Singular Value Decomposition
4.3.6 Feature Engineering
4.3.7 Feature Crosses

5 Machine Learning
5.1 Supervised Learning
5.2 Training and Loss
5.2.1 Train-Test split
5.2.2 Linear Regression
5.2.3 Logistic Regression
5.3 Training Process
5.3.1 Perceptron
5.3.2 Neural Networks
5.4 Model Optimization Methods
5.4.1 Optimization Process
5.4.2 Gradient Descent
5.4.3 Prediction
5.4.4 Overfitting
5.5 Unsupervised Learning
5.5.1 Clustering
5.5.2 Hierarchical Clustering
5.5.3 Centroid-based Clustering
5.5.4 Density-based clustering
Chapter 1

Data Science

The acquisition of data and information has always been a very important concept in human civilization. The collection, transformation and manipulation of accumulated data enables the brain to make complex decisions. Although these processes are acquired naturally, the brain is only capable of processing small amounts of data at a time.

Machines have always been used to help humans and increase productivity. Computers were invented in order to solve calculations effectively. Data science is a modern extension of the human thinking process. The latent potential of massive-scale data collection was quickly realized by companies and researchers. Since the beginning of the information age, enormous amounts of data have been accumulated.

With the help of data science, hidden truths and information can be uncovered within this vast array of data.

1.1 What is Data Science


Data Science [21, 22] is a scientific field which contains the processes used to collect, transform, manipulate and filter data in order to extract relevant information. Data Science is an interdisciplinary field, containing elements of mathematical statistics and computer science, along with minor elements from other fields.

The process of gathering information is only partly contained in the field of data science. These processes change as technology evolves, and are mostly a by-product of various modern technologies. After the information has been obtained, it is usually formatted and visualized. New knowledge and information can be derived from the newly formatted data set, which can then be used for further analysis or to train machine learning models.

The abstract processes described above are made up of descriptions, mathematical formulae and algorithms.


1.2 Tasks and Challenges


Data Science aims to create easy-to-understand models which can extract and describe certain attributes of a data set. Building such models often requires the transformation of data. This procedure is called pre-processing. This step is not negligible, as the efficiency of later model-building steps and algorithms often depends on the quality of pre-processing.
The process of analyzing and transforming data often involves the scientific process of trial and error. This means that a model most often requires many attempts before it performs the task well and can be finalized. The reason for these attempts is that the steps required for a well-built model are often hard to spot, or otherwise unclear.
Data visualization is an important domain of data science. The resulting models should come with easy-to-read descriptive statistics and visualizations.

1.3 Application Areas


Data has always played a very important role in human life. Primitive data processes have always been the basis on which we acquire, process, and share information. Applying collected data and knowledge has recently become a separate field of science [iqbal2020big]. Data science is the collective field that encompasses measurement techniques, data storage, data transformation techniques, data filtering, and machine learning, among others. Classic data science usually involved statisticians, economists, and scientists from other fields of applied mathematics.
Modern applications usually involve computer science in order to provide the needed computing power. Computers provide an essential and effective tool to process very large scale data. The reason why data science applications were primitive in the past was the lack of computing power. As computers became more powerful over the years and gained more storage to hold data, data processing has been recognized as one of the most prominent and important parts of modern science. Numerous software packages and programming frameworks have been created by computer scientists to make data processing not only more effective, but also easier to do. These applications revolve around the processing of enormous databases, or the continuous processing of incoming data. Computers can also be used to create very complex models and algorithms, which would otherwise be incalculable for humans. Such cases can be found in both modern industry and numerous other fields of science.

1.3.1 Engineering
Modern production generates large amounts of data besides the products themselves. This surplus of data can be processed in order to extract useful information related to the production. This knowledge is used to optimize the manufacturing process.
Certain attributes of a specific production system can influence the resulting product. These attributes can be used to create recommendations for engineers. Engineers use these recommendations to fine-tune the manufacturing process, in order to create specific variations in the outcome.
A new field of machine learning called generative modelling has recently gained ground as one of the most promising fields of machine learning engineering. Generative models take an input and try to produce a slightly different output. For example, given 3-dimensional data of a tool or part, the design could be optimized by removing unnecessary structural elements.

1.3.2 Finance
The finance industry was one of the first to start collecting large-scale data sets, which form the backbone of most modern-day financial decisions. Of course, in the finance industry the end goal is always the biggest monetary gain. Therefore, finance has always been one of the biggest supporters, and job providers, of data science. Data is valuable in the financial world; it provides an opportunity to improve one's financial position on the market. Early data scientists were mostly statisticians, or financial mathematicians, who specialized in analyzing and visualizing data or predicting certain outcomes.
Modern banking systems employ a large number of data scientists who are capable of creating complex models for outlier detection and intelligent predictions. Recently, data science applications have become so adept at managing huge amounts of data and providing reliable information that they have taken over a huge number of human responsibilities. This is, of course, a huge boon for the financial world, as models are capable of providing recommendations and information on data that could not have been analyzed before.

1.3.3 Research
The process of observation is a data collection method in itself. The process of capturing this information has always been an important aspect of science. The methods of saving these data have differed throughout history. A few decades ago, writing down the results of certain experiments was the only adequate way to successfully capture results and natural phenomena.
In modern times, technology is much more adept at capturing large amounts of data seamlessly than human senses and memory. Many modern scientific fields have evolved in the past 10 years to more closely incorporate data processing tools. Modern research often relies on the modelling of huge data sets. In research, the application of the captured data can differ from field to field, but the method of processing hardly changes. In most cases, data is used to either prove, or disprove, given hypotheses using statistical methods or machine learning. Medical science has particularly benefited from data science applications. Modern machine learning applications can surpass humans in disease detection, and can provide recommendations for medical doctors. These kinds of technologies increase public welfare, and decrease the stress and sources of error for medical practitioners.

1.3.4 Other Fields


Data is observed, and captured, in almost all aspects of our lives. Using any technological device almost certainly records and analyzes our behaviour and makes intelligent decisions based on it. This constant stream of data can be distracting, and hard to ignore at times. Nevertheless, collected data can be used to improve the security of modern systems, or society in general. Collected data can also be used by companies for monetary purposes. Data is used for personalized advertisements, which recommend to the user items to buy, shows to watch, and many other things.
Captured data can also be used to create applications which serve as a companion to everyday life. The cell phone has evolved into a type of technological companion that somebody from the last century would not have even dreamed of. It is capable of measuring numerous input data in order to streamline the user's experience. These applications can be trained to remember, and suggest, actions based on past behaviors, mirroring intelligence.

1.4 Participants
As data science is an interdisciplinary field, a solution to a data science problem is rarely produced by a single contributor. Multiple different roles are set up in order to effectively find the solution to a specific problem.
In an industrial environment, there is usually a dedicated research and development team that performs these tasks. The particular roles and participants of a data science group are discussed in this section. In this lecture note we will focus on the job of the computer scientist, or data scientist.

1.4.1 Mathematicians
Mathematicians are utilized in order to analyze the data. This analysis reveals underlying attributes
and details. This newly revealed knowledge can be used in order to create mathematical models
which fit the particular problem.
Determining the correct model can also be the mathematician’s role. The model to be used
usually requires specific modifications to the data set. Determining these transformations can
effectively create processable data, as well as models.

1.4.2 Computer Scientists


Computer scientists construct programs and apply prior knowledge revealed by the mathematicians. Computer scientists have slowly evolved into a special kind of expert, combining statistics, computer science and specialized data knowledge. This new field is called data science, and its participants data scientists. Raw collected data is usually poorly formatted, erroneous and requires work in order to be used as a data set. The job of a data scientist starts with formatting and processing data into feature sets. Features represent formatted data that can be easily read and processed by computers and modern data science applications. Pre-processing usually consists of transformations which modify the data structure or values into formats that the computer can quickly process. These modifications are often dictated by the required target data model. Another pre-processing goal is to transform the data into a more cohesive and understandable format.
After the data has been formed into a cohesive data set, a set of statistical attributes is measured. These attributes are used for the analysis of the data set, and the identification of potential errors. After the iterative correction of a data set, data mining or machine learning models can be built.
Prototypes are built to test and measure the efficiency of the algorithms.

1.4.3 Domain Experts


Mathematicians and computer scientists often only understand their own domain. Creating applications for other fields can be problematic, as the correctness of a mathematical formula or computer algorithm can't be checked easily. Domain experts have the knowledge required to build consistent and intelligent systems in their own fields.

As the created models and systems often require knowledge of certain fields, domain experts often evaluate the inner workings of the algorithms. Without a way to check whether or not an algorithm or system is working correctly, the application cannot be proven to be correct.

1.5 Technical Report


Technical reports exist in most fields of science. In most fields, a technical report contains practical,
and applicable research details and use cases. For data science, a technical report is a paper which
contains the practical and detailed descriptions of certain models and data sets.
These reports usually include the goals of the study, and the analyzed data set. Technical details may
include:
• Detailed explanation, and description of the data set.
• Pre-processing steps, and effects.
• Iterative process steps of data transformation.
• Description of the used model.
• The explanation of consequences the model holds.
• Discussion.
Technical reports differ between industry and the scientific community. Scientific papers are usually focused on scientific novelty and proof-of-concept systems. Industrial documentation, however, is usually a detailed technical overview of systems that describes the novel application of existing theories and systems.

Question 1.
What fields of science contributed to the creation of modern data science?

Question 2.
When was modern data science created?

Question 3.
What kind of tasks does a data scientist have?

Question 4.
In which industries is modern data science incorporated into the development process?

Question 5.
How does data science increase the productivity of engineering systems?

Question 6.
Why are mathematicians an essential part of the data science process?

Question 7.
What essential participants and roles does data science have?

Question 8.
Which field of applied science helped a specific part of computer science evolve into data science?

Question 9.
Who is the domain expert in the data science process, and what responsibilities does the job
have?

Question 10.
Why is it important to create a technical report of the data science process?

Question 11.
What do you think are some types of technical reports used in industry, economics or science?
Chapter 2

Data Processing Tools

Most software for analyzing and processing data was already in development before the data science boom of the 2010s. Although these types of software are still relevant today, a great number of different tools have been developed since [10, 22]. These tools are designed to be used on very large data sets that older software tools might not be able to handle.
Nevertheless, there are a number of different software packages which can be used to effectively analyze and process large amounts of data and build intelligent models.

2.1 Microsoft Excel


Microsoft Excel is a spreadsheet software, which has enjoyed industry standardization throughout the
years. Excel is part of the Microsoft Office family of software, which is targeted to be a package of
useful, everyday software tools.

2.1.1 Myths and Misconceptions


As with most software developed by Microsoft, Excel is an excellent all-round tool, but it is very limited in complex data processing.
It is perfect for calculating statistical values of simple two-dimensional data sets. Since its industrial standardization, it has also become standard knowledge, which makes it highly accessible for many people outside of informatics and computer science.
Excel supports custom routines and functions written in the Visual Basic programming language.

2.1.2 Limitations
While it is a perfectly viable option for tackling easy-to-handle, well-formatted and generally accessible data sets, the limitations of the tool start to show as the complexity increases.
Excel does support the import of CSV data sets, which is usually the format of choice in data science. However, depending on the running hardware, Excel might be unable to load larger datasets, crashing in the process.


Complex data analysis and pre-processing algorithms are supported through third-party libraries, or by programming them in Visual Basic. However, as Excel is unable to visually represent complex and large data sets, these algorithms might prove to be ineffective, slow and confusing.
In conclusion, Excel is a perfect tool to calculate statistical values on smaller data sets; it provides easy access to simple data transformations, and is capable of converting data to different formats.
However, Excel might be ineffective on more complex data sets and data transformation tasks.

2.2 Matlab and R


Matlab and R were developed for numerical and statistical computing, respectively. Both platforms contain highly technical, yet easy to understand, implementations of algorithms in their respective fields of computing. However, neither Matlab nor R is designed to be used as a general-purpose programming language.

2.2.1 Matlab
Matlab was developed as a combination of a proprietary programming language and environment. The language contains high-level representations of numerical data types, such as arrays and matrices. Operations between these data types are integrated into the language, letting the user focus on the creation of new applications without reinventing the wheel.
Matlab supports 2- and 3-dimensional data visualization with plotting, charts and heat maps, among other tools. Matlab can be easily extended with optional toolboxes, which contain implementations for a general, or specific, field of computing. The programming language is weakly typed, and supports multiple programming paradigms such as structured and object-oriented programming.
The most problematic aspect of Matlab is that its programming environment and language don't have a free option, severely limiting availability.

2.2.2 R
Much like Matlab, R is a programming language and tool set developed for mathematical analysis and computing first. The difference is that R is open source and is part of the GNU Project. R is specifically aimed at statistical computing.
R contains high-level statistical computing capabilities, including the data frame type, which has spread as the data structure of choice since its inception. R contains tools for descriptive statistics, linear and non-linear modelling and basic machine learning. One of the focuses during R's development was its well-realised plotting capability, which can be used to create figures for scientific use.

2.3 Python
The Python programming language, developed by Guido van Rossum, first appeared in 1991. The language began gaining attention after Python 2.0 was released in 2000.

Python is an interpreted scripting language, which is capable of running compiled programs, or even compiling them at runtime with extensions. The language was developed with the mentality of being as easy to understand and straightforward as a programming language can possibly be. It supports different programming paradigms, and is therefore called a multi-paradigm programming language.
Structured programming is supported, as well as the creation of classes and other elements of the object-oriented programming paradigm. Other paradigms are partly supported, namely functional, aspect-oriented, and logic programming.
Python programs are designed to run inside the Python virtual machine. This runtime provides garbage collection and dynamic typing, among other features.
The Python language philosophy is called the Zen of Python.
Example 2.3.1. The following code snippet shows the initialization of variables of various types:

[1]: variable1=12
variable2="Hello!"
variable3=False

Python is dynamically typed; therefore no explicit declaration of the variable’s type is needed. In the
following snippet, all three variables are printed onto the standard output:

[2]: print(variable1, variable2, variable3)

12 Hello! False
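As a small illustrative sketch (not part of the original example), the snippet below hints at the multi-paradigm support described above: a plain function, a small class, and a functional-style list comprehension. The names square and Counter are chosen purely for illustration.

def square(x):
    # structured / procedural style: a plain function
    return x * x

class Counter:
    # object-oriented style: state wrapped in a class
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

squares = [square(n) for n in range(5)]   # functional-style expression
c = Counter()
c.increment()
print(squares, c.value)                   # [0, 1, 4, 9, 16] 1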

2.4 Package Management


Python is capable of dynamically loading multiple different linked software distributions. These
distributions are usually referred to as packages in the python community.
Package managers were developed to be used in conjunction with the Python interpreter. These
software packages contain implementations of data structures, and functions designed to be used
intuitively to solve specific problems.
Python has a wide range of different packages, which can be browsed, and installed with specific
package manager tools, such as pip, or conda.

2.4.1 pip
Pip is a general package and dependency management tool for the Python programming language. Packages can be installed into an interpreter by using the pip install command in the terminal.
If the required package is found, it is downloaded and installed into the currently active Python interpreter.
It should be noted that multiple different interpreters can be created using Python virtual environments. When creating these environments, a separate instance of the interpreter is created from a default version of Python.

python3 -m venv env


Listing 2.1: A Python 3 virtual environment creation.

The code snippet found in Listing 2.1 will create a Python 3 virtual environment called env.
A script named activate is created in the bin directory of the environment. The environment can be activated by running source on this file.
source env/bin/activate
pip install numpy pandas

Listing 2.2: Virtual Environment Activation and package installation

On Listing 2.2, the activation of the newly created virtual environment, and the installation of the
Numpy and Pandas packages can be seen.

2.4.2 Anaconda
Anaconda is a combination of a Python interpreter, a package manager, a handful of selected packages, and other useful tools. The Anaconda distribution is intended for data science and machine learning.

The package manager in this distribution is called conda. This package manager essentially fills the same role as pip, with the added feature of environment management. The two package managers do not use the same repositories to install packages. Therefore, some packages might be unavailable, or have a different version, between the two package managers.
conda create --name myenv
conda activate myenv
conda install numpy pandas
Listing 2.3: Conda Package Manager

The operations in Listing 2.3 accomplish the same goal as running the code snippets in Listing 2.1 and Listing 2.2, but with the conda package manager.
The development packages and tools contained in Anaconda will be explored in the following sections.

2.5 Development Platforms for Python


Python is a unique language in the sense that there are many options regarding development environments. Development in Python can be tackled by creating script files which the interpreter runs uninterrupted. This approach can be used for creating well-defined scripts which can run on computing grids and large servers. However, there is another, more interactive, paradigm of programming which can be used with Python. Interactivity can be a big part of programming in Python, as it uses an interpreter and can therefore recognize incoming commands from the developer at run time. In this section, we will go over the available development platforms for the Python language.

2.5.1 PyCharm
PyCharm is one of the most recognized development platforms for Python. It is recommended by Anaconda as its platform of choice. PyCharm is a professional platform, capable of state-of-the-art code completion, code generation and other tools you might expect from a modern IDE. However, as it contains a lot of different tools and components, it also takes a large overhead in memory and storage space.

PyCharm can provide almost any tool a developer might require for developing comprehensive Python scripts. It provides interactivity through its emulated terminal, and is able to display plots interactively. It is also capable of handling different technologies, such as database tools and web frameworks. Therefore, PyCharm is an excellent tool for developing scripts that are designed to run on a server.

2.5.2 Jupyter Notebook


Jupyter Notebook is a unique development platform which is widely used by data scientists. Notebook was developed as a direct implementation of Donald Knuth's literate programming [knuth1984literate] development style. The developer is capable of creating blocks of code that can be run separately, in any order. The Python interpreter stays active during the development process, and keeps variables and functions in memory. Therefore, functions can be developed intuitively, by observing the outcomes of expressions before assembling them into a complex function.

Literate programming first described an easy-to-use and intuitive programming interface in which program descriptions can be written along with the executable code. Notebook implements this feature by providing text blocks. Modern versions of Jupyter Notebook are capable of interpreting Markdown in text blocks. HTML can be inserted into text blocks in order to display different elements, such as tables and lists, among many others. LaTeX can also be interpreted in text blocks, which is helpful when the developer wants to insert mathematical formulation, or other hard-to-format text.

Jupyter can be run locally, or hosted on a server which can be accessed from remote locations. Notebook interfaces have been implemented by most technology companies, such as Google and Microsoft. These companies provide their own servers on which developers can create notebooks and run their applications. For the aforementioned reasons, notebooks have become the standard development interface for creating data science applications.

However, Python scripts are still an important part of the ecosystem, as they are easy to run on remote servers without supervision.

2.6 Scientific Packages


Python contains a lot of different features which are a good fit for data science. It is highly readable, and interactive thanks to its interpreted nature. It also contains many high-level data structures, and is a capable language for most tasks. For data science, however, it is the packages that create a proper scientific environment which can be used easily.

2.6.1 numpy
Numpy [14] is the most essential scientific computing package for Python. Numpy contains multiple implementations of multi-dimensional arrays and matrices. The implementation of these data structures is highly optimized for efficient computation. Numpy also contains high-level mathematical algorithms. The library contains functions for linear algebra and random number generation, among other subjects. Most of the modern data science focused packages use Numpy as a base for numerical calculation. Numpy provides the basis on which scientific computing exists in Python, and therefore it is essential to learn to use it effectively.
Example 2.6.1. In the following blocks, functionalities of Numpy will be presented through basic exercises. In the first block, numpy is imported as np. A numpy array is stored in the variable arr, which is converted from an ordinary Python list.

[3]: import numpy as np


l= [1, 7, 4, 3, 2]
arr = np.array(l)

[3]: arr

[4]: array([1, 2, 4, 6, 7])

The following presents built-in functions of the numpy array data structure. The available functions are far more numerous than the examples shown.

[5]: arr.cumsum(), arr.cumprod(), np.sort(arr)

The following blocks show basic matrix data structure creation, and basic built-in functionality.

[6]: (array([ 1, 3, 7, 13, 20], dtype=int32),


array([ 1, 2, 8, 48, 336], dtype=int32),
array([1, 2, 4, 6, 7]))

[7]: matrix = np.matrix([
[1,2,3],
[5,4,6],
[9,8,7]
])

[8]: matrix

[9]: matrix([[1, 2, 3],


[5, 4, 6],
[9, 8, 7]])

[10]: matrix.T

[11]: matrix([[1, 5, 9],


[2, 4, 8],

[3, 6, 7]])

[12]: matrix.flatten()

[13]: matrix([[1, 2, 3, 5, 4, 6, 9, 8, 7]])
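The description above also mentions linear algebra and random number generation, which the examples do not cover. The following is a small hedged sketch of those numpy submodules; the seed and the system of equations are chosen only for illustration.

import numpy as np

# random number generation: a reproducible generator and normal samples
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=5)
print(samples)

# linear algebra: solve the system A x = b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(x)   # [2. 3.]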

2.6.2 matplotlib
MatPlotLib [19] is the essential plotting library for Python. Much like Numpy, other packages
build on the given functionality of the library.

MatPlotLib is easy to use, and configure. It can be parametrized for highly customized plots. It is
capable of producing animated, and interactive plots as well.

MatPlotLib was designed to be used for publications, and therefore contains implementations of regularly used plotting and visualization methods.

A very important part of the matplotlib package is the collection of methods called pyplot. Pyplot enables easy function plotting and provides functionality similar to that found in Matlab.

Example 2.6.2. In these examples, we’ll take a quick look at some basic functionality of the
matplotlib library.

We’ll start with importing both numpy, and matplotlib. As mentioned in the previous section,
numpy provides much of the basic functionalities of higher level scientific frameworks and li
braries.

[1]: import matplotlib.pyplot as plt


import numpy as np

[2]: x = np.linspace(1, 10, num=20)

The command x = np.linspace(1, 10, num=20) generates a linear vector between 1 and 10, containing 20 elements. The 20 elements are evenly spaced in the linear space.

The following commands create a line plot, where the x axis represents the indexes, and the y axis represents the values of the contained elements. The library automatically connects the elements, and creates an easy-to-read line plot.

[3]: plt.plot(x)
plt.xlabel('number indexes')
plt.ylabel('number values')
plt.show()

The following two plots achieve the same result, except the plotted functions are logarithmic and exponential.

np.log(x) and np.exp(x) are used to calculate the logarithmic and exponential value of each element, which are then plotted.

[4]: plt.plot(np.log(x))
plt.xlabel('number indexes')
plt.ylabel('logarithmic values')
plt.show()

[5]: plt.plot(np.exp(x))
plt.xlabel('number indexes')
plt.ylabel('exponential values')
plt.show()
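Since the text above notes that plots can be heavily customized, the following hedged sketch combines two curves on one figure with a title, legend and grid; it reuses the x array and the numpy functions from the previous cells, and the labels are illustrative.

plt.plot(x, np.log(x), label='log(x)')     # first curve
plt.plot(x, np.sqrt(x), label='sqrt(x)')   # second curve on the same axes
plt.xlabel('x values')
plt.ylabel('function values')
plt.title('customized plot with a legend')  # illustrative title
plt.legend()
plt.grid(True)
plt.show()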

2.6.3 pandas
While Numpy focuses on the algebraic calculations, Pandas sets its focus on the statistical data
structures and operations.

Pandas [12] uses Numpy as a base to create Series, and DataFrame data structures. A Series
corresponds to a series of data in statistical analysis. The elements contain values of a given type,
and are indexed along an axis. Numerous statistical operations can be called on the series easily, such
as filtering, mathematical operations, and sorting.
A DataFrame represents multiple Series indexed along the same axis. This data structure represents
individual measurements that correspond to each other in some way.

The resulting data structure is a two-dimensional, tabular data set on which complex analysis can be run. DataFrame comes with intuitive, easy-to-use operations which produce descriptive statistics, aggregate data, or transform the data in some way.
Tabular data, such as Excel, or CSV can be imported easily using built-in methods for DataFrame
conversion. Data can be also exported back to Excel, CSV and other formats such as SQL or
JSON.
Example 2.6.3. The following examples show the basic functionality and strengths of the pandas package.
After importing the required packages, a Series is created using the pd.Series() constructor, with a list as a parameter. As you can see, the values are indexed starting from 0. The numpy data type of the series can be seen at the bottom of the output.

[1]: import numpy as np


import pandas as pd

[2]: s = pd.Series([1, 7, 4, 3, 2])


s

[2]: 0 1
1 7
2 4
3 3
4 2
dtype: int64

In the following code blocks, a DataFrame is constructed by using the same matrix from the
numpy examples. Pandas constructs the DataFrame with three columns and three rows. The
names of the columns can be set by passing a list of column names onto the columns parameter.

[3]: df = pd.DataFrame(np.array([
[1,2,3],
[5,4,6],
[9,8,7]
]), columns=["First", "Second", "Third"])
df

[3]: First Second Third


0 1 2 3
1 5 4 6
2 9 8 7

[4]: df.dtypes

The DataFrame contains three Series called First, Second, and Third, which are all of type int32. A DataFrame can contain Series of different types, as expected from tabular data.

[4]: First int32


Second int32
Third int32
dtype: object

The following commands are used most often to examine a given data frame. The df.head() function returns the first few rows of data, which prevents pandas from filling up the screen. df.index and df.columns can be used to examine the row indexes and the columns of the data frame.
[5]: df.head()

[5]: First Second Third


0 1 2 3
1 5 4 6

2 9 8 7

[6]: df.index

[6]: RangeIndex(start=0, stop=3, step=1)

[7]: df.columns

[7]: Index(['First', 'Second', 'Third'], dtype='object')

DataFrames and Series can be converted to numpy data easily, by using the built-in to_numpy()
function.
[8]: df_np=df.to_numpy()
print(df_np)

[[1 2 3]
[5 4 6]
[9 8 7]]
df.T returns the transpose of the tabular data, which swaps the rows for columns in the data
frame.

[9]: df.T

[9]: 0 1 2
First 1 5 9
Second 2 4 8
Third 3 6 7

[10]: df.sort_values(by='Third')

Data manipulation operations can be used easily in Pandas. You can refer to rows and columns either by name, or by position using the df.iloc[] method. Regular Python slicing also works on a DataFrame, therefore df[0:2] will return the first two rows. df["Third"] will return the Series by the name of Third.

[10]: First Second Third


0 1 2 3
1 5 4 6
2 9 8 7

[11]: df["Third"]

[11]: 0 3
1 6
2 7
Name: Third, dtype: int32

[12]: df[0:2]

[12]: First Second Third


0 1 2 3
1 5 4 6

[13]: df[df['First'] > 2]

Data can be filtered using simple logical conditions. For example, df[df['First'] > 2] will only return rows where the First Series contains a number higher than 2.

[13]: First Second Third


1 5 4 6
2 9 8 7

Data frame rows and columns can be extended easily. The df.append() function adds new rows to an existing data frame. The expression df['Fourth'] = [5, 8, 2] will add a new column called Fourth.

[14]: df['Fourth'] = [5, 8, 2]


df

[14]: First Second Third Fourth


0 1 2 3 5
1 5 4 6 8
2 9 8 7 2

Different mathematical operations can be executed easily on data frames by using simple arithmetic. df - 4 will subtract 4 from every element of the data frame.

[15]: df - 4

[15]: First Second Third Fourth


0 -3 -2 -1 1
1 1 0 2 4
2 5 4 3 -2

The following block transforms the data frame by calculating the sine value of each element.

[16]: np.sin(df)

[16]: First Second Third Fourth


0 0.841471 0.909297 0.141120 -0.958924
1 -0.958924 -0.756802 -0.279415 0.989358
2 0.412118 0.989358 0.656987 0.909297

As mentioned before, Pandas is used mainly for its implementation of the tabular data frame, and its statistical capabilities.

The following code blocks contain examples of easy-to-use methods for statistical descriptions of the data.
[17]: df.mean()

[17]: First 5.000000


Second 4.666667
Third 5.333333
Fourth 5.000000
dtype: float64

[18]: df.cumsum()

[18]: First Second Third Fourth


0 1 2 3 5
1 6 6 9 13
2 15 14 16 15

[19]: df.describe()

[19]: First Second Third Fourth


count 3.0 3.000000 3.000000 3.0
mean 5.0 4.666667 5.333333 5.0
std 4.0 3.055050 2.081666 3.0
min 1.0 2.000000 3.000000 2.0
25% 3.0 3.000000 4.500000 3.5
50% 5.0 4.000000 6.000000 5.0
75% 7.0 6.000000 6.500000 6.5
max 9.0 8.000000 7.000000 8.0

The pd.read_csv(’input’) can be used to read, and convert csv data into a pandas DataFrame. The input
can be a local file, or a link to raw csv data on the web.

[21]: iris_df=pd.read_csv('[Link]
↪→6f9306ad21398ea43cba4f7d537619d0e07d5ae3/[Link]')

[22]: iris_df

[22]: sepal.length sepal.width petal.length petal.width variety


0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa

145 6.7 3.0 5.2 2.3 Virginica


146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica

148 6.2 3.4 5.4 2.3 Virginica


149 5.9 3.0 5.1 1.8 Virginica

[150 rows x 5 columns]
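The section above mentions that a DataFrame can also be exported back to CSV, JSON and other formats, which the examples do not show. The following is a minimal hedged sketch, assuming the df frame built earlier; the file names are illustrative.

# export the small DataFrame built earlier, then read it back
df.to_csv('example.csv', index=False)
df.to_json('example.json', orient='records')
roundtrip = pd.read_csv('example.csv')
print(roundtrip.head())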

2.6.4 seaborn

Seaborn uses both Pandas and MatPlotLib in order to visualize statistical metrics of data. It focuses on providing nice-looking and highly informative statistical visualization. Seaborn provides error plots, categorical plots, and many other data set visualization methods. The plotting functions can be called on data frames, with high customisability. The library is capable of automatically adding inferred statistical information to plots in order to produce informative visualization.

Example 2.6.4. In this example, we'll demonstrate a simple use case for the seaborn package. As mentioned above, the package contains functions specifically developed for statistical visualization. Here, we'll use a pair plot to show how the measured parameters differ between the classes.

[1]: import seaborn as sns


import pandas as pd
import numpy as np

[2]: [Link](style="ticks", color_codes=True)

[3]: iris_df=pd.read_csv('[Link]
↪→6f9306ad21398ea43cba4f7d537619d0e07d5ae3/[Link]')

[4]: g = sns.pairplot(iris_df, hue="variety")
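The seaborn description above also mentions categorical plots, which the pair plot example does not cover. Below is a small hedged sketch of a box plot on the same iris_df; the first measurement column is selected by position so that no column name has to be assumed.

numeric_col = iris_df.columns[0]          # first measurement column
sns.boxplot(x="variety", y=numeric_col, data=iris_df)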



2.6.5 scikit-learn
Scikit-Learn [16] provides a wide range of tools for predictive data analysis and machine learning
tasks. Scikit was built on the foundations of Numpy, MatPlotLib, and Pandas.
Scikit-Learn is a collection of algorithms that are essential in data science. These algorithms can be
categorized into 6 groups:
1. Pre-processing
2. Dimensionality Reduction
3. Model Selection
4. Classification
5. Regression
6. Clustering
Each category contains numerous implementations to their respective problems.
Classification and Regression algorithms contain numerous solutions to supervised learning tasks. These include an easy to use and understand implementation of an Artificial Neural Network. While this model can be parametrized with numerous different options, there is currently no option to construct deep neural networks.
Clustering algorithms can be used to determine and produce different classes for non-categorized datasets. This is an unsupervised learning task, and therefore there is no explicit target output.
Model selection can be used to determine which model and parameters to use for a given task. Model selection algorithms include search algorithms which aim to find well-performing models in a hyper-parameter space.
Pre-processing and dimensionality reduction algorithms might be the most widely used categories of algorithms in the library. They provide easy-to-use functions which transform data so that it is more easily processed by machine learning models. These algorithms include various scaling methods, among others, which can be essential for machine learning algorithms. Although deep learning algorithms require these functionalities, the deep learning libraries mostly do not provide them themselves. Therefore, a lot of different deep learning frameworks are used together with scikit-learn for its pre-processing and dimensionality reduction tools.
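Model selection is not demonstrated in the worked example below, so here is a small hedged sketch of scikit-learn's model selection tools: a train/test split and a grid search over the depth of a decision tree. The data set is loaded from scikit-learn itself, and the parameter grid is illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# exhaustive search over a small, illustrative hyper-parameter grid
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [2, 3, 4, 5]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))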
Example 2.6.5. In the examples below, the previously read IRIS data set csv file will be processed.
Multiple different data transformation and machine learning techniques will be used in this small
example, which shows the compactness and ease of use of the library.

[1]: import pandas as pd


import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn import tree
import seaborn as sns

[2]: iris_df=pd.read_csv('[Link]
↪→6f9306ad21398ea43cba4f7d537619d0e07d5ae3/[Link]')

[3]: iris_df.head()

[3]: sepal.length sepal.width petal.length petal.width variety


0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa

The following blocks separate the training data from the target data and display their heads, as described in the previous pandas examples.

[4]: x, y=iris_df.iloc[:,:-1], iris_df.iloc[:,-1:]

[5]: x.head()

[5]: sepal.length sepal.width petal.length petal.width


0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

[6]: y.head()

[6]: variety
0 Setosa
1 Setosa
2 Setosa
3 Setosa
4 Setosa

The following code block creates a correlation matrix of the training data. The pandas corr() function can be used to calculate the correlation between the columns. Seaborn is used to create a figure of the correlation matrix.

[7]: plt.figure(figsize=(12,10))
cor = x.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

The following block uses Scikit's StandardScaler() class to scale the data by removing the mean and scaling to unit variance. This creates a unified scale which better represents the individual values.

[8]: x_scaled=StandardScaler().fit_transform(x)

Principal Component Analysis is used to project the data onto a two dimensional plane. pc1 and pc2 represent the two principal components of the original data set.

[9]: pca = PCA(n_components=2)


principalDf = pd.DataFrame(data = pca.fit_transform(x_scaled), columns = ['pc1', 'pc2'])

[10]: principalDf.head()

[10]: pc1 pc2


0 -2.264703 0.480027
1 -2.080961 -0.674134

2 -2.364229 -0.341908
3 -2.299384 -0.597395
4 -2.389842 0.646835

A LabelEncoder is used in order to encode the original labels of the dataset. The LabelEncoder assigns a unique integer to every class of the original data. y_num will contain the integer values of the classes in a list.

[11]: le = preprocessing.LabelEncoder()
le.fit(y)
le.classes_
y_num = le.transform(y)

The commands in the following block are used to draw the results of the PCA values on a two
dimensional plane. The points are colored by their respective classes.

[12]: fig = plt.figure(figsize = (8,8))


plt.scatter(principalDf.values[:, 0], principalDf.values[:, 1], c=y_num)

[12]: <matplotlib.collections.PathCollection at 0x2cf838efcc8>



A decision tree is constructed using Scikit's DecisionTreeClassifier() class. The classifier is trained on the scaled values and the numeric labels in order to create the structure within the decision tree.

[13]: clf = tree.DecisionTreeClassifier()


clf = clf.fit(x_scaled, y_num)

The tree.plot_tree() function is used to automatically create a visual representation of the trained decision tree.

[14]: fig = plt.figure(dpi=150)


tree.plot_tree(clf)

2.6.6 tensorflow
Tensorflow [1] is a state-of-the-art framework used in modern machine learning and deep learning applications. Tensorflow contains a low-level implementation of the tensor data type, which is a high-dimensional algebraic data type. A tensor denotes algebraic relationships between objects in a vector space. Tensors generalize the everyday data types used by the computer.
1. A 0-dimensional tensor is a scalar, holding only a single value.
2. A 1-dimensional tensor is called a vector, which holds an array of values along a single axis.
3. A 2-dimensional tensor is called a matrix, which holds values along two axes.
Any tensor beyond the second dimension is usually called an "N-th dimensional tensor", where N is the number of dimensions. Tensors are used heavily in graphics computing and data science. Numpy is an apt framework for linear computation, but lacks the nuances of computing with higher-dimensional structures. For this reason, tensorflow was designed to be used where algebraic computing of high-dimensional structures is required.
Specifically, tensorflow was created in order to progress the state of neural network computing in Python, and other languages.
Example 2.6.6. In the following blocks, examples of tensors with different dimensions, and
different operations will be shown.

[1]: import tensorflow as tf


import numpy as np

In order to create constants of various types, tensorflow provides the tf.constant() constructor. Unlike the previous frameworks, it is essential to explicitly state what the data type is.

Variables can also be created, using tf.Variable(). The difference between variables and constants is that constants cannot be assigned a new value after instantiation, while variables can be assigned a new value using the assign() method.
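A minimal sketch of the variable behaviour described above (the values are chosen only for illustration):

v = tf.Variable([1.0, 2.0, 3.0])
v.assign([4.0, 5.0, 6.0])        # replace the stored values in place
v.assign_add([1.0, 1.0, 1.0])    # element-wise increment
print(v.numpy())                 # [5. 6. 7.]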
In the following blocks, tensors of different shapes can be examined.

[2]: dim_0 = tf.constant(5)


print(dim_0)

tf.Tensor(5, shape=(), dtype=int32)

[3]: dim_1 = tf.constant([1, 7, 4, 3, 2])


print(dim_1)

tf.Tensor([1 7 4 3 2], shape=(5,), dtype=int32)

[4]: dim_2 = tf.constant([[1,2,3],


[5,4,6],
[9,8,7]],
dtype=tf.float16)
print(dim_2)

tf.Tensor(
[[1. 2. 3.]
[5. 4. 6.]
[9. 8. 7.]], shape=(3, 3), dtype=float16)

[5]: dim_3 = tf.constant([


[[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]],
[[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]],
[[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29]],])

print(dim_3)

tf.Tensor(
[[[ 0 1 2 3 4]
[ 5 6 7 8 9]]

[[10 11 12 13 14]
[15 16 17 18 19]]

[[20 21 22 23 24]
[25 26 27 28 29]]], shape=(3, 2, 5), dtype=int32)
To cast a tensor into a numpy array, you can use the .numpy() function of the tensor.

[6]: dim_2.numpy()

[6]: array([[1., 2., 3.],


[5., 4., 6.],
[9., 8., 7.]], dtype=float16)

Tensors can be combined easily with the built-in operators of tensorflow. These implementations provide highly optimized, parallel implementations of common mathematical operations and algorithms.
Mathematical operations can also be accessed with common mathematical notation.

[7]: a = tf.constant([[1,2,3],
[5,4,6],
[9,8,7]],
dtype=tf.float16)
b = tf.ones([3,3], dtype=tf.float16)

print(tf.add(a, b), "\n")


print(tf.multiply(a, b), "\n")
print(tf.matmul(a, b), "\n")

tf.Tensor(
[[ 2. 3. 4.]
[ 6. 5. 7.]
[10. 9. 8.]], shape=(3, 3), dtype=float16)

tf.Tensor(
[[1. 2. 3.]
[5. 4. 6.]
[9. 8. 7.]], shape=(3, 3), dtype=float16)

tf.Tensor(
[[ 6. 6. 6.]
[15. 15. 15.]
[24. 24. 24.]], shape=(3, 3), dtype=float16)

[8]: print(a + b, "\n")


print(a * b, "\n")
print(a @ b, "\n")

tf.Tensor(
[[ 2. 3. 4.]
[ 6. 5. 7.]
[10. 9. 8.]], shape=(3, 3), dtype=float16)

tf.Tensor(
[[1. 2. 3.]
[5. 4. 6.]
[9. 8. 7.]], shape=(3, 3), dtype=float16)

tf.Tensor(
[[ 6. 6. 6.]
[15. 15. 15.]
[24. 24. 24.]], shape=(3, 3), dtype=float16)

The following block contains various operations executed on a matrix. The reduce functions can be used to calculate their respective aggregate values, in this example the maximum of the matrix. reduce_sum() can be used to add elements of the matrix along different axes.
The tf.argmax() function returns the index of the largest element in each column. Different arg functions can be used to search for elements and view the data along different axes.
Tensorflow's nn module contains functions and algorithms specifically used for neural networks. tf.nn.softmax() calculates the softmax function on the input data. This function is used regularly in deep learning.

[9]: print(tf.reduce_max(a))
print(tf.argmax(a))
print(tf.nn.softmax(a))

tf.Tensor(9.0, shape=(), dtype=float16)


tf.Tensor([2 2 2], shape=(3,), dtype=int64)
tf.Tensor(
[[0.09 0.2446 0.665 ]
[0.2446 0.09 0.665 ]
[0.665 0.2446 0.09 ]], shape=(3, 3), dtype=float16)

2.6.7 Keras
Keras [5] contains tools for implementing state-of-the-art deep learning systems. Keras is a general
implementation of modern deep learning tools, which can use multiple back-end frameworks.
Keras can use the following backends:
1. Tensorflow,
2. Theano, a symbolic tensor framework,
3. CNTK, an open-source deep learning toolkit developed by Microsoft.
Keras remains consistent after changing the backend of the framework.
Keras provides an easy-to-use, but highly flexible implementation of modern deep learning
algorithms. However, it does not contain general machine learning or artificial intelligence
algorithms that might be required during the building of a deep learning model. Therefore, it is
advised to use the previously mentioned frameworks in conjunction in order to create a robust,
and flexible system.
Keras contains different implementations, and adheres to different coding styles, for different levels of complexity.
In this lecture note, we will review the Keras Sequential and Layer APIs.
Example 2.6.7. The following example will run through a basic example of a simple neural
network learning the MNIST handwritten digit data.

Tensorflow contains toy datasets to test and measure the performance of machine learning models. The MNIST dataset can be loaded with tf.keras.datasets.mnist.load_data().

The data is split into training and testing data with their respective target values (digits). Elements are divided by 255 because the data originally consists of monochrome pixel values. This transformation brings the values into a range which is easier for neural networks to learn.

[1]: import tensorflow as tf

[2]: mnist = tf.keras.datasets.mnist


(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

One of the easiest ways to construct neural network models is to use the Sequential API. Different layers can be concatenated by passing a list of layers into the sequential model.

1. The Flatten layer creates a 1-dimensional vector from an input tensor of any higher dimension by appending its rows into a 1D tensor.

2. The Dense layer is the most basic representation of a fully connected feed-forward neural network layer. The first integer parameter denotes the number of neurons that particular layer consists of.

[3]: model = tf.keras.models.Sequential([

tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation='sigmoid'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(10, activation='softmax')
])

The following code block calculates the probabilities of what the neural network thinks the first number in the data set is. A higher number represents a higher probability. As can be clearly seen, the neural network cannot make informed decisions before training.

[4]: predictions = model(x_train[:1]).numpy()


predictions

[4]: array([[0.09198333, 0.18876909, 0.0581247 , 0.0756225 , 0.11510804,


0.05142717, 0.06212557, 0.16737589, 0.11147986, 0.07798386]],
dtype=float32)

After compiling the model with a neural network optimizer and a loss function, the neural network can be trained using the fit() function. The number of epochs can be set in order to determine how many iterations the model trains for. After training, history data is returned, which can be used to visualize the training results.

[5]: model.compile(optimizer='sgd',
loss=tf.keras.losses.SparseCategoricalCrossentropy(),
metrics=['accuracy'])

[6]: history = model.fit(x_train, y_train, epochs=5)

Epoch 1/5
1875/1875 [==============================] - 1s 771us/step - loss: 1.3349 -
accuracy: 0.6718
Epoch 2/5
1875/1875 [==============================] - 2s 821us/step - loss: 0.5247 -
accuracy: 0.8602
Epoch 3/5
1875/1875 [==============================] - 1s 769us/step - loss: 0.3999 -
accuracy: 0.8860
Epoch 4/5
1875/1875 [==============================] - 1s 790us/step - loss: 0.3561 -
accuracy: 0.8976
Epoch 5/5
1875/1875 [==============================] - 1s 766us/step - loss: 0.3318 -
accuracy: 0.9040

The evaluate function can be used to calculate the loss and accuracy on the testing data. Separate sets of data are required in order to test the generalization ability of the neural network. If the trained model is capable of performing well on unknown data, then it will yield a higher accuracy.

[7]: model.evaluate(x_test, y_test, verbose=2)

313/313 - 0s - loss: 0.3135 - accuracy: 0.9084

[7]: [0.31353357434272766, 0.9083999991416931]
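To contrast with the untrained predictions shown earlier, the following hedged sketch queries the trained model on a few test images; the slice of five elements is illustrative.

probs = model(x_test[:5]).numpy()             # class probabilities
print(tf.argmax(probs, axis=1).numpy())       # predicted digits
print(y_test[:5])                             # true digits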

The following code blocks will visualize the accuracy and loss achieved after every epoch of training.

[8]: from matplotlib import pyplot as plt

[9]: plt.plot(history.history['accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

[10]: plt.plot(history.history['loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')

plt.legend(['train'], loc='upper left')


plt.show()

The methods, and algorithms in the examples will be partly discussed in the following chapters.

Question 12.
Why is the choice of tools important in the data science process?

Question 13.
What are the strengths of Microsoft Excel, and what kind of problems would you solve with it?

Question 14.
What are the weaknesses of Microsoft Excel?

Question 15.
Which programming language can be used to create complex data processing routines for Excel?

Question 16.
Which fields of science were Matlab and R designed for?

Question 17.
What are the main benefits of using Matlab instead of Excel?

Question 18.
Which software is more fitting for complex data analysis tasks?

Question 19.
What does Matlab provide in order to make complex data analysis tasks easier?

Question 20.
What are the main differences between Matlab and R?

Question 21.
What is the main focus of the R programming language?

Question 22.
What are the benefits of R in terms of accessibility?

Question 23.
What was one focus of R that provides a benefit for scientific developers?

Question 24.
Is the python programming language a compiled or an interpreted language?

Question 25.
What kind of programming paradigms are supported by python?

Question 26.
What is the name of the environment that runs python programs?

Question 27.
What kind of typing does Python use?

Question 28.
What are the python package managers called?

Question 29.
What are the main differences between the pip and the anaconda environments?

Question 30.
In what task does the PyCharm development platform perform better than Jupyter Notebook?

Question 31.
What is the main benefit of writing scripts instead of notebooks?

Question 32.
What development style does the Jupyter Notebook implement?

Question 33.
How does Jupyter support readability in its notebooks?

Question 34.
What are the benefits of hosting jupyter notebook servers on computing clusters?

Question 35.
Why do you think most data science developers choose python instead of the competition?

Question 36.
Why are packages important for the python data science workflow?

Question 37.
What are the two packages that can be used to replace Matlab and R languages?

Question 38.
Which one of the python packages provides the mathematical basis for other packages and scientific
computing in general?

Question 39.
Which package can be used for professional, scientific figure generation in python?

Question 40.
Which package serves as a general machine learning framework that can be used by other packages
as well?

Question 41.
What is the difference between Scikit-Learn and the Keras package? Which one would be better for
a deep learning task?
Chapter 3

Descriptive Statistics

Mathematical statistics is one of the most prominent and heavily used applied fields of mathematics.
Statistics applies probability theory in order to process and filter data and to predict outcomes.

3.1 Randomness
Descriptive statistics is one of the sub-fields of statistics. This sub-field is dedicated to describing
and summarizing the properties and attributes of data sets. Descriptive statistics is one of the
fields of science that data science heavily incorporates. Most methods of descriptive statistics are
implemented as parts of computer software, or as framework methods. In this chapter, various
python scientific frameworks will be used to implement and showcase these algorithms.

Statistical Models

A statistical model can be described as a random experiment together with its sample space Ω. The
sample space Ω is the set of all possible outcomes of the experiment.
A random variable can be explained as a mapping, or a function, which does not have an exact
outcome. Instead, a random variable maps the input onto a set of possible occurrences that might
happen with that input. When used in conjunction with a probability P(), it describes what the
actual probability of that outcome is. In short, the actual outcome of the random variable depends
on randomness.
The random variable X can also be described with a random vector of occurrences X = (X1, X2, ..., XN).
The random variable X maps an outcome ω of the sample space Ω onto a numeric space. The
randomness of the variable is actually the random nature of the experiment. At the moment the
experiment is performed, the variable's value is decided. The real value of this random
experiment's outcome is the mapping itself.
The following formula describes the nature of random variables:

X : Ω → E, where Ω is the set of possible outcomes and E is a measurable space


The formula describes that the random variable takes every possible outcome in the set Ω and maps
it onto a measurable scale.
The probability of X taking a value in the measurable set S is described in the following formula:

P(X ∈ S) = P({ω ∈ Ω | X(ω) ∈ S})


Where S is the measurable set and ω is a possible outcome.
There are two categories of random variables: discrete random variables and continuous random
variables. Each of the two categories corresponds to a type of probability distribution. A discrete
random variable means that the output of the mapping X : Ω → E is discrete, which means that
the mapping assigns a natural number to each input. Contrary to this, for continuous random
variables the output of this mapping is uncountably infinite, meaning that for every possible input
a different continuous value can be assigned.
Random variables can be described easily with cumulative distribution functions. The following
formula gives the formal definition of the cumulative distribution function of the random variable
X, with input value x:

F_X(x) = P(X ≤ x)

The right side of this equation represents the probability that the random variable takes on a
value that is less than or equal to x. The probability of X falling into the semi-closed interval
(i, j], where i < j, is

P(i < X ≤ j) = F_X(j) − F_X(i)

Example 3.1.1. For example, take a random variable X that takes on discrete values between 0
and 5 with different probabilities.
The cumulative distribution function of X could be written as the following:

F_X(x) =
    0       if x ≤ 0
    1/5     if 0 < x ≤ 1
    2/5     if 1 < x ≤ 2
    4/5     if 2 < x ≤ 3
    9/10    if 3 < x ≤ 4
    1       if 4 < x ≤ 5

This cumulative distribution describes how probability accumulates across the range of our random
variable. Do not forget that the probability associated with an input represents the probability
that our random variable will be less than or equal to that input.
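As an illustration, the piecewise function from Example 3.1.1 can be written as a short Python function. This is a minimal sketch (the function name is made up for this example); it also checks the identity P(i < X ≤ j) = F_X(j) − F_X(i) for i = 2 and j = 4.

def cdf_example(x):
    # Breakpoints and cumulative probabilities taken from Example 3.1.1.
    if x <= 0:
        return 0.0
    if x <= 1:
        return 1 / 5
    if x <= 2:
        return 2 / 5
    if x <= 3:
        return 4 / 5
    if x <= 4:
        return 9 / 10
    return 1.0

# P(2 < X <= 4) = F_X(4) - F_X(2) = 9/10 - 2/5 = 0.5
print(cdf_example(4) - cdf_example(2))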

3.2 Statistical Data Types


Different data types need to be clearly distinguished in order to be used effectively. There are a
number of different mathematical definitions for data types. As the field of data science began as a
sub-field of statistics, statistical definitions are used most widely in data science.

Statistical data types are constructed with the intent to create statistical measurements of certain
attributes of the data. For this reason, not every data type fits the prerequisites constructed by
this definition. Each data type is constructed with measurements, and specialised attributes in
mind. Therefore, it is essential that the reader understands the meaning and background of
each thoroughly.

Four different levels of measurements will be defined. Levels of measurements are incremental, which
means that a particular level contains all the attributes of the previous groups.

3.2.1 Categorical
The first two levels of measurement of statistical variables form a group named categorical variables.
Categorical variables contain data values which do not have any measurable difference between
values. Each value contains a distinct piece of information, without any meaningful way of
comparing them.

Nominal

Nominal is the first level of measurement, which describes named variables.

The variables contain nothing but a label, which describes a class of values. Although nominal
variables can be used for classification tasks, they hold no numerical value.

Numerical calculations cannot be done on nominal variables, and therefore classification is their
only significant use.

Example 3.2.1. A question in a questionnaire asks the following:

What is your favourite colour?

◦ White

◦ Black

◦ Blue

◦ Green

◦ Something else

Whatever the answer may be, it holds only relative and subjective information. The only possible
calculation one might do with these answers is to count how many answers each option received.
This can be used to calculate the popularity of a certain colour, and the counts can be visualized
with bar or pie charts.
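A minimal pandas sketch of this kind of counting, using a hypothetical set of answers to the colour question (the values below are made up for illustration):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical answers to the favourite-colour question.
answers = pd.Series(['Blue', 'Green', 'Blue', 'White', 'Black', 'Blue', 'Green'])

# Counting is the only meaningful numerical operation on nominal data.
counts = answers.value_counts()
print(counts)

# The counts can be visualized with a bar (or pie) chart.
counts.plot(kind='bar')
plt.show()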

Ordinal

As stated before, the next level of measurement, called ordinal, holds all the qualities of the previous
levels. Ordinal variables are nominal, meaning they hold named values, but each value can also be
clearly ordered. The ordering does not capture the difference between values, only the order in
which they are represented.

The true zero of a set of ordinal elements is unknown, and therefore a scale is non-existent. As the
difference between elements is unknown, most numerical calculations are void as well.

Example 3.2.2. The following example represents a question with answers that belong in the
ordinal level of measurement:

How satisfied are you with this book?

1. Very Satisfied

2. Satisfied
3. Neutral

4. Unsatisfied

5. Very Unsatisfied

Ordinal variables, as stated before contain a clear ordering in their set of values. The order in
this case can be interpreted as the level of satisfaction the reader feels after reading a book, ranging
from Very Unsatisfied to Very Satisfied. The answers are incremental, but the difference between each
level of satisfaction is unknown.

3.2.2 Numeric
The two remaining levels of measurements are categorised as numeric. They contain all the
attributes of previous levels, and they hold quantitative information, which can be used during
calculations.
The most common mathematical data types are the ones describing a number. A number is, by
definition, a mathematical object which can be used to count, measure or label. Any written
symbol that describes a number is called a numeral. Over the history of mathematics, different
classifications of numbers have been created. Integers and real numbers are the two most
important number classes for data science.

Discrete (Integer)
The integers are described as the combined set of the natural numbers and their negative counterparts.
0 is the only natural number which does not have a separate negative counterpart, therefore it can
be viewed as the middle element of the integer set. The resulting class of numbers is called integer
and is denoted by Z.
In data science, integers are used to describe data with distinct values. This consequently
means that a discrete data column can only contain easily distinguishable values. Discrete
numbers are often used for classification, as the values create a distinct set of possible values.

Continuous (Real)
Continuous variables are described as containing real numbers, and often represent measurements.
The real numbers are made up of the rational numbers, which include the integers, and the irrational
numbers. A real number represents a value, or point, on the number line; most real numbers are
irrational. Real numbers are denoted by R.
Real numbers often represent real-life measurements which do not take discrete values, for example
stock prices or drug effectiveness data.
Regression problems consist of discovering correlations and connections between attributes in
order to predict outcomes with unknown attribute values.

Interval
Interval data represents values which have an equal difference between each ordered element:
the difference between two consecutive elements is equal and standard. The problem
with interval data is that the zero element of an interval data set is not trivial.
For this reason, only 2 of the 4 basic mathematical operations are allowed, namely addition and
subtraction. Multiplication and division cannot be used in the case of interval data, as shown in the
example below. Interval data can be measured on a scale, which can describe the quantitative difference
of two elements in the interval value set.
Example 3.2.3. The classic example of interval data is temperatures.
What is the current temperature?
1. -20
2. -10
3. 0
4. 10
5. 20
6. 40
In this case, the question itself is ambiguous. The unit of measurement is not clear, which creates an
interesting problem: the measurements could be either in Fahrenheit or in Celsius.
Using different units of measurement yields drastically different values. For example, 0 °C
equals 32 °F. It is clear that 2 × 32 °F is 64 °F, but 2 × 0 °C is still 0 °C, even though both describe the
same physical temperature. This clearly shows the nature of interval variables and the lack
of an absolute zero value.

Ratio
The difference between ratio data and interval data is that, unlike interval data, ratio data always has
a zero value. Ratio values have the properties of interval values which were mentioned above; distances
between values are also equal in this case. All four basic operations can be used without problems
during statistical analysis or transformation.
In ratio data, zero is treated as an absolute zero value, therefore the data cannot contain negative values.
Example 3.2.4. Following the last example, one unit of measurement that is considered a ratio is the
Kelvin temperature scale.
What is the current temperature?

1. 253K
2. 263K
3. 273K
4. 283K
5. 293K
6. 313K
In this example, 0 K represents the absolute zero of temperature. At 0 K, molecules stop
moving and reach a state of minimal entropy. 2 × 253 K yields a temperature that is twice as hot,
which is an attribute of ratio variables. To multiply a temperature given in
Fahrenheit or Celsius, it has to be converted to Kelvin, multiplied, and converted back to its
respective unit.
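This convert-multiply-convert-back procedure can be sketched in a few lines of Python (the function names below are made up for this illustration):

# Doubling a temperature given in Celsius by going through the Kelvin ratio scale.
def celsius_to_kelvin(c):
    return c + 273.15

def kelvin_to_celsius(k):
    return k - 273.15

t_celsius = 0.0
doubled = kelvin_to_celsius(2 * celsius_to_kelvin(t_celsius))
print(doubled)  # 273.15 degrees Celsius, not 0 degrees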
Example 3.2.5. Another example of ratio measurements are height, or weight measurements.
What is your height?
1. 150 cm
2. 160 cm
3. 170 cm
4. 180 cm
5. 190 cm
6. 200 cm
In this example, 0 cm represents a size of absolute zero. Height can be represented by other
measurement units, but the attributes of ratio variables are not violated. Height can be
multiplied or divided in order to get multiples of some height: 150 cm × 3 = 450 cm.

3.3 Statistical Measurements


In statistics, three kinds of central tendencies are usually discussed: the mean, the median,
and the mode. We will discuss these values briefly in order to better understand the nature of
statistical distributions.
The mean is considered to be one of the most important measurements of central tendency. The mean
is calculated as the sum of all values divided by the number of values.
Example 3.3.1. Let A be an array of numerical data of length n. The mean of A can be calculated using
the following formula:

mean(A) = (1/n) Σᵢ₌₁ⁿ Aᵢ
The mode of the data set is the element which occurs most often within the data set. If all
elements of the data set are unique, then there is no mode. Modality describes the number of
peaks a distribution has. For example, the Gaussian distribution is unimodal, as it only has one
peak in its probabilities. Bimodal and multimodal distributions also exist, which have two or more
maximum probabilities at different parameter values. The mode is unique among statistical
measurements, as it can be used even if the values are not numerical. The mode of categorical
values can be a valuable statistic.
The median of the data set describes the element which is in the value-wise middle of the
data set. If the values of the data set are not ordered, they have to be ordered before calculating
the median. The median is a very useful statistical value for representing centrality, as it
represents the value at the middle of the data set after ordering. The median remains
largely unaffected by outliers, and therefore can be a good indicator when searching for such
values.
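All three central tendencies can be computed directly with pandas; a minimal sketch on a small, made-up series of values:

import pandas as pd

a = pd.Series([2, 3, 3, 5, 7, 9, 3, 5])

print(a.mean())    # sum of the values divided by their count: 4.625
print(a.median())  # middle value after ordering: 4.0
print(a.mode())    # most frequent value(s); here 3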

3.3.1 Quantile
The median can be described as the 50% quantile of the data set. Quantiles describe data points
at a given position after ordering. The minimum is the 0% quantile, as it is the data point
right at the start of the ordered data set, which always contains the smallest value. The
maximum can also be called the 100% quantile, for the same reason. The 25% and 75% quantiles,
also known as the lower and upper quartiles, are usually calculated along with the others during
data analysis. Quantiles are a good indicator of how values span across the data set. If the data is
largely consistent while one value stands out from the rest, that value may be an outlier or faulty data.
The range between the 25% and 75% quantiles is often called the interquartile range. This range
measures where the majority of the values of the data set lie.
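The quantiles and the interquartile range can be obtained with numpy; a minimal sketch on made-up data:

import numpy as np

data = np.array([1, 2, 4, 4, 5, 5, 6, 7, 8, 40])  # 40 stands out as a suspicious value

q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)              # 25%, 50% (median) and 75% quantiles
print(q3 - q1)                 # interquartile range
print(data.min(), data.max())  # 0% and 100% quantiles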

3.3.2 Variance
Variance is a statistical measure which can be used to tell how spread out the values are in the data
set. Variance uses the mean and the data values during calculation. The following formula contains
the calculation for the variance:

σ² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)²

where xᵢ is the value of the i-th data point, x̄ is the mean of the data points, and n is the number of
data points. Variance measures the variability of the data set. Variability describes how far the data
points spread from the mean. The variance is expressed in squared units, as the differences between
the data points and the mean are squared. This results in a different scale than the original data set.

3.3.3 Standard Deviation


Standard deviation is the square root of the variance, denoted by σ. This value can be used to better
measure the actual distance between the elements and the mean of the data set, as it is expressed in
the same unit as the data. Distances from the mean are often given as a number of standard deviations:
the range mean ± 1σ encompasses the data that is within one unit of standard deviation of the mean.
The units of standard deviation usually go up to 3; mean ± 3σ usually covers almost the whole distribution.
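Variance and standard deviation in a minimal numpy sketch (note that numpy's var() divides by n by default, matching the formula above):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

variance = x.var()  # mean of the squared deviations from the mean: 4.0
std_dev = x.std()   # square root of the variance: 2.0
print(variance, std_dev)
print(np.isclose(std_dev, np.sqrt(variance)))  # True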

3.3.4 Covariance
Previous statistical measurements have only covered univariate measurements. The number of
parameters describes the type of statistical measurement or distribution: a univariate measurement
takes only one variable and outputs some statistical information, while bivariate functions take two
variables. Two of the most important bivariate measurements are covariance and correlation.
In order to understand what correlation is, it is easier to start with covariance, as the former depends
on the latter. Covariance can be used to describe what the relationship between two variables
is. It is either negative or positive, signifying a negative or positive dependency between the
variables.
The covariance of two variables can be calculated with the following formula:

COV(X, Y) = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ)

where X and Y are the random variables, X̄ and Ȳ are the means of the variables, and n is
the number of elements in the data set. The naming similarity between variance and covariance
is not a coincidence: the formula for the two values is practically identical, with two variables
as opposed to one. Variance calculates the variability of the data points, while covariance
calculates the variability between two random variables. Note, however, that the value of the
covariance is not a good indicator of how strong the connection is, only of whether it is positive or
negative.

3.3.5 Correlation
Correlation describes the relationship between two random variables, or how alike their distributions
are. It can be used to describe the association between elements, and to express one variable's
behaviour in contrast with another.
Understanding the correlation between variables can lead to a better understanding of the data. If two
variables have a strong correlation, it means that they depend on each other, while a weak correlation
can mean that the two variables are independent. Correlation can be used to uncover unforeseen, or
seemingly unseen, relationships between different variables.
Correlation takes the formula of covariance and divides it by the two variables' standard
deviations.

CORR(X, Y) = COV(X, Y) / (σ_X · σ_Y)

where σ_X and σ_Y are the standard deviations of the random variables X and Y.
The value of the correlation lies in the range [−1, 1]. A correlation value of 0 means that there
is no correlation between the variables; the two variables are completely independent from one
another. The value of 1 describes two random variables that are completely dependent on one
another, while −1 describes a relationship that is inversely dependent. An inverse relationship
means that the two variables are correlated and dependent on each other, but as the value of
one variable increases, the other decreases. Therefore, the absolute value of the correlation
can be thought of as the magnitude of the dependency, while its sign is the direction.
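Covariance and correlation of two variables in a minimal numpy sketch (the values of y are made up so that they roughly follow 2·x):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_matrix = np.cov(x, y)        # 2x2 matrix; the off-diagonal entries are COV(X, Y), computed with n - 1
corr_matrix = np.corrcoef(x, y)  # correlation matrix with values in [-1, 1]

print(cov_matrix[0, 1])
print(corr_matrix[0, 1])         # close to 1: a strong positive dependency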

3.3.6 Skewness
Skewness is an indicator of how far a given distribution is from being symmetric. The skewness of
a distribution can be measured in order to determine how asymmetric it is. This measurement can
range from −∞ to +∞. A negative value means that the distribution has a longer tail towards the
negative values on the left, with the bulk of the values lying to the right of the mean. A positive
value means the opposite: the tail is longer on the right side, towards the positive values, with the
bulk of the values on the left of the mean.

3.3.7 Kurtosis
Kurtosis is another important measurement of distributions. The kurtosis value of the data set
describes how heavily populated the tails of the distribution are. High kurtosis means that the tails
of the distribution contain more data compared to a standard distribution. High kurtosis can be an
indicator of outlier data.
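Both measurements are available in scipy.stats; a minimal sketch on a randomly generated, right-skewed and heavy-tailed sample:

import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=10_000)

print(stats.skew(sample))      # positive: the bulk of the values is on the left, with a long right tail
print(stats.kurtosis(sample))  # positive excess kurtosis: heavier tails than a normal distribution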

3.4 Distributions
A probability distribution can be described as a function which assigns a probability to each
possible subset of outcomes of a random experiment. The inputs do not need to be numerical, as
random variables can be used which map the input onto a numeric space, as described before. In this
chapter, both discrete and continuous probability distributions are discussed.
Distributions are an essential part of the everyday work of a data scientist. Data analysis consists
mostly of applying statistical knowledge to the data in order to closely inspect its attributes. Exploratory
Data Analysis can uncover hidden characteristics of the data set that can be of great help when creating
machine learning models.
Data have a wide range of statistical attributes that can be wildly different from one another.
However, the set of data we are working with, or plan to work with, is always a sample of the real
world. The distributions in the data represent certain attributes of the features of the data. By
learning about statistical distributions, we enable a better understanding of the data and the
ability to predict it well.
Data sets are usually composed of numerical and categorical data columns. The numerical
columns can be either continuous or discrete. Discrete random variables are described by
probability mass functions, while continuous random variables are described by probability
density functions. In the following sections, several frequently used distributions will be explored.

3.4.1 Bernoulli Distribution


The Bernoulli distribution serves as a starting point for more complex distributions. It provides
an easy to understand explanation which can still be used in everyday life and during data
analysis. The Bernoulli distribution is a discrete probability distribution that can take only one
input, and the output can only be binary. This means that the Bernoulli distribution describes a
probability distribution of a random variable X over two possibilities with one trial. This
distribution can be used to model a simple true/false question. The probability mass function of the
Bernoulli distribution can be described formally by the following formula:

P_X(x) =
    p        if x = 0
    1 − p    if x = 1
    0        otherwise

The corresponding distribution function can be formulated in the following way:

D_X(x) =
    p        if x = 0
    1 − p    if x = 1

Example 3.4.1. The easiest example for the bernoulli distribution is a simple coin toss. In case the coin
is fair the probability distribution of the coin toss is the following:

P_X(x) =
    1/2    if x = 0
    1/2    if x = 1

Where 0 is the numeric representation of heads and 1 is tails. Any other probability value in this
distribution would mean that the coin is not fair.

In the following example, a basic bernoulli distribution plotting method is created in order to
visualize the distribution of different probability pairs.

[ ]: import pandas as pd
     import numpy as np
     import matplotlib.pyplot as plt

[ ]: def plot_bernoulli(x, y):
         plt.bar(y, x)
         plt.title('Bernoulli Distribution', fontsize=12)
         plt.ylabel('Probability', fontsize=12)
         plt.xlabel('Outcome', fontsize=12)
         axes = plt.gca()
         axes.set_ylim([0, 1])

[3]: x = np.array([0.5, 0.5])
     coin = [0, 1]
     plot_bernoulli(x, coin)

The following example shows a loaded coin, which has different probabilities for each of its
faces.

[4]: x = np.array([0.7, 0.3])
     coin = [0, 1]
     plot_bernoulli(x, coin)

3.4.2 Uniform Distribution


The uniform distribution is another easy to understand and implement distribution. Think of
an evenly distributed Bernoulli distribution, where the sample space Ω can contain any number
of outcomes. Therefore, besides 0 and 1, any number of other outcomes can be mapped by the
random variable onto the numeric space. The uniform distribution describes a set of data where all
outcomes happen with the same probability. There is a finite number of outcomes, and all of them
happen with equal probability. Therefore, whatever the number of possibilities, each one of them
has the probability 1/n, where n is the number of possibilities.
Example 3.4.2. The easiest example for the uniform distribution is a dice throw. The following
distribution contains the probabilities of a fair dice landing on a given face:

P_X(x) =
    1/6    if x = 1
    1/6    if x = 2
    1/6    if x = 3
    1/6    if x = 4
    1/6    if x = 5
    1/6    if x = 6

[1]: import pandas as pd
     import numpy as np
     import matplotlib.pyplot as plt

[2]: def plot_uniform(x, y):
         plt.bar(y, x)
         plt.ylabel('Probability', fontsize=12)
         plt.xlabel('Outcome', fontsize=12)
         plt.title('Uniform Distribution', fontsize=12)
         axes = plt.gca()
         axes.set_ylim([0, 1])

[3]: x = np.full((6), 1/6)
     y = [1, 2, 3, 4, 5, 6]

[4]: plot_uniform(x,y)

[5]: x = np.array([1/6, 2/6, 0, 1/6, 2/6, 0])
     y = [1, 2, 3, 4, 5, 6]
     plot_uniform(x, y)

3.4.3 Binomial Distribution


The binomial distribution is a discrete probability distribution that counts the successes in a
sequence of experiments. The Bernoulli distribution can be thought of as a special case of the
binomial distribution where the number of trials is one. Just as in the Bernoulli distribution, the
outcomes will always be boolean, meaning either a true or a false value, usually represented
with 0 and 1. The process of experimenting is also called a Bernoulli process. An important
property of the binomial distribution is that it models n experiments with replacement
from a population of N.

The random variable of the binomial distribution describes the number of cases where individual
experiments yield a true value. The distribution has two independent parameters:

1. Probability associated to one of the cases

2. Number of experiments

If the random variable X follows the binomial distribution with parameters n and p, we write
X ∼ B(n, p). The probability of seeing exactly k successes out of n independent Bernoulli
experiments can be calculated using the probability mass function of the distribution:

f(k, n, p) = P(k; n, p) = P(X = k) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ

where C(n, k) = n! / (k! (n − k)!) is the binomial coefficient.

Example 3.4.3. As with the bernoulli distribution example, we will use coin tosses to model the
binomial distribution. We’ll look at the distribution of a series of fair coin tosses, and a series of unfair
coin tosses.

If you hold 7 trials and want to calculate the probability of exactly 3 of them coming out true,
with the probability of a single trial coming out true being 0.3:

P(3; 7, 0.3) = C(7, 3) · 0.3³ · (1 − 0.3)⁷⁻³ = 35 · 0.0064827 = 0.2268945

Therefore, the probability of 3 trues appearing out of 7 experiments is roughly 0.227, or about 23%.
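The same value can be checked with scipy's binomial probability mass function; a minimal sketch:

import scipy.stats as stats

# P(X = 3) for n = 7 trials with success probability p = 0.3
print(stats.binom.pmf(3, 7, 0.3))  # approximately 0.2269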

In the following code blocks, the binomial distribution is plotted for a given number of experiments
and success probability. The x axis contains the number of successes k. On the y axis, the
probability associated with exactly k successes can be read.

[1]: import pandas as pd
     import numpy as np
     import matplotlib.pyplot as plt
     import scipy.stats as stats
     import seaborn as sns

The Scipy package is used to calculate the probabilities for a given number of elements and
probability. Scipy provides numerous algorithms in the field of mathematics.

[2]: def plot_binom(prob, num):
         x = np.arange(0, num + 10)
         binom = stats.binom.pmf(x, num, prob)
         plt.plot(x, binom, label="p = {:f}".format(prob))
         plt.xlabel('Random Variable', fontsize=12)
         plt.ylabel('Probability', fontsize=12)
         plt.title("Binomial Distribution")
         plt.legend()

[3]: plot_binom(.5, 30)



[4]: plot_binom(.7, 30)



3.4.4 Gaussian Distribution


One of the most important and widely used distributions is the Gaussian, or normal, distribution.
This distribution is special, as there are numerous examples of seemingly unrelated quantities that
follow it in everyday life. Both the number of such examples and the examples themselves can be
surprising, and the fact that so many unrelated quantities follow the normal distribution can seem
unexplainable. The probability density function of the distribution is:

f(x) = (1 / (σ · √(2π))) · e^(−½ · ((x − µ) / σ)²)
In the formula, µ is the mean of the distribution and σ is its standard deviation.
The distribution appears so often in nature that it is often used for continuous random variables whose
distribution is unknown. The importance of this distribution can be accredited to the central limit
theorem. This theorem states that the mean of many samples of a random variable with finite mean
and variance is a random variable whose distribution converges to a normal distribution as the
number of samples increases. In the real world, the sum of independent processes often follows
distributions that are small variations of the normal distribution. The normal distribution is unimodal,
meaning that it only contains a single peak.
If a random variable follows a normal distribution, it is written as X ∼ N(µ, σ²). A normal distribution
is also often called a bell curve, as its distribution over a large number of elements resembles a
bell-like shape. Examples of the Gaussian distribution include school grades, financial distributions,
and average heights, among many more.

An interesting property of the gaussian distribution is that the mean, mode and median are all
exactly the same value. The distribution is also symmetric, which can be derived from the
previous property. The area under the curve always adds up to one.

Data sets that are not normally distributed can often be transformed closer to a normal distribution
using transformations such as the logarithm or the square root.

Example 3.4.4. In the following code blocks, a function to display normal distributions of various
shapes is created. The mean µ and the standard deviation σ can be passed as parameters in order to
display normal distributions with different parameters.

[1]: import pandas as pd
     import numpy as np
     import matplotlib.pyplot as plt
     import scipy.stats as stats

[2]: def plot_normal(mu, sigma):
         n = np.arange(mu - (mu*sigma)/10 - sigma*4, mu + (mu*sigma)/10 + sigma*4)
         normal = stats.norm.pdf(n, mu, sigma)
         plt.plot(n, normal)
         plt.xlabel('Distribution', fontsize=12)
         plt.ylabel('Probability', fontsize=12)
         plt.title("Normal Distribution")

[3]: plot_normal(20, 8)

3.4.5 Poisson Distribution


The Poisson distribution is a unique discrete distribution that can be used to express the probability of
events over a given number of experiments. The name of the distribution comes from the
mathematician Siméon Denis Poisson. Given a fixed number of experiments, or a frame of time, the
distribution expresses the probability of a given event happening exactly k times. If k = 1,
the distribution can be used to describe the probability of an event happening across n
experiments. If the experiments are the minutes, or seconds, of time, then it can be used to
analyze a certain event happening within a frame of time.
There are properties and prerequisites of using the Poisson distribution to analyze such
probabilities:
1. Two events cannot happen at the same time.
2. Any event can happen any number of times in the given number of experiments.
3. Every event in the event space is independent of the others. This means that there can
be no connection between events that might increase the probability of one happening.
4. The average rate of events is fixed.
The probability mass function of the Poisson distribution can be expressed with the following
formula:

f(k; λ) = P(X = k) = (λᵏ · e^(−λ)) / k!
X is a random variable following a Poisson distribution, which can be expressed as X ∼ Pois(λ). The
λ parameter is the rate at which events happen within the given interval or number of experiments.
λ is a positive real number which is equal to the expected value of the random variable X:
λ = E(X).
Example 3.4.5. The probability that a given event happens in n experiments can be
calculated using the previously discussed probability mass function. One famous example for
the Poisson distribution is the number of meteors that hit earth throughout a given time frame
of years. Meteors can be modelled with this distribution because they are independent, and
the average number of meteors that get close to earth's atmosphere is relatively constant. One
condition we have to make is that meteor strikes cannot happen simultaneously.
We will say that meteors actually hit earth once every one hundred years. What is the probability that
a meteorite hits earth in the next hundred years? Since meteorite hits happen every 100 years, the
rate value λ is 1.

P(k = 1 in the next 100 years) = (1¹ · e^(−1)) / 1! = 0.36787944117

Therefore, the probability that one meteor will crash into earth is roughly 37%
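The same value can be verified with scipy's Poisson probability mass function; a minimal sketch:

import scipy.stats as stats

# P(X = 1) with rate lambda = 1
print(stats.poisson.pmf(1, 1))  # approximately 0.3679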
The following code blocks show the use of Scipy stats' Poisson distribution functions.
The distribution is plotted with matplotlib.

[1]: import numpy as np
     import matplotlib.pyplot as plt
     import scipy.stats as stats

[2]: def plot_poisson(n, lambd):
         x = np.arange(0, n)
         poisson = stats.poisson.pmf(x, lambd)
         plt.plot(x, poisson, '-o', label="lambda = {:f}".format(lambd))
         plt.xlabel('Number of Events', fontsize=12)
         plt.ylabel('Probability', fontsize=12)
         plt.title("Poisson Distribution varying lambda")
         plt.legend()

[3]: plot_poisson(20, 10)

3.4.6 Exponential Distribution


The exponential distribution is the last distribution this lecture note discusses. This continuous
distribution can be used to describe the probability of a given event happening within a time frame.
It describes the time between events of a so-called Poisson point process.
The probability density function of the exponential distribution can be expressed with
f(x; λ) =
    λ · e^(−λ·x)    if x ≥ 0
    0               if x < 0
Here, λ is the rate of the given event happening, called the rate parameter. The same parameter
is present in the Poisson distribution. If the distribution of the random variable X is exponential,
we denote it by X ∼ Exp(λ). The difference between the Poisson and the exponential distribution
is that the Poisson distribution is used when dealing with the number of occurrences in a given
time frame, while the exponential distribution deals with the time that passes until an event
occurs.

Exponential distribution is a special case of the Gamma distribution. The gamma distribution is
used to calculate the probability of the kth event happening within a time frame. Exponential
distribution is the gamma distribution with k parameter set to one.

Example 3.4.6. The classic example for the exponential distribution is survival analysis. Survival
analysis deals with the probability that a given mechanical part, or device, survives a given time
frame, usually given in years. An example is the probability that a car tire has to be changed
within 3 years, according to the average time between car tire changes. The distribution is called
exponential, as the chance of survival is high while the part is new, but it decreases exponentially over time.
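A minimal sketch of such a calculation, assuming a purely hypothetical average of one tire change every 5 years (so a rate of λ = 1/5 per year); the rate value is an assumption made only for illustration:

import numpy as np

lambd = 1 / 5   # assumed rate: one tire change every 5 years on average
t = 3           # time frame in years

# CDF of the exponential distribution: probability that the tire has to be changed within t years
p_change = 1 - np.exp(-lambd * t)
print(p_change)  # approximately 0.451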

[1]: import numpy as np
     import matplotlib.pyplot as plt
     import scipy.stats as stats

[2]: def plot_exponential(n, lambd):
         x = np.arange(0, n, 0.1)
         y = lambd * np.exp(-lambd * x)
         plt.plot(x, y, label="lambda = {:f}".format(lambd))
         plt.xlabel('Random Variable', fontsize=12)
         plt.ylabel('Probability', fontsize=12)
         plt.title("Exponential Distribution")
         plt.legend()

[3]: plot_exponential(12, .4)



Question 42.
Why is randomness so hard to discuss both in a mathematical sense, and in computer science?
What are the main obstacles of implementing random algorithms on a computer?

Question 43.
What does Ω represent in probability theory?

Question 44.
What values are represented in a cumulative distribution function?

Question 45.
What are the statistical data types of which no difference can be measured?

Question 46.
Which categorical value can be ordered, but no distance can be measured between elements?

Question 47.
How are non-categorical statistical data categorized?

Question 48.
Which domains of numbers are usually represented as statistical data types?

Question 49.
What are the differences between numeric interval, and numeric ratio data types?

Question 50.
What are the three measurements which can be classified as "central tendencies"?

Question 51.
Which measurement can be used on categorical data?

Question 52.
Which quantiles are the maximum, the minimum, and the median of the data?

Question 53.
What is the interquartile range of the data set?

Question 54.
What does variance measure in a data set?

Question 55.
What is the relationship between the variance and standard deviation of the random variable
X?

Question 56.
How many random variables are needed in order to calculate the covariance?

Question 57.
What does correlation measure in a data set? How can you use it to tell which sets of data are
similar?

Question 58.
What are the two measurements that can be used to measure how different the distribution of a
random variable is from a standard distribution?

Question 59.
What distribution would you use to model whether or not a dice roll lands on 4 or a larger
number?

Question 60.
What are the unique properties of the uniform distribution?

Question 61.
Which distribution is a special case of the binomial distribution? What is the parameter that
creates this special case?

Question 62.
Which distribution is related heavily to the central limit theorem?

Question 63.
What central tendencies does the normal distribution have?

Question 64.
Which distribution can be used to estimate the number of customers in a restaurant in a given
time frame?

Question 65.
Which distribution can be used to estimate the probability of a customer arriving in a given
time frame?
Chapter 4

Data Mining

It is said that every two years the amount of recorded data doubles. This means that every
two years more data is generated than there ever was before. This notion alone gives the
science community and the industry the motivation to develop new ways to handle data. The
storage and handling of data itself has not been a significant problem in modern times. A huge
number of industries and other sectors accumulate gigantic data sets every day. The problems of
measuring, data handling, and data storage are also part of the data processing process, but they
are usually handled by data engineers. The knowledge of data engineering is quite different from
what the reader can read about in this lecture note. Nevertheless, the subject is closely related to
applied data analysis, as the responsibility of data engineering is to provide the data for later
processing by statistical and machine learning methods.
Data science encompasses many different sub-fields and applications. The foundation of these
processes is data analysis. Mathematicians have long developed methods and algorithms which are
useful when dealing with large sets of data. The field of statistics was specialised in making sense
of large data sets. Formal methods of statistics can be used in order to extract useful knowledge
about data.
Data analysis is the process of using methods and techniques to extract previously unknown
properties and information from organized data. Of course, there are a lot of methods borrowed
from the field of statistics.
The process of analyzing data can be as simple as looking at the data and intuitively figuring out
inferences in it. In truth, many of those we now call data scientists started their careers as
statisticians. Recently, data science has drifted away from the theoretical world of mathematics
towards computer science.
The main goal of data analysis is to understand what kind of values, data types and other
quirks are included in the data set. The extraction of this information is essential, as methods and
applications heavily rely on this knowledge.

4.1 Data Structures


After discussing the data types usually found in data sets, the next step is to introduce data
structures that can be used during data analysis. Data is the plural form of the rarely used
word datum. Being rarely used suggests that situations in which a datum appears on its own are
nearly non-existent.

For this reason, creating structures of data which represent complex data is
necessary. One of the simplest ways to organize data is to put it into an array of a given data type.
Mathematical data structures such as vectors and matrices were described in algebra long before
computers existed [13, 20].

Arrays

Example 4.1.1. Let’s imagine that you are interested in predicting the weather 1 hour from your
current time. In this example, all you have is a thermometer, paper and pen to work with.
If you start writing down the temperature after every hour, you will get an array of temperatures.
The logical way to record temperatures is with real numbers. In order to digitize this data, you
have to save it into a float or double array.

After you have created this array, you have successfully put your data into a structure that is easy
to use and understand.

Please note that the data in this array is not indexed by time. The only indexing it contains is the
sequence in which the data was written down.

15,5 16,4 17,6 18,1 19,2 19,5 19,3 19,4 19,5 18,9

While arrays of data can be processed easily, they cannot be intuitively understood by looking at the
data alone. The temperature values could be represented in different units of measurement, or
could have been recorded at different times. Therefore, external information must be given in order
for arrays of data to be understood.
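Digitized, the recorded temperatures form a one-dimensional numpy array; a minimal sketch:

import numpy as np

# Hourly temperatures from the example, stored as floating point values.
temperatures = np.array([15.5, 16.4, 17.6, 18.1, 19.2, 19.5, 19.3, 19.4, 19.5, 18.9])

print(temperatures.dtype)  # float64
print(temperatures[0])     # the first recorded value; the index carries no time information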

Matrices

Adding another dimension to an array results in the creation of a matrix. Matrices contain rows of
data, where each row can represent an array of data.

Just as in arrays, the data in a matrix is not labelled. Therefore, the data structure itself
does not provide important information about the data.

Example 4.1.2. The data from the last example can be structured further. After you have
recorded the temperatures of a day, you can begin recording temperatures of the next days. After
you have recorded the temperatures of a few days, you can arrange them into a matrix. The data
can be sorted so that each row can represent a different day, while each column represents a different
point in time.

15,5 16,4 17,6 18,1 19,2 19,5 19,3 19,4 19,5 18,9
15,3 15,6 16 16,2 17 17,2 17,3 15,7 15,3 15,1
15,6 15,7 16,2 16,9 18,1 19,2 19,9 20,8 19,6 18,3

This structure makes the processing of multiple days easy.
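The same data arranged as a matrix, a minimal numpy sketch where each row is a day and each column a point in time:

import numpy as np

temperatures = np.array([
    [15.5, 16.4, 17.6, 18.1, 19.2, 19.5, 19.3, 19.4, 19.5, 18.9],
    [15.3, 15.6, 16.0, 16.2, 17.0, 17.2, 17.3, 15.7, 15.3, 15.1],
    [15.6, 15.7, 16.2, 16.9, 18.1, 19.2, 19.9, 20.8, 19.6, 18.3],
])

print(temperatures.shape)  # (3, 10): 3 days, 10 measurements per day
print(temperatures[1])     # all measurements of the second day
print(temperatures[:, 0])  # the first measurement of every day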



4.1.1 Data set


A data set is a collection of data organized in columns and rows. A data set can contain columns
of different data types. Each column can be labelled, which extends the semantic definition of the
vector of values it contains. Columns represent different variables, or features, of the stored data.
Each row of the data set represents a different instance of the collected data. Each column can be
any of the previously mentioned statistical data types, a string of characters, or logical values.
As stated previously, different levels of measurement apply to different column
types. One of the widely used applications that can handle data sets is Microsoft Excel. Excel
organizes data into tables in which separate data elements can be organized, graphed, and
calculated with.
Excel is one of the most widely available spreadsheet programs in the world. Most
schools teach at least the basic use of Microsoft Excel, therefore it is a preferred tool for many.
However, spreadsheet software often lacks the advanced features which would enable it to analyze
big data.
Other alternatives to Excel include Google Spreadsheets and OpenOffice Calc, which are free, as
opposed to Microsoft's option.

If a spreadsheet software is lacking in features for a certain problem, it is advisable to use more
advanced tools, which were described in the previous chapter Tools. Advanced tools most often do
not have their own proprietary extensions, but some can parse spreadsheet encoding. However, it
is advisable to use CSV as a format for data processing tasks. All of the mentioned software are
capable of exporting data as a CSV.

Features
After parsing data into the spreadsheet software or data processing framework of choice, it is
essential to separate the data based on its description. Data sets usually come with descriptions of
what each feature is and what the purpose of the data set is. A feature is a column representing a
measurement in an abstract data set. Features can be of any previously mentioned data type.
In order to parse data of certain types, for example strings or characters, special encoding techniques
can be used [11]. Similarly, standardization of numeric data can increase the effectiveness of
methods. Encoding techniques can be used to transform otherwise unusable data into a processable
format. Encoding techniques include:
1. One Hot Encoding

2. Label Encoding
3. Embedding
Some of these techniques will be explored in the following chapters; a short one hot encoding sketch is shown below.
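A minimal pandas sketch of one hot encoding, using a hypothetical colour feature made up for this illustration:

import pandas as pd

df = pd.DataFrame({'colour': ['red', 'green', 'blue', 'green']})

# One hot encoding: each category becomes its own binary column.
encoded = pd.get_dummies(df, columns=['colour'])
print(encoded)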

Target Values

If your goal is to find inferences in the data in order to predict a feature, you are either classifying an
object, or solving a regression problem. In both of these cases, there are a number of features which
you use to predict a different feature. The feature you are trying to predict is usually called the
target feature.

Data set descriptions usually come with predefined target features, but you could choose any
feature as a target, as long as it is processable with statistical methods, or machine learning.
Example 4.1.3. In this example, a CSV will be imported from the local file system into Pandas. Some
components have been already explored in subsection 2.6.3.

[1]: import pandas as pd


import numpy as np

[2]: df=pd.read_csv('hotel_bookings.csv')

[3]: df.head()

[3]: hotel is_canceled lead_time arrival_date_year arrival_date_month␣


↪→ \
0 Resort Hotel 0 342 2015 July
1 Resort Hotel 0 737 2015 July
2 Resort Hotel 0 7 2015 July
3 Resort Hotel 0 13 2015 July
4 Resort Hotel 0 14 2015 July

arrival_date_week_number arrival_date_day_of_month \
0 27 1
1 27 1
2 27 1
3 27 1
4 27 1

stays_in_weekend_nights stays_in_week_nights adults ... deposit_type \


0 0 0 2 No Deposit
1 0 0 2 No Deposit
2 0 1 1 No Deposit
3 0 1 1 No Deposit
4 0 2 2 No Deposit

agent company days_in_waiting_list customer_type adr \


0 NaN NaN 0 Transient 0.0
1 NaN NaN 0 Transient 0.0
2 NaN NaN 0 Transient 75.0
3 304.0 NaN 0 Transient 75.0
4 240.0 NaN 0 Transient 98.0

required_car_parking_spaces total_of_special_requests reservation_status␣


↪→ \
0 0 0 Check-Out
1 0 0 Check-Out
2 0 0 Check-Out
3 0 0 Check-Out
4 0 1 Check-Out

reservation_status_date
0 2015-07-01
1 2015-07-01
2 2015-07-02
3 2015-07-02
4 2015-07-03

[5 rows x 32 columns]

[4]: df.describe()

In the following block, general statistics are displayed with the describe() function. It is always a
good idea to check the data's statistical parameters before transforming it. Sanity checking helps
the developer to avoid anomalies and to better understand the data set.

[4]: is_canceled lead_time arrival_date_year \


count 119390.000000 119390.000000 119390.000000
mean 0.370416 104.011416 2016.156554
std 0.482918 106.863097 0.707476
min 0.000000 0.000000 2015.000000
25% 0.000000 18.000000 2016.000000
50% 0.000000 69.000000 2016.000000
75% 1.000000 160.000000 2017.000000
max 1.000000 737.000000 2017.000000

arrival_date_week_number arrival_date_day_of_month \
count 119390.000000 119390.000000
mean 27.165173 15.798241
std 13.605138 8.780829
min 1.000000 1.000000
25% 16.000000 8.000000
50% 28.000000 16.000000
75% 38.000000 23.000000
max 53.000000 31.000000

stays_in_weekend_nights stays_in_week_nights adults \


count 119390.000000 119390.000000 119390.000000
mean 0.927599 2.500302 1.856403
std 0.998613 1.908286 0.579261
min 0.000000 0.000000 0.000000
25% 0.000000 1.000000 2.000000
50% 1.000000 2.000000 2.000000
75% 2.000000 3.000000 2.000000
max 19.000000 50.000000 55.000000

children babies is_repeated_guest \


count 119386.000000 119390.000000 119390.000000

mean 0.103890 0.007949 0.031912


std 0.398561 0.097436 0.175767
min 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000
max 10.000000 10.000000 1.000000

previous_cancellations previous_bookings_not_canceled \
count 119390.000000 119390.000000
mean 0.087118 0.137097
std 0.844336 1.497437
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 26.000000 72.000000

booking_changes agent company days_in_waiting_list \


count 119390.000000 103050.000000 6797.000000 119390.000000
mean 0.221124 86.693382 189.266735 2.321149
std 0.652306 110.774548 131.655015 17.594721
min 0.000000 1.000000 6.000000 0.000000
25% 0.000000 9.000000 62.000000 0.000000
50% 0.000000 14.000000 179.000000 0.000000
75% 0.000000 229.000000 270.000000 0.000000
max 21.000000 535.000000 543.000000 391.000000

adr required_car_parking_spaces total_of_special_requests


count 119390.000000 119390.000000 119390.000000
mean 101.831122 0.062518 0.571363
std 50.535790 0.245291 0.792798
min -6.380000 0.000000 0.000000
25% 69.290000 0.000000 0.000000
50% 94.575000 0.000000 0.000000
75% 126.000000 0.000000 1.000000
max 5400.000000 8.000000 5.000000

4.2 Data Mining


Data mining is one sub-field among many in data science [8, 7]. Data mining focuses on analyzing
data sets in order to extract new and useful information. This knowledge is used to draw new
conclusions and recognize relationships between attributes.
The first step of data mining usually consists of multiple sub-steps which process the data set.
This transformation is done in order for the data mining algorithms to work effectively.
The primary goal of data mining is to transform data in such a way that the created data sets
contain nothing but relevant information for a specific task.

Machine learning is another subset of data science that is closely related to data mining. Machine
learning models are usually made up of black box algorithms. Machine learning uses artificial
intelligence models to learn the inferences of a specific data set. The models are trained to recognize
attributes without explicit instructions.
Machine learning usually utilizes the processed data set created by the data mining process.
The data set usually contains hidden knowledge and inferences, which are utilized to create
applications which can seemingly make intelligent choices.

4.3 Data Mining Process


Massive data sets often contain more data than a human could comprehensively understand, which
makes the processing and the extraction of useful information from them difficult.
Data mining consists of well defined steps that make up the processing of, and knowledge extraction
from, the data. These steps are explored further in this section.

4.3.1 Problem definition


The first problem one might encounter is the problem definition itself. At the start of a data
science project or research, the problem should be explored thoroughly.
The following problem statements should be well defined before starting the further steps:
• What is the end goal of the project?
• What problems might be encountered during development?
• Potential solutions to listed problems.

4.3.2 Data Selection


After the goals have been set, appropriate data sets have to be found, and loaded. Data mining
requires an extensive amount of data in order to find meaningful information. There are two
scenarios:
• The data is already available, and the goal is to extract information.
• The goal is clear, but data need to be gathered, or data sets need to be found from other
sources.
The problem with the second option is finding data that is appropriate for the given problem. There
is a wide array of data sets that can be found on the internet. One should look for open source data
sources in case the data is not provided.
Websites such as Kaggle and the UC Irvine Machine Learning Repository contain repositories
of data sets and helpful materials for learning and improving in data science and machine
learning.

4.3.3 Data Loading


After the data set has been selected, it needs to be loaded. The size and complexity of the data
often dictate the tools to use. Smaller and simpler data sets, such as small CSV files, can be
loaded into Microsoft Excel or other spreadsheet programs. These tools can be used to
create simple statistics, diagrams, and extract information from smaller data sets.
On the other hand, complex data sets that contain a high number of features require more sophisticated
tools to manage. These tools often require deeper knowledge in the field of computer science to
use. One of the most widely used tools for managing enormous data sets is the Pandas package
of the Python programming language.

Feature Selection
Data sets often contain a lot of different features. These features describe different values of
specific elements. These instances often contain redundant or irrelevant data with respect to the
problem at hand.
Therefore, collecting information about the features of the data set is essential for making informed
decisions about which features to omit.
After the features have been selected, they can be extracted into a data set of reduced size. The
key to selecting features is to not lose any information related to the solution of the problem.
The most important tool while selecting data is common sense. Complex systems can detect
certain attributes, or compare elements in order to calculate a ratio, but they cannot deduce which
scenarios are impossible.
Example 4.3.1. For example, let us look at a data set which contains information about vehicles. The
following features can be found in the data set:

Vehicle Name    # of Wheels    Has Wings    Speed    Colour

If your task was to analyse boats in this data set, it would be sensible not to include the "# of
Wheels" and "Has Wings" features. Of course, if you simply omit these features, vehicles with wheels
or wings would be indistinguishable from boats. Therefore, the data would have to be filtered first in
order to remove rows containing other vehicles.
Example 4.3.2. Let us look at a data set with the following features:

Name    Country    Age    Sex    AVG. Work Hours    AVG. Salary / Month
Height    Weight    Birthday Date    Marital Status    # of Pets    Education

Question 66.
Imagine working at a company where your job is to analyse worker efficiency and satisfaction.
Worker satisfaction is rated from 1 (Worst) to 10 (Best).
Which features would you omit in order to increase the efficiency of your machine learning system?
After you have thought about the question for a moment, you may have realised that choosing
features based on these criteria is not an easy task. You could guess what kind of data types a
feature might hold, but without this knowledge, it is hard to make an informed decision.
Methods have already been created that are capable of selecting features that are redundant for
the solution of the problem. While these methods are important, they cannot replace common
sense.

For this reason, the first step should always be the manual analysis of the data set. This
increases the effectiveness of the aforementioned automatic feature selection methods, because the
algorithms have a reduced number of features to work with.
The following two algorithms are often used in order to reduce the number of redundant features
in data sets.
Example 4.3.3. Let us once again load the data with the pandas package. The Iris data set will be
loaded, and different selection methods will be presented.

[1]: import pandas as pd


import numpy as np

[2]: df=pd.read_csv('[Link]')

[3]: df.head()

[3]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species


0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa

The column names of the data set can be examined by viewing the .columns attribute of the data
frame.
[4]: df.columns

[4]: Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',


'Species'],
dtype='object')

Columns and rows can be selected using 2 different selection methods.


1. .loc[] which returns rows/columns based on the name.
2. .iloc[] which returns rows/columns based on index.
In the first example we will select the first 100 rows and the columns named "SepalLengthCm" and
"SepalWidthCm". Rows are indexed by numbers if they are not named; otherwise, they would
have to be accessed by their names as well.

[5]: df_selected=df.loc[0:100, 'SepalLengthCm':'SepalWidthCm']

[6]: df_selected.head()

[6]: SepalLengthCm SepalWidthCm


0 5.1 3.5
1 4.9 3.0
2 4.7 3.2

3 4.6 3.1
4 5.0 3.6

The second example uses selection based on indexes. Rows from index 100 onward are selected,
together with the columns from the third column onward.
[7]: df_selected_id=df.iloc[100:, 3:]

[8]: df_selected_id.head()

[8]: PetalLengthCm PetalWidthCm Species


100 6.0 2.5 Iris-virginica
101 5.1 1.9 Iris-virginica
102 5.9 2.1 Iris-virginica
103 5.6 1.8 Iris-virginica
104 5.8 2.2 Iris-virginica

4.3.4 Principal Component Analysis


Principal Component Analysis (PCA) [23, 2] is a method which attempts to find a good N-dimensional
representation of the original M-dimensional data. Performing PCA on an M-dimensional
data set produces principal components, of which the first N best describe the
data set in N dimensions. PCA is often used to reduce the dimensionality of data in order to
increase effectiveness, and to better visualize high-dimensional data.
PCA determines algebraically which features contribute the most to the overall
information of the data set. For this reason, PCA can also be used to prevent overfitting in machine
learning models, as PCA will drop unimportant relationships in the data. The outcome of the PCA
algorithm is M principal components, each of which is some combination of the original
features.
PCA is a very important algorithm that you'll most likely use in the future. When should you
use it?
1. You would like to reduce the dimensionality of your data.
2. You would like to make sure that the resulting principal component vectors are independent of one another.
3. You don't need the semantic information of the original features.
PCA is an orthogonal linear transformation which transforms the original data set into a new
coordinate system. The new coordinates depend on the covariance of the data set's features. The
direction along which the data varies the most determines the first coordinate, the second largest
variation determines the second, and so on until the Nth coordinate. This is done by calculating the
eigenvalues and eigenvectors of the covariance matrix. The eigenvectors are orthogonal to each other,
and each determines an axis of essential information of the original data set.
Sorting the eigenvectors by decreasing eigenvalue results in a matrix whose columns are ordered by
the amount of variance they explain. The new features are obtained by multiplying the standardized
data set by this sorted eigenvector matrix. The resulting data set contains principal components that
are orthogonal to each other; therefore, the principal components are linearly independent of each
other. Lastly, N of these principal components are selected in order to transform the data into
dimension N.
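To make these steps concrete, the following minimal sketch performs PCA by hand on a small random matrix with numpy: standardize the data, compute the covariance matrix, take its eigendecomposition, sort the eigenvectors by decreasing eigenvalue, and project onto the first N of them. This is only an illustration of the algebra described above (the random data and variable names are made up for the example); the examples below use the ready-made scikit-learn implementation.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))                      # toy data: 150 rows, M = 4 features

# 1. Standardize the data (zero mean, unit variance per feature).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (M x M).
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort the eigenvectors by decreasing eigenvalue (variance explained).
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]

# 5. Keep N components and project the standardized data onto them.
N = 2
principal_components = X_std @ eigenvectors[:, :N]
print(principal_components.shape)                  # (150, 2)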

Example 4.3.4. Luckily, the algorithms and methods covered in this lecture note have already been
implemented, and can be used readily by the reader. The PCA algorithm has already been
shown in chapter 2.

Nevertheless, we will proceed with the newfound knowledge and apply it again, first with two
and then with three principal components.

Once again, the training data and targets are separated. The data is scaled, while the targets are
encoded using the LabelEncoder class.

[1]: import pandas as pd


import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import preprocessing

[2]: iris_df=pd.read_csv('[Link]')

[3]: x, y=iris_df.iloc[:,1:-1], iris_df.iloc[:,-1:]

[4]: x_scaled=StandardScaler().fit_transform(x)

[5]: le = preprocessing.LabelEncoder()
y_num=le.fit_transform(y)

The following block contains the PCA transformation of the original data set. The PCA is set to
extract 2 principal components with the n_components parameter, resulting in a two dimensional data
set. The principal component data frame is created with the columns of pc1, pc2.

[6]: pca = PCA(n_components=2)


principalDf = pd.DataFrame(data=pca.fit_transform(x_scaled), columns=['pc1', 'pc2'])

[7]: fig = plt.figure(figsize = (8,8))

plt.scatter(principalDf.values[:, 0], principalDf.values[:, 1], c=y_num)

[7]: <matplotlib.collections.PathCollection at 0x21c649be548>



The following code block creates a three dimensional principal component data frame with
n_components set to 3. After the components have been extracted, the three dimensional data frame
is plotted using matplotlib’s three dimensional plotting capabilities.

[8]: from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure(1, figsize=(8,8))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=45, azim=110)

plt.cla()
pca = PCA(n_components=3)
principalDf = pd.DataFrame(data=pca.fit_transform(x_scaled), columns=['pc1', 'pc2', 'pc3'])

ax.scatter(principalDf.values[:, 0], principalDf.values[:, 1], principalDf.values[:, 2],
           c=y_num, cmap=plt.cm.nipy_spectral, edgecolor='k')

plt.show()

4.3.5 Singular Value Decomposition


Singular Value Decomposition [6] is a matrix factorization method that takes a matrix M as an
input and produces three matrices U, Σ and Vᵀ such that M = UΣVᵀ. The method is similar to PCA in
technique, execution, and result. In fact, SVD can also be used to find the principal components of
the matrix during PCA. SVD decomposes a matrix into its singular values, which are the core
components along orthogonal axes in an N-dimensional space. The main difference between PCA and
SVD is that SVD does not center and standardize the data before decomposition, while PCA does.
For this reason, SVD can work on sparse matrices, while PCA cannot! However, SVD also suffers from
sign indeterminacy, which means that the signs of the output components depend on the random
state of the algorithm. Truncated SVD is a variant of SVD which truncates the resulting matrices to a
certain dimension. This is the variant used to reduce the dimensionality of data, as it can convert
the data into a lower dimension by truncating the singular value vectors.

When should you use it?

1. You would like to reduce the dimensionality of your data.

2. Your data set is sparse.

3. You don't need the semantic information of the original features.
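As a quick, self-contained check of the decomposition itself (independent of scikit-learn), numpy's linalg.svd can be used to factor a small matrix and verify that M = UΣVᵀ; keeping only the largest singular values then gives a truncated, lower-dimensional representation. The toy matrix below is made up for the illustration.

import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))                  # toy matrix: 6 rows, 4 features

# Full decomposition: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose(M, U @ np.diag(s) @ Vt))   # True: the factorization reproduces M

# Truncated version: keep only the k largest singular values.
k = 2
M_reduced = U[:, :k] * s[:k]                 # lower-dimensional representation of the rows
print(M_reduced.shape)                       # (6, 2)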

Example 4.3.5. In this example, the use of Scikit-Learn’s TruncatedSVD class is demonstrated.
The class performs the truncated SVD algorithm in order to reduce the dimensionality of the
data.

[1]: from sklearn.decomposition import TruncatedSVD


import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing

[2]: iris_df=pd.read_csv('[Link]')
x, y=iris_df.iloc[:,1:-1], iris_df.iloc[:,-1:]
x_scaled=StandardScaler().fit_transform(x)
le = preprocessing.LabelEncoder()
y_num=le.fit_transform(y)

The following code block contains the creation of the singular value vectors. The parameter passed
to the TruncatedSVD class is the target dimension.

[3]: svd = TruncatedSVD(2)


svdDF = pd.DataFrame(data=svd.fit_transform(x), columns = ['sv1', 'sv2'])

[4]: fig = plt.figure(figsize = (8,8))

plt.scatter(svdDF.values[:, 0], svdDF.values[:, 1], c=y_num)

[4]: <matplotlib.collections.PathCollection at 0x2189d6b9a08>



4.3.6 Feature Engineering


Data filtering is one of the most important parts of any data processing or machine learning
process. After checking the data, either by hand or with an algorithm, it is not unusual to find
anomalies and faulty data. Raw data might not be processable in the first place, for example
in the case of text data. Mapping such data into a different representation creates processable feature
vectors.
Feature values can also be examined in order to develop intuition about what kind of features you
can extract from the raw data.
Example 4.3.6. Data sets can contain a number of IDs, depending on what kind of data set is
used. Unique identifiers are used in order to easily distinguish elements in a database.
However, during data analysis or machine learning, IDs provide no real information about the
problem. Therefore, including IDs in feature vectors should be avoided.
Determining whether data is faulty or not is usually done by hand by the developers. The statistical
values of features can be telling about faulty data. For example, calling .describe() on a data frame presents
a statistical overview of it. The statistical overview can be used to detect extremities in our data set.
The minimum, 25% quartile, median, 75% quartile and the maximum are
particularly useful for finding outliers in our data.
Example 4.3.7. If all of your data increases linearly up until the 75% quartile but jumps
exponentially in the last quartile, you might want to inspect it. A few outlier data points can
potentially alter the outcomes of statistical methods and machine learning models in a way that the
outcomes might not represent the original data well.
There are a few solutions to this problem, such as transforming feature data using a logarithmic
function, or clipping. Using the logarithmic value of the feature values can minimize the
effect of extreme outliers. Clipping is the process of removing or capping values based on a logical
criterion. If you have intuition or knowledge about the subject, and after checking the data
you find outliers that could not exist in the real world, they can be considered measurement
errors. The best action to take in this case is to clip the data above or below the threshold.
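A small sketch of these three ideas, using a made-up data frame with a hypothetical Salary column into which a few impossible measurements have been injected:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'Salary': rng.normal(3000, 500, size=100)})
df.loc[:2, 'Salary'] = [50_000, 75_000, 120_000]     # injected measurement errors

# 1. Statistical overview: min, quartiles, median and max reveal the outliers.
print(df['Salary'].describe())

# 2. Logarithmic transform: compresses the effect of extreme values.
df['LogSalary'] = np.log(df['Salary'])

# 3. Clipping: remove rows above a threshold that cannot occur in reality.
threshold = 10_000
df_clipped = df[df['Salary'] <= threshold]
print(len(df), '->', len(df_clipped))                # 100 -> 97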

4.3.7 Feature Crosses


The problem with a lot of statistical models and machine learning tools is that they are only able
to calculate linear outcomes.
In classification, the data is linear if you can separate the data points by class with a single line.
If that is not the case, a non-linear component needs to be included in order to successfully classify
the data. Features are crossed in order to introduce non-linear data into the feature vectors. Crossing
two features is as simple as combining them element-wise, for example by multiplying the two
feature columns (the example below simply adds them).

[2]: iris_df=pd.read_csv('[Link]')

[3]: iris_df

[4]: x, y=iris_df.iloc[:,1:-1], iris_df.iloc[:,-1:]

[5]: x["SepalPetalLenghtCm"] = x["SepalLengthCm"] + x["PetalLengthCm"]

[6]: x

[6]: SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

145 6.7 3.0 5.2 2.3


146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

SepalPetalLengthCm

0 6.5
1 6.3
2 6.0
3 6.1
4 6.4

145 11.9
146 11.3
147 11.7
148 11.6
149 11.0

[150 rows x 5 columns]
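The cell above combines two columns by addition; a more typical feature cross multiplies them element-wise, producing a genuinely non-linear synthetic feature. A minimal sketch continuing from the same x data frame (the column names SepalPetalAreaCm2 and SepalLengthSquared are just illustrative choices):

# Element-wise product of two features: a crossed, non-linear feature.
x["SepalPetalAreaCm2"] = x["SepalLengthCm"] * x["PetalLengthCm"]

# A feature crossed with itself is simply its square.
x["SepalLengthSquared"] = x["SepalLengthCm"] ** 2

x.head()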

Question 67.
What is the data structure which can be used to store and organize a set of measurements?

Question 68.
What data structure would you use to store several sets of measurements?

Question 69.
What is a difference between a data set and a matrix?

Question 70.
What data types can a data set hold?

Question 71.
What software can be used to create and analyze small data sets?

Question 72.
What is the relationship between the features and the target of the data set?

Question 73.
What are features? What is the difference between a raw data array and a feature vector?

Question 74.
What encoding techniques can be used to encode categorical data?

Question 75.
What sort of data types can target values be?

Question 76.
How can you display a descriptive analysis of the data set with pandas?

Question 77.
What is the name of the subfield of data science which is used to analyze and extract information
from data?

Question 78.
What are the steps of the data mining process?

Question 79.
What is the purpose of defining the problem at the start?

Question 80.
Which sources can be used for data sets?

Question 81.

What tools are recommended when working with big and complex data sets?

Question 82.
What is the point of selectively choosing features from the original data set?

Question 83.
What category of algorithm does the principal component analysis fall under?

Question 84.
What does the PCA algorithm achieve?
Question 85.
What are the essential steps of the PCA algorithm?

Question 86.
How can the PCA algorithm be used in visualization?

Question 87.
What are the similarities between the PCA and the SVD algorithm?

Question 88.
What is the name of the variation of the SVD which can be used to reduce dimensionality?

Question 89.
What is the problem with using raw data during data processing?

Question 90.
What are the types of data that have to be transformed in order to be used as features?

Question 91.
What is the benefit of using feature crosses?

Question 92.
Go to [Link] and download a dataset of your choice. Try to import it into Excel, and create
bar plots and pie plots of certain columns. Then try to import the data set into a pandas data frame
using pandas.read_csv(csvname). After importing the data, look into what functionalities can
be used on DataFrames using the pandas DataFrame documentation.
Chapter 5

Machine Learning

In this chapter, we will talk briefly about the recent developments of machine learning technologies
and methodologies. Artificial intelligence has always captivated the human mind. Even
ancient Greek myths described moving, acting, and thinking machines. While the idea has remained
science fiction for the majority of human history, it has recently gained ground as a truly viable and
achievable goal.
Modern artificial intelligence systems [16, 5] will most probably not be able to freely think or feel for
themselves, but they are able to solve problems beyond human capabilities. Therefore, machine
learning models are not considered "intelligent", but rather an excellent tool to solve specific
problems.

Programmers have been able to solve complex problems for more than half a century. While
computer science is not an ancient science, it has achieved much over its short life span. The
difference between programming a solution and machine learning is the explicit programming.
Most problems could be solved by a machine by explicitly programming it to solve the
abstracted problem.

The machine uses its memory and its central processing unit in order to calculate its next action,
although it is not acting on its own whims. The machine is just automatically performing the actions
that it is programmed to do.

Machine learning, however, takes a different approach to problem solving. This approach takes
a methodical and formal examination of the problem and the data in order to create learning
models. The biggest requirement of these types of systems is data. In the last 30 years the
amount of recorded data has been growing exponentially. The other reason machine learning
models can thrive in the 21st century is the growing power of computers. Machine learning
models could not have been built much earlier, as they usually require huge amounts of computing
power.

Therefore, the two biggest requirements for machine learning are computing power and data. If one
possesses both of these requirements, modern machine learning systems can be integrated into
applications, or used independently to predict complex outcomes.
Machine learning processes can be categorized into two distinct classes: supervised and unsupervised
learning processes. We will briefly discuss the particularities of both categories in the
following sections.

5.1 Supervised Learning


Humans, and animals in general, have different methods and ways of learning. Receiving
information from the outside world from different sources can improve one's ability to perform
certain tasks. The information can consist of either abstract lessons or practical insights. The first
category of machine learning can be thought of as the machine equivalent of concept learning in humans.
Concept learning is the abstract formulation of a type of learning in which the participant receives
information about different examples and their corresponding, distinguishable categories.
Common examples can be categorized into separate mental classes. The result of accumulating
a large number of examples and their corresponding mental classes is generalization. Using
generalization, one can imagine the outcome of new examples by piecing together already
experienced examples.
Supervised machine learning [4] is the idea of concept learning, implemented in an algorithmic
form. There are many different algorithms that implement the idea of supervised learning. The
goal of these algorithms is always to learn a set of outcomes from available data associated with
individual outcomes. The data are the feature vectors, and the outcomes are the target feature
vector.

5.2 Training and Loss


Let's say that we have a complex system that uses N inputs in order to generate an output t. Let us
formulate the system as a function in the following way:

f(v) = t, where v ∈ R^N, t ∈ R

The f function is the mathematical function equivalent of the original system. If we do not
know how the system processes the input in order to create the result, the system is essentially
a black box.
The goal is to mirror the system, by creating a function g with the same mapping as the original.
Mathematically, we can create the following basic formulation for supervised learning:

g(v) = t, where v ∈ R^N, t ∈ R

Where g is the learning function that we use to learn the original function f, v is the feature vector
of length N, and t is the target we are trying to predict. The "learning" is trying to
approximate the function f with different methods without explicitly knowing the original
v → t mapping.
Of course, we can also formulate a number which describes the difference between f and g:

L = l( f(v) − g(v) )

The L is usually called the loss of the model. It is called a loss because it measures the difference
between the actual output and the predicted output from the function g. The function l is
called a loss function. The loss function is some mapping of the difference that results in a useful
measurement of the loss value. This function can be different for different tasks. For example, if you
would like to classify cats and dogs, you would like to use a loss function that counts the
number of incorrect answers made by the function g.
If the loss is 0, the functions f and g are identical, and we have successfully created a machine
learning model which outputs the exact same result as our black-box function f. Therefore, our
goal should be to minimize L by changing the learning function g to better learn the mapping
of f.
Of course, this does not happen, and should not happen, because approximating a function will
always carry some error. The original function f will work for whatever input combination it receives,
and will always output the correct result. As the function g learns from examples, its knowledge
goes only as far as the experienced input and output combinations. So if g receives an
unknown combination of feature values, it might output slightly different values.
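As a toy illustration of this formulation (the functions f and g below are made up for the example), the following sketch treats f as the hidden black-box system, g as an imperfect approximation of it, and uses the squared difference as the loss l:

import numpy as np

def f(v):                       # the hidden system: N = 2 inputs, one real output
    return 2.0 * v[0] + 3.0 * v[1]

def g(v):                       # our learned approximation of f (slightly off)
    return 1.9 * v[0] + 3.1 * v[1]

def loss(v):                    # l is chosen here as the squared difference
    return (f(v) - g(v)) ** 2

v = np.array([1.0, 2.0])
print(f(v), g(v), loss(v))      # 8.0, 8.1 and a loss of about 0.01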

5.2.1 Train-Test split


Training the machine learning model is done by feeding it input and output data in order to
increase the accuracy of the approximation method. Increasing the amount of data also increases
the accuracy, as the model will have more examples to learn from in order to predict
new outcomes.
However, there is an important remark we have to go over, which is the importance of testing
machine learning models. Testing a machine learning model is as simple as feeding it data
and calculating the loss of the outcome relative to the true results. However, the model must not
be biased towards previously learned data while testing. If the same set of data is used during
testing as during training, the model would have already learned the outcome of those combinations
of inputs. Therefore, testing with the same inputs and outputs as the model trained with is not
a good idea.
Therefore, data must be split into at least 2 parts:
1. Training set
2. Testing set
The model trains with the training set, calculating losses, and modifying the model accordingly. After
a desirable accuracy has been reached, the testing set is used to evaluate the trained model. This
provides a much better testing environment for our model, as model accuracy can be tested on
previously unprocessed data.
Another important aspect of splitting is random shuffling of data. Data might contain some sort
of ordering between elements that is tied to the indexing of the data set. If the data is split statically,
the ordering will remain in the data, and during learning the model might not learn important
elements of the data set that only appear at the end of the data set.

Train-Test-Validation split

The data set can be split even further into three separate data sets, which increases the testing
accuracy and the generalization ability of the model. The training and the testing sets remain, but
a new data set is created, called the validation set. The validation set is used for calculating the
loss after the training phase in order to determine model accuracy. Performing modifications on
the model could result in an increase of accuracy on the validation set.
This is required if the particular model has parameters which are not tied to the training itself, but
rather serve as inner parameters of the model. The process of finding such high-accuracy
parameters is called hyperparameter search. Finally, after achieving the desired accuracy on the
validation set, the testing set is used to evaluate the final trained model.

Split size
The problem with splitting the data set is that the model will have less data to train on. However,
if the testing data is not used in a separate testing phase, the model's generalization ability
might suffer. Therefore, it is a balancing act to determine the right ratio of training and testing
data. Decreasing the amount of training data will result in a less accurate model, but disregarding
the testing data creates a situation where there is no way to properly test the model
accuracy.
A commonly advised split ratio is 67% training data and 33% testing data. However, it is always important to
keep in mind that there is no golden rule. Smaller data sets might require smaller testing sets, as models
might not be able to properly learn from a small sample of data.
Example 5.2.1. In this example, we will use scikit-learn's train_test_split() method to easily split the data
set into 2 parts:
1. Training feature vectors and target features
2. Testing feature vectors and target features
The dataset will be used in regression problems in a later example.

[9]: import matplotlib.pyplot as plt


import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split

[10]: diabetes_X, y = datasets.load_diabetes(return_X_y=True)


X = diabetes_X[:, np.newaxis, 2]

X_train, X_test, y_train, y_test = train_test_split(


X, y, test_size=0.33, random_state=42)

5.2.2 Linear Regression


Linear Regression [17] is the simplest example of a supervised learning algorithm. It originated from
statistics, and it is one of the fundamental algorithms in modern artificial intelligence and machine
learning.
Regression analysis is the process of predicting an unknown variable from one or more predictor
variables. Regression tasks are also one of the categories of machine learning problems. If
a problem involves regression, there is no discrete set of output
values. Instead, the output values are the product of a continuous function, therefore they are real
numbers. Examples of regression tasks are housing price prediction and forecasting. Linear
regression is among the simplest forms of regression. Regardless, linear regression is often used
to this day to approximate simple functions.
In linear regression tasks, a combination of input features is used in order to calculate an
output value in an N-dimensional space. With one predictor feature, the linear regression model can
be formulated as:
f : x → ŷ
Where x is the input feature, y is the target value and ŷ is the predicted outcome. The goal is to
determine the linear function f for which the loss between ŷ and y is minimal.
The linear function with 1 feature can be formulated as:

f(x) = w0 + w1 ∗ x

This formulation is the same as the equation of a line, y = m ∗ x + b, where m is the slope and b is the
y-intercept. In this equation, y and x are given: y is the actual output, and x is the
input variable. In order to change the model predictions, w0 and w1 can be changed. Changing
these parameters changes the outcome of the function, and therefore the loss. Originally, the function
was optimized using the least squares method, but a number of different loss measurements have been
developed over the years:
1. Mean Squared Error, or MSE.

   Loss = (1/n) ∑_{i=1}^{n} (target value_i − predicted value_i)²

   MSE calculates the squared difference between the actual output and the predicted output.
   This error is used in the least squares method, and provides a good measure of estimator
   quality. An MSE of 0 means that the estimator predicts the same output as the original outputs for
   every input. The problem with MSE is that outlier values can generate a large amount of error,
   which can be problematic during training.
2. Mean Absolute Error, or MAE.

   Loss = (1/n) ∑_{i=1}^{n} |target value_i − predicted value_i|

   MAE calculates the average absolute error of the given predictions. The error does not
   account for direction, as MAE handles errors in an absolute manner. In case the errors can
   be represented along a linear scale, MAE can be a useful measurement of error; for example,
   when an error of 10 is half as bad as an error of 20.
3. Root Mean Square Error, or RMSE.

   Loss = √( (1/n) ∑_{i=1}^{n} (target value_i − predicted value_i)² )

   RMSE takes the root of the MSE and is indifferent to the direction of the error. RMSE
   represents the quadratic mean of the differences between the actual and the predicted
   output. The problem of outlier data is still present in RMSE, although in a less problematic
   way. RMSE is useful for calculating the loss when large errors caused by outliers are
   particularly undesirable. If errors are on a quadratic scale, RMSE can be a particularly useful
   measurement of error.

Please note that n is the number of rows in the features.
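A minimal numpy sketch of the three error measures above, computed for a small made-up pair of target and prediction arrays:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])      # target values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])      # predicted values

mse = np.mean((y_true - y_pred) ** 2)        # Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))       # Mean Absolute Error
rmse = np.sqrt(mse)                          # Root Mean Square Error

print(mse, mae, rmse)                        # 0.875 0.75 0.935...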

Linear regression can be easily visualized in two-dimensional plots, but can be performed on
higher-dimensional data sets. One of the most important attributes of linear regression models is
that they are only capable of finding linear relationships between the features and the target. In
short, the algorithm is trying to find a straight line across the data points which best represents
the linear relationship between the feature vectors and the target feature. If the data points
are positioned in the given N-dimensional space in such a way that a straight line cannot
represent them well, then linear regression is not an appropriate model for the task.

Example 5.2.2. In this example, a simple linear regression model will be created using the previously
split Diabetes data set. Scikit-learn's LinearRegression() model is used to fit the line onto the
data set. Calling the fit() method trains the model with the training data X_train and the target
features y_train. Finally, the predicted targets are calculated using the predict() function with
the test feature values.

[11]: regr = linear_model.LinearRegression()


regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

The slope and the y-intercept are inspected using the regressor's .coef_ and .intercept_ attributes.
The previously discussed loss measurements are calculated using scikit-learn's mean_squared_error() and mean_absolute_error() functions.

[12]: print('Coefficients: \n', regr.coef_)


print('Intercept point: \n', regr.intercept_)
print('Mean squared error: \n', mean_squared_error(y_test, y_pred))
print('Mean Absolute Error: \n', mean_absolute_error(y_test, y_pred))
print('Root Mean Squared Error: \n', np.sqrt(mean_squared_error(y_test, y_pred)))

Coefficients:
[972.87627375]
Intercept point:
150.2626749624518
Mean squared error:
3934.0672763273196
Mean Absolute Error:
50.968422060002595
Root Mean Squared Error:
62.722143428994194

A bar plot is generated in order to show the difference between the first 25 predicted and actual output
values. Errors between the two outputs can be seen in the plot.

[15]: df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})

df1 = df.head(25)
df1.plot(kind='bar', figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

A line plot is generated in order to show the predicted line across the actual outputs.

[14]: plt.scatter(X_test, y_test, color='grey')

plt.plot(X_test, y_pred, color='green', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

The example shows a data set for which it is hard to find a line that fits the data well. All
measurements of error are relatively high, and therefore a linear regression model might not be the
best solution to our problem.

5.2.3 Logistic Regression


Logistic Regression [24] is a different form of regression, which is used to model the probability of
distinct classes. Logistic regression models belong to a different type of machine learning models called
classifiers. For each distinct class of outcomes, the logistic regression model will output a probability,
or the odds, of a certain outcome occurring. If there are only two possible outcomes, the model is
called a binary logistic model. Two distinct classes appear regularly in classification tasks. The odds
can be used to predict the actual outcome of the input variables. The odds are calculated by the
following formula:

odds = P(x) / (1 − P(x))

Let us say that out of 100 tries, a person has a 0.25, or 25%, probability of guessing the right
answer to a question with just one hint. Therefore, the odds are 0.25 / (1 − 0.25) ≈ 0.3333. If for
some reason they were able to guess correctly 80% of the time with one hint, the calculation would
be 0.8 / (1 − 0.8) = 4. As you can see, the difference between probability and odds is that odds are not
bounded between 0 and 1, but between 0 and ∞.
The trick is to take the logarithm of the odds in order to transform it onto the range (−∞, ∞). This
transformed log-odds value is linear across all probabilities. The goal of logistic regression is to use
an estimator model to approximate the log-odds mapping of the original data. For this, we use
regression analysis.
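The odds and log-odds used in the example above can be verified with a few lines of Python (the odds helper below is just for illustration):

import numpy as np

def odds(p):
    return p / (1.0 - p)

print(odds(0.25))             # 0.333...: a 25% probability
print(odds(0.8))              # 4.0: an 80% probability

# The log-odds (logits) map probabilities onto the whole real line.
print(np.log(odds(0.25)))     # about -1.10
print(np.log(odds(0.8)))      # about 1.39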
The problem becomes:
w0 + w1 ∗ x = log(odds)

Where x is the feature vector (in this example, the hints). The line parametrized by w0 and w1 is
called the decision boundary of the logistic regression. The line creates two distinct classes in the
data, and depending on the odds of the input, the data can be categorized into these classes based on
which side of the boundary the odds fall. Luckily, the log-odds values, which are called logits, can be
converted into probabilities using the logistic function:

f(x) = 1 / (1 + e^(−x))

The cost function works by calculating the odds of the input values using the two variables w0 and
w1 and the logistic function. A logarithm is used during the error calculation; the error function
in the case of logistic regression is called the logistic loss.
The individual loss of a given data point can be calculated by:

L_xi = −log(logistic(xi))        if y = 1
L_xi = −log(1 − logistic(xi))    if y = 0

Where y is the actual class of the given input row xi. In order to calculate the loss of the logistic
regression, the mean of all individual losses has to be calculated:

L = (1/m) ∑_{i=1}^{m} L_xi
If there are more than two possible categories of outcomes, multinomial logistic regression models are
used.
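A small sketch of the logistic loss defined above, computed for a handful of made-up true classes and predicted probabilities (the logistic_loss helper is just for illustration):

import numpy as np

def logistic_loss(y, p):
    # y is the true class (0 or 1), p is the predicted probability of class 1.
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.4]

losses = [logistic_loss(y, p) for y, p in zip(y_true, p_pred)]
print(np.mean(losses))        # mean logistic loss over the m = 4 samples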
Example 5.2.3. The following example will contain the use of logistic regression on a three class
classification task. The Iris data set is used as the data, where the selected features are the sepal
length, and width, and the target feature is the variety of iris flower encoded onto an integer
array.

[1]: import matplotlib.pyplot as plt


import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split

[2]: iris_X, y = datasets.load_iris(return_X_y=True)


X = iris_X[:, :2]

X_train, X_test, y_train, y_test = train_test_split(


X, y, test_size=0.33, random_state=42)

[3]: regr = linear_model.LogisticRegression()


regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

[4]: y_pred

The predicted values for the flower variety can be inspected as the return value of the regressor’s
predict() function.

[4]: array([1, 0, 2, 1, 2, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 2, 2, 1, 1, 2, 0, 2,
0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 1, 2, 0, 0, 0, 1, 2, 2, 0,
0, 1, 2, 2, 2, 2])

As the predicted values are discrete, the values of the bar plot show which values were predicted
incorrectly by the logistic regressor.

[5]: df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})

df1 = df.head(25)
df1.plot(kind='bar', figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

The following figure shows the data points, with the x axis being the continuous sepal width and the y
axis being the flower type. The colour blue is assigned to the actual target values, while the red dots
represent the predicted values. It can be clearly seen that the regressor has misclassified some of the
data.
[6]: plt.figure(dpi=150)
plt.scatter(X_test[:,1], y_test, color='blue')
plt.scatter(X_test[:,1], y_pred, color='red')

plt.xticks(())
plt.yticks(())
plt.show()

The last figure shows the data points in a two-dimensional space, along the sepal length and the
sepal width axes. The colour of the dots represents the actual target values, and the background
colour represents the predicted decision boundaries.

[7]: plt.figure(dpi=150)
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = .02
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = regr.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

5.3 Training Process


The training process of various models is one of the most essential elements of machine learning.
Training is the process of feeding the model data and predicting a target value. The
predicted value is then put into a loss function, some of which were discussed in subsection 5.2.2
and subsection 5.2.3.

The calculated error is then used to iteratively modify the model's weights and biases in order to
perform better at the given task. The modification of these parameters can be different for different
models. We are going to present the method of machine learning training on a Feed Forward
Artificial Neural Network model (Feed Forward ANN).

5.3.1 Perceptron
Artificial neural network models have been one of the most important technological tools of the past
10 years. However, artificial neural networks have existed long before that.

The idea of a network of artificial neurons originates from the 1950s, when researchers
formulated an artificial agent. The agent was called a Perceptron [15], and it was capable of processing
incoming data through incoming connections, and then outputting the processed outcome through an
outgoing connection to the environment. The basic structure of the artificial perceptron can be seen
on Figure 5.1.

The idea behind the perceptron is to transform the list of input values in a way that produces a
given target output. The similarities to the previously discussed regression models in subsection
5.2.2 and subsection 5.2.3 are not a coincidence: perceptrons essentially perform the same tasks as
the regression models.

For a given input vector X, the perceptron produces an output ŷ. In linear regression, a linear function was
searched for the given data points in such a way that the difference between the output of the linear
function and the original data set was minimal. Perceptrons perform the same task by weighting
certain input values. The input connections of the perceptron represent the different features of the
data set.

A perceptron constructed to learn a data set with two features and one output target would have
two input connections and one output connection. The different input connections transmit
the different feature values to the perceptron. After receiving input, the perceptron calculates a
linear combination of the inputs. Each input is weighted in the linear combination according to a
weight parameter of the connection.

For two input values, the calculation uses the following formula:

(x1 ∗ w1 + x2 ∗ w2) + b

Where x1 and x2 are the feature values, w1 and w2 are the connection weights, and b is a bias.
The bias is a unique weight of the neuron, representing the y-intercept value of the linear function.

Finally, the result of the linear combination of inputs and weights is passed through an activation function f:

o = f((X ∗ W) + b),  where W = [w1, w2], X = [x1, x2]

X ∗ W is the dot product of the input vector and the weight vector.

Activation Functions

Activation functions can be any function, but specific functions have become standard over the
years. Originally, no activation function was used, which can be represented with the identity
function:
f(x) = x
The identity function can be useful when trying to solve regression problems with a perceptron. The
problem with the identity function is that the linear combination of values is only capable of
producing linear outputs. Therefore, if the data set contains non-linear relationships between
features, the perceptron will not be able to predict them effectively. This was such a big problem that it
originally caused one of the AI winters.
Nobody thought that the perceptron was of any use, and the idea was abandoned altogether.
The research surrounding perceptrons and neural networks continued after new,
non-linear activation functions were introduced. Non-linear activation functions allow the modelling of the
non-linearity of the data set. The first non-linear activation function was the hyperbolic tangent (tanh())
function:

tanh(x) = (1 − e^(−2x)) / (1 + e^(−2x))

As mentioned before, the tanh function is a non-linear function. It is bounded in the range of
(−1, 1). This boundary gives the perceptron an output transformation that
can be negative. After the success of the tanh function, many more activation functions have
been identified to have unique and useful properties for the output transformations.

The logistic function has been discussed before in subsection 5.2.3. The logistic function itself
is a non-linear function that transforms any value into the range of (0, 1). This can be useful,
as it transforms values into the boundaries of probability. In modern machine learning the
logistic function is usually called the sigmoid function, and both of these terms refer to the same
function:

σ(x) = 1 / (1 + e^(−x))

Logistic regression has shown that the logistic function can be used for modelling probability, and
therefore classification. However, the logistic function is mostly effective if the classes are binary.
The principles are the same with the perceptron.

Advanced Activation Functions

Another useful activation function for classification tasks is the softmax function. The softmax
function can be used for multi-class classification tasks. The output of this activation function is a
vector containing the probabilities of each class. These classes make up all possible choices,
therefore the sum of all probabilities in the vector is 1. In other words, the softmax function generates a
probability distribution from the input features:

softmax(x)_i = e^(x_i) / ∑_j e^(x_j)

A more recently adopted variant of the identity function used for neural network training is
the relu function. Relu stands for Rectified Linear Unit, as it is a linear identity function with a cut-off
at y = 0:

relu(x) = 0 if x < 0
relu(x) = x if x >= 0

Relu is useful, as it is non-linear and easy to compute, despite being called a linear unit. Because
of the easy computation, it is also fast. Relu retains important information in its original form,
instead of transforming it onto different boundaries. The downside of this property is that the
relu function does not scale big values down, which can cause problems during calculation.
Many of deep learning's successes are attributed at least partly to the relu activation function.
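Each of the activation functions discussed above can be written in a couple of lines of numpy; the following minimal sketch applies them to a small example vector (the max-shift inside softmax is only there for numerical stability):

import numpy as np

def tanh(x):
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))       # shifting by max(x) keeps the exponentials stable
    return e / e.sum()

def relu(x):
    return np.maximum(0, x)

x = np.array([-2.0, 0.0, 3.0])
print(tanh(x))                      # bounded in (-1, 1)
print(sigmoid(x))                   # bounded in (0, 1)
print(softmax(x))                   # non-negative values that sum to 1
print(relu(x))                      # [0. 0. 3.]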
Example 5.3.1. Let us go through an example of what a perceptron calculates for a given input.
To summarize how perceptrons work, we will use the individual parts in the following example.
The perceptron processes the input using the following formula: o = f((x1 ∗ w1 + x2 ∗ w2) + b).
1. x1, x2 ∈ X is the input vector
2. w1, w2 ∈ W is the weight vector
3. b is the bias of the perceptron
4. f() is the activation function
Let us say that W = [2, 1] and b = 2, where W is the vector form of (w1 = 2, w2 = 1).
Let us assume that our input is X = [1, 3].
The calculation should now look like this:

f((1 ∗ 2 + 3 ∗ 1) + 2) = f(5 + 2) = f(7)

Let us use the logistic activation function f(x) = 1 / (1 + e^(−x)):

f(7) ≈ 0.9991
[1]: import numpy as np

def sigmoid(x):
    return 1/(1 + np.exp(-x))

[2]: sigmoid(7)

[2]: 0.9990889488055994

So, for the input values X = [1, 3], a perceptron with weights W = [2, 1] and bias b = 2 outputs
approximately 0.9991.
Let us create a basic implementation of this structure in python.
The first block contains a Perceptron class which implements the basic functionalities of the per-
ceptron. The __init__ method is the constructor of the class, which can be called by instantiating
104 CHAPTER 5. MACHINE LEARNING

the class by Perceptron(). The weights, and bias are parameter values, while the activ_function represents
the activation function which is also passed as a function parameter.

[1]: import numpy as np

class Perceptron:

    def __init__(self, W, b, activ_func):
        self.W = W
        self.b = b
        self.activ_func = activ_func

    def execute(self, x):
        return self.activ_func(np.dot(self.W, x) + self.b)

The sigmoid function is used for the activation function, which can be passed as a parameter.

[2]: def sigmoid(x):
         return 1/(1 + np.exp(-x))

[3]: W=[2,1]
b=2
perceptron=Perceptron(W, b, sigmoid)

The execute() function can be called with a set of input values in order to calculate the outcome. The
calculated result is the same as the value calculated by hand.

[4]: perceptron.execute([1,3])

[4]: 0.9990889488055994

5.3.2 Neural Networks


A single perceptron can be thought of as a processing unit capable of calculating simple linear or
non-linear transformations, with inner states described by weights. The calculations can be
done by hand, and are not very complicated.
Connecting these perceptrons creates a chain of calculations. At first, the computation is performed on
the input vector, and then the output of the first perceptron becomes the input of the next one.
Creating a chain of perceptrons continuously updates the inner values based on the perceptrons'
weights and the activation functions they use. The output of the last perceptron is the final
output of the network. Perceptrons can be chained together in different ways, as a perceptron
could be connected with different perceptrons which receive the input simultaneously.
The term perceptron is usually used for a single computation node, while neuron is used if it is
embedded in a network. A connected network of neurons is called an artificial neural network [18].
As mentioned above, neurons can receive and send input and output values from
and to multiple different nodes. A set of neurons which are only connected to previous and
subsequent neurons in the chain is called a layer. If the output connections of the neurons are only
connected to neurons in the next layer, the network is called a feed forward artificial neural network.

There are three distinguished layers in feed forward networks. The input layer represents the
incoming data features; the number of its nodes equals the number of features in the data set. The
output layer contains the output neuron nodes, and the outputs of the network are the results of
their calculations.

If there are any layers between the input and the output layers, they are called hidden layers. The
hidden layer provides complexity to the artificial neural network. A single layer of neurons in the
hidden layer provides a way to further process the data before it reaches the output neurons.
Increasing the number of hidden layers creates an inherent complexity of calculations that is very
hard to follow and calculate by hand.

Each neuron connection has a separate weight and each neuron has a separate bias.
This structure is sufficiently complex to approximate complicated functions. The problem with
increasing the number of hidden layers is that the model essentially becomes a black box, with
little to no way to explain the reason behind its outcomes.

Example 5.3.2. In the next example, the output of a simple neural network with 3 neurons will be
calculated.

Let us work with our previous example. Assume that all of the neurons' connections use the
same weight vector W = [2, 1], and a bias of b = 2. The network consists of an output neuron and
two hidden neurons.
With the input X = [3, 1], the calculation of the hidden neurons will be the following:
1. f(W ∗ X + b)
2. h1 = sigm((2 ∗ 3 + 1 ∗ 1) + 2)
3. h1 = sigm(9) ≈ 0.9999 = h2

Since all weights and biases are the same, the output of both hidden neurons is going to be the same.
Since the hidden neurons are connected to the output neuron, their output is sent to the output
neuron.
1. o = sigm((2 ∗ 0.9999 + 1 ∗ 0.9999) + 2)
2. o = sigm(4.9996)
3. o ≈ 0.9933

Therefore, the final output of the artificial neural network will be approximately 0.9933.
In the following code blocks, a simple 3-neuron artificial neural network is implemented.
Separate weights, biases, and activation functions could be used, but all neurons
use the same weights for the sake of keeping the implementation simple. The perceptron
model from the last example is renamed to Neuron. The ANN class uses global weights and
biases with two hidden neurons and one output neuron. The neural network calculation is
performed by calling the output neuron's execute function with the return values of the hidden
neurons' outputs.

[1]: import numpy as np

class Neuron:

    def __init__(self, W, b, activ_func):
        self.W = W
        self.b = b
        self.activ_func = activ_func

    def execute(self, x):
        return self.activ_func(np.dot(self.W, x) + self.b)

[2]: class ANN:

    def __init__(self, weights, g_bias, activ_func):
        self.weights = weights
        self.g_bias = g_bias
        self.h1 = Neuron(self.weights, self.g_bias, activ_func)
        self.h2 = Neuron(self.weights, self.g_bias, activ_func)
        self.o = Neuron(self.weights, self.g_bias, activ_func)

    def execute(self, x):
        return self.o.execute(np.array([self.h1.execute(x), self.h2.execute(x)]))

[3]: def sigmoid(x):
         return 1/(1 + np.exp(-x))

[4]: W=[2,1]
b=2
ann=ANN(W, b, sigmoid)

[5]: ann.execute([3,1])

[5]: 0.993304687623845

You might wonder how this model is capable of predicting the outcomes of complex processes.
The network is composed of only weights and data transformations. How is this going to
classify what kind of flower is being described, or whether or not a person has diabetes, based
on previous data?
Of course, a neural network can be thought of as a form of function approximation. There is
a hidden function which takes a number of inputs and outputs the right answer all the time.
As discussed before, machine learning models are trying to learn this hidden function through
learning. But what exactly is the process of learning? In the case of linear and logistic regression,
it was the modification of two parameters and the calculation of the loss. The training process is
essentially the same for all supervised machine learning models.
Training a neural network involves modifying independent parameters in order to slightly
modify the outcome.
1. First, the output is calculated for a set of the input data.
2. The loss is calculated between the actual outcomes and the predicted outputs.
3. Based on the loss, an optimizer algorithm is used to slightly modify the parameters of the
model.
The optimizer must be able to decide which parameter to change and in what way. A slight change of
model parameters can change the output drastically. Therefore, changing the parameters once
is not enough; in machine learning, one iteration of training is usually called an "epoch".
The model is trained until the loss of the model drops below a minimal amount of difference, or until the
predefined number of epochs has passed. In the case of the neural network, the trainable
parameters are the weights and biases of each connection between the neurons.

5.4 Model Optimization Methods


Mathematical optimization is often described as the process of finding and selecting the best
element out of a set of available values. The process of optimization usually consists of a maximum
or minimum constraint that has to be fulfilled as the optimization term. In machine
learning and data science, optimization methods are used for a wide range of applications.
Optimization techniques and methods are a big field of mathematics, and we would only like to
cover the parts most essential for machine learning applications.
To understand how the optimization algorithms work, let us look at an example.
Example 5.4.1. Let us look at the following table of data:

Index x_1 x_2 y


0 1 2 1
1 3 4 0
2 5 6 1
3 7 8 0

We’ll use our neural network, and we’ll get the following table of outputs:

Index y ŷ
0 1 1
1 0 1
2 1 1
3 0 1

Let us calculate the mean squared error of the actual and the predicted outputs.

MSE = (1/4)((1 − 1)² + (0 − 1)² + (1 − 1)² + (0 − 1)²) = (1/4)(0 + 1 + 0 + 1) = 2/4 = 0.5

So, despite our model only ever producing 1s, it is still right 50% of the time! Some might
call that a win, but you might quickly realise that being right 50% of the time actually means
that the model is completely unreliable. Of course, looking at the outputs we can see that the
model has not actually learned anything; it always predicts 1. Nevertheless, we are going to
improve on this by training our model.

In the following codeblocks, a simple implementation of the mean squared error loss can be seen.

[1]: import numpy as np


def mean_squared_error(y, y_predict):
    return np.mean((y - y_predict)**2)

[2]: y=np.array([1,0,1,0])
y_predict=np.array([1,1,1,1])

mean_squared_error(y, y_predict)

[2]: 0.5

5.4.1 Optimization Process

As we discussed before, the optimizer's task is to find the optimal model state in order for it to
function as an appropriate approximator of the original function. In this case, the measurement
we are using is the MSE. Therefore, the job of the optimizer is to iteratively reduce the
value of the MSE.

By looking at the table of data, we know that the model has actually made two mistakes:

Index x_1 x_2 y


1 3 4 0
3 7 8 0

So the goal of the optimizer is to change the connection weights and biases in some way that the
neural network outputs 0 for these inputs. Of course, changing the weights could also change the
outputs that were previously predicted correctly.

In order to optimize this process, we will have to look at this problem from a different angle.

On Figure ??, the structure of the neural network can be seen with its individual weights and
biases.
The output is the product of the input and the transformations occurring in the neural network. This
transformation can only be changed by changing the weights and biases. Therefore, the loss of the
whole neural network can be expressed as the following function:

L(w1, w2, w3, w4, w5, w6, b1, b2, b3)


Changing any of these variables will consequently change the loss as well.
Let us try to change w2 in order to change the loss.
Example 5.4.2. Since L is a multivariate function, it can be differentiated. In this
example we will use only one row of data:

Index x_1 x_2 y

1 3 4 0

In this case, our loss is the following:

MSE = (y − ŷ)² = (0 − ŷ)² = ŷ²



In order to calculate how much changing the w2 weight would change the loss, the partial
derivative ∂L/∂w2 has to be calculated.
At this point, the calculation becomes problematic, because actually calculating this value by
hand, or by a computer, takes a lot of resources. Therefore, we are going to use a very
useful formula of calculus, the chain rule. The chain rule decomposes composite differentiable
functions.
Using the chain rule on this partial derivative creates the following decomposed derivatives:

∂L/∂w2 = (∂L/∂ŷ) ∗ (∂ŷ/∂w2)

In order to calculate ∂L/∂ŷ, recall that we have already concluded that in this simple case
the loss is (0 − ŷ)² = ŷ². Therefore, the calculation becomes the following:

∂L/∂ŷ = −2(y − ŷ) = 2ŷ
∂ŷ
We have successfully calculated one part of the equation, but ∂ŷ/∂w2 remains to be calculated.
In order to calculate this component of the derivative, we have to know what the output of the
network is. As discussed before, this can be calculated by ŷ = f(w5 ∗ h1 + w6 ∗ h2 + b3). The
w2 weight only affects the neuron h1. Therefore, we can calculate ∂ŷ/∂w2 in the following way:

∂ŷ/∂w2 = (∂ŷ/∂h1) ∗ (∂h1/∂w2)
Where the first term can be calculated using the derivative of ŷ:

∂ŷ/∂h1 = w5 ∗ f′(w5 ∗ h1 + w6 ∗ h2 + b3)

Where the derivative function f′() is the derivative of the sigmoid function. One important
attribute of activation functions is that they are always easily differentiable. The reason is that
during training, the function has to be differentiated a huge number of times. Therefore, to save time
and computing requirements, it is only logical to choose functions which can be differentiated
easily.
The derivative of the sigmoid function sigm(x) = 1 / (1 + e^(−x)) is sigm′(x) = sigm(x) ∗ (1 − sigm(x)).
We can calculate the second term of the previous decomposed derivative using h1 = f(w1 ∗ x1 + w2 ∗ x2 + b1):

∂h1/∂w2 = x2 ∗ f′(w1 ∗ x1 + w2 ∗ x2 + b1)

We have calculated all parts of the decomposed partial derivative:

∂L/∂w2 = (∂L/∂ŷ) ∗ (∂ŷ/∂h1) ∗ (∂h1/∂w2)

This process is called backpropagation.



Backpropagation

Backpropagation [9] is used by gradient-based optimization methods. The idea behind this method
is to use the outcomes of the function to propagate back the importance of each parameter. The
parameter importance is propagated back by calculating the partial derivatives
backwards from the output towards the input layer.
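The chain-rule calculation above can be checked numerically. The sketch below rebuilds the small network (two hidden neurons, one output neuron, sigmoid activations, squared-error loss) with an arbitrary, made-up set of weights, computes ∂L/∂w2 analytically exactly as derived above, and compares it against a finite-difference estimate; the weight layout (w1, w2 feeding h1; w3, w4 feeding h2; w5, w6 feeding the output) is the convention assumed in the derivation.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

# Arbitrary weights and biases, and the data row used above (x1=3, x2=4, y=0).
w1, w2, w3, w4, w5, w6 = 0.5, -0.2, 0.8, 0.3, -0.6, 0.9
b1, b2, b3 = 0.1, -0.1, 0.2
x1, x2, y = 3.0, 4.0, 0.0

def loss(w2):
    # Forward pass of the small network followed by the squared error.
    h1 = sigmoid(w1 * x1 + w2 * x2 + b1)
    h2 = sigmoid(w3 * x1 + w4 * x2 + b2)
    y_hat = sigmoid(w5 * h1 + w6 * h2 + b3)
    return (y - y_hat) ** 2

# Analytic gradient: dL/dw2 = dL/dy_hat * dy_hat/dh1 * dh1/dw2
z1 = w1 * x1 + w2 * x2 + b1
z2 = w3 * x1 + w4 * x2 + b2
h1, h2 = sigmoid(z1), sigmoid(z2)
z3 = w5 * h1 + w6 * h2 + b3
y_hat = sigmoid(z3)

dL_dyhat = -2 * (y - y_hat)
dyhat_dh1 = w5 * sigmoid_deriv(z3)
dh1_dw2 = x2 * sigmoid_deriv(z1)
analytic = dL_dyhat * dyhat_dh1 * dh1_dw2

# Numerical check with a small finite difference over w2.
eps = 1e-6
numerical = (loss(w2 + eps) - loss(w2 - eps)) / (2 * eps)
print(analytic, numerical)          # the two values agree closely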

5.4.2 Gradient Descent


Now we know what methods to use in order to optimize our weights, calculate the loss, and predict
outputs. However, in order for the artificial neural network to work, we need an optimization
strategy. Gradient descent optimizers were the first, and remain one of the most prominent,
optimization methods for ANNs. Variants of gradient descent use backpropagation to update
weights individually. In gradient descent, the previous partial derivative is calculated for every
weight and every input combination. This is an exhaustive optimization of the weights, as all
inputs are used during every epoch. It is unusual to use plain gradient descent as an optimizer for a
machine learning model, as input data sets are usually very large.
Instead of gradient descent, stochastic gradient descent and mini-batch gradient descent are
two optimizers that are currently used with great results.

Stochastic Gradient Descent

The stochastic gradient descent (SGD) algorithm uses the same ideas and formulas as the gradient
descent algorithm. The difference is that while gradient descent calculates the gradient over all
training input values, the stochastic version randomly chooses one data row to
optimize the whole network on. Therefore, each step of stochastic gradient descent is significantly faster.
However, since only one input value is included during each optimization step, it also takes
significantly more epochs to train the network.
Training with the stochastic optimizer is also volatile, because the choice of input can significantly
change the efficiency of the optimization process. As a side effect of this volatility, multiple runs
could produce different accuracies. The random nature of this method can also prove useful in
certain scenarios, as the volatility can be used as a way to more effectively find a global optimum.
This effect is attributed to the random jumps in loss when optimizing with certain values in the
data set that might not follow the general shape of the data set.

Mini-batch Gradient Descent

The mini-batch method was born as a combination of gradient descent and its stochastic variant.
The method picks a random batch of data rows from the input. The number of elements
in the batch is not fixed and can usually be set. Setting this parameter to 1 results in stochastic
gradient descent, while setting it to the number of rows results in the gradient descent optimization
algorithm.
Mini-batch gradient descent is one of the most widely used optimizer algorithms in machine
learning. It provides a good middle ground between the volatile SGD and the slow
gradient descent algorithm. The gradient descent algorithm is still more reliable, but numerous
experiments have shown that the added stability does not improve the learning tremendously.
The benefit of using mini-batching instead of the stochastic variant is the lack of random movement in the
loss during training. Mini-batching provides a relatively fast and stable optimization process.
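
As a small illustration of the batch size parameter (a sketch with a made-up data set, not one of the lecture's worked examples), the snippet below draws one random mini-batch with numpy. Setting batch_size to 1 would correspond to stochastic gradient descent, while setting it to len(X) would correspond to full gradient descent.

[ ]: import numpy as np

     # a tiny, hypothetical data set of 8 rows with 2 features each
     X = np.arange(16).reshape(8, 2).astype(float)
     y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

     batch_size = 4   # 1 -> stochastic gradient descent, len(X) -> full gradient descent

     # draw one random mini-batch (without replacement) for a single update step
     idx = np.random.choice(len(X), size=batch_size, replace=False)
     X_batch, y_batch = X[idx], y[idx]
     print(X_batch, y_batch)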

Learning rate

The learning rate is a variable used during the learning process. It is either fixed at the start
of the learning process and not changed throughout the optimization, or changed dynamically
to fit the needs of the optimization algorithm. It is a measure of how fast the algorithm should
train the network, and it scales the amount of change applied to the weights after each update step.
In order to train the model, the optimizer has to change the weights. The learning rate is a
multiplier by which the partial derivative is multiplied. The resulting value is then
subtracted from the original weight value in order to calculate the new weight:

wi = wi − η ∗ ∂L/∂wi

Where η is the learning rate value.


If the learning rate is a fixed value chosen at the start of the algorithm, it is conventionally called
static. Choosing a good static learning rate can be a challenging task. A value that is too low can slow
the algorithm down so much that many epochs are required for the optimizer to learn the
underlying inferences in the data. A learning rate that is too high can cause the model to diverge
instead of converging to the optimal weight values for the input data: the loss may improve rapidly at
the start, but the optimizer keeps overshooting the optimum and is never able to fine-tune the weight
values. Divergence during model optimization should be avoided.
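
The effect of a static learning rate can be illustrated on a one-dimensional toy loss. The sketch below (not one of the lecture's examples; the two η values are arbitrary illustrations) applies the update rule above to L(w) = (w − 3)².

[ ]: # gradient descent on L(w) = (w - 3)^2 with two static learning rates
     def descend(eta, steps=20):
         w = 0.0
         for _ in range(steps):
             grad = 2 * (w - 3)    # dL/dw
             w = w - eta * grad    # the update rule w <- w - eta * dL/dw
         return w

     print(descend(0.1))   # converges towards the optimum w = 3
     print(descend(1.1))   # overshoots further with every step and diverges

With η = 0.1 every step shrinks the distance to the optimum w = 3, while with η = 1.1 the distance grows by 20% at every step, which is exactly the divergent behaviour described above.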
If the learning rate is adjusted automatically after each epoch, it is usually called dynamic. A dynamic
learning rate can greatly increase the convergence speed of the machine learning model, and requires no
prior knowledge about the data set in order for it to be effective. Therefore, a dynamic learning rate is
recommended whenever possible.
Example 5.4.3. In our final neural network example, we’ll create a full implementation of a neural
network with a stochastic gradient descent optimizer and a static learning rate.
We will use the previously defined Neuron, sigmoid and MSE functions.

[1]: import numpy as np

     class Neuron:

         def __init__(self, W, b, activ_func):
             self.weights = W
             self.bias = b
             self.activ_func = activ_func

         def execute(self, x):
             return self.activ_func(np.dot(self.weights, x) + self.bias)

[2]: def sigmoid(x):
         return 1/(1 + np.exp(-x))

[3]: def deriv_sigmoid(x):
         return sigmoid(x) * (1 - sigmoid(x))

[4]: def mean_squared_error(y, y_predict):
         return np.mean((y - y_predict)**2)

The ANN class is extended with the individual weights and biases. A train method is introduced
which can be used to iteratively optimize the weights of the network based on the SGD optimizer
algorithm. The loss values are printed during training, which enables the user to analyze the
training results mid-run.
A history is returned from the training, which contains the accumulated epoch indexes and the loss
values.
[5]: class ANN:

         def __init__(self):
             # weights and biases start from random initial values
             self.w1 = np.random.normal()
             self.w2 = np.random.normal()
             self.w3 = np.random.normal()
             self.w4 = np.random.normal()
             self.w5 = np.random.normal()
             self.w6 = np.random.normal()
             self.b1 = np.random.normal()
             self.b2 = np.random.normal()
             self.b3 = np.random.normal()
             self.h1 = Neuron([self.w1, self.w2], self.b1, sigmoid)
             self.h2 = Neuron([self.w3, self.w4], self.b2, sigmoid)
             self.o = Neuron([self.w5, self.w6], self.b3, sigmoid)

         def execute(self, x):
             return self.o.execute(np.array([self.h1.execute(x), self.h2.execute(x)]))

         def train(self, data, y):
             learn_rate = 0.1
             epochs = 1000
             hist = []
             for epoch in range(epochs):
                 for x, y_true in zip(data, y):
                     # forward pass
                     sum_h1 = self.h1.weights[0] * x[0] + self.h1.weights[1] * x[1] + self.h1.bias
                     h1 = sigmoid(sum_h1)
                     sum_h2 = self.h2.weights[0] * x[0] + self.h2.weights[1] * x[1] + self.h2.bias
                     h2 = sigmoid(sum_h2)
                     sum_o1 = self.o.weights[0] * h1 + self.o.weights[1] * h2 + self.o.bias
                     o1 = sigmoid(sum_o1)
                     y_pred = o1

                     # backpropagation: partial derivatives via the chain rule
                     d_L_d_ypred = -2 * (y_true - y_pred)

                     d_ypred_d_w5 = h1 * deriv_sigmoid(sum_o1)
                     d_ypred_d_w6 = h2 * deriv_sigmoid(sum_o1)
                     d_ypred_d_b3 = deriv_sigmoid(sum_o1)
                     d_ypred_d_h1 = self.o.weights[0] * deriv_sigmoid(sum_o1)
                     d_ypred_d_h2 = self.o.weights[1] * deriv_sigmoid(sum_o1)
                     d_h1_d_w1 = x[0] * deriv_sigmoid(sum_h1)
                     d_h1_d_w2 = x[1] * deriv_sigmoid(sum_h1)
                     d_h1_d_b1 = deriv_sigmoid(sum_h1)
                     d_h2_d_w3 = x[0] * deriv_sigmoid(sum_h2)
                     d_h2_d_w4 = x[1] * deriv_sigmoid(sum_h2)
                     d_h2_d_b2 = deriv_sigmoid(sum_h2)

                     # stochastic gradient descent update of every weight and bias
                     self.h1.weights[0] -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1
                     self.h1.weights[1] -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w2
                     self.h1.bias -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_b1
                     self.h2.weights[0] -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w3
                     self.h2.weights[1] -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w4
                     self.h2.bias -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_b2
                     self.o.weights[0] -= learn_rate * d_L_d_ypred * d_ypred_d_w5
                     self.o.weights[1] -= learn_rate * d_L_d_ypred * d_ypred_d_w6
                     self.o.bias -= learn_rate * d_L_d_ypred * d_ypred_d_b3

                 if epoch % 10 == 0:
                     y_preds = np.apply_along_axis(self.execute, 1, data)
                     loss = mean_squared_error(y, y_preds)
                     print("Epoch %d loss: %.3f" % (epoch, loss))
                     hist.append([epoch, loss])
             return np.array(hist)

[6]: X = np.array([[1, 2],
                   [3, 4],
                   [5, 6],
                   [7, 8],
                   ])

     y = np.array([1, 0, 1, 0])

[7]: network = ANN()

     hist = network.train(X, y)

Epoch 0 loss: 0.266


Epoch 10 loss: 0.253

Epoch 20 loss: 0.248


Epoch 30 loss: 0.245
Epoch 40 loss: 0.244
Epoch 50 loss: 0.243
Epoch 60 loss: 0.242
Epoch 70 loss: 0.241
Epoch 80 loss: 0.240
Epoch 90 loss: 0.240
Epoch 100 loss: 0.239
Epoch 110 loss: 0.238
Epoch 120 loss: 0.238
Epoch 130 loss: 0.237
Epoch 140 loss: 0.236
Epoch 150 loss: 0.236
Epoch 160 loss: 0.235
Epoch 170 loss: 0.234
Epoch 180 loss: 0.233
Epoch 190 loss: 0.232
Epoch 200 loss: 0.232
Epoch 210 loss: 0.231
Epoch 220 loss: 0.230
Epoch 230 loss: 0.229
Epoch 240 loss: 0.228
Epoch 250 loss: 0.227
Epoch 260 loss: 0.226
Epoch 270 loss: 0.225

Epoch 670 loss: 0.187


Epoch 680 loss: 0.186
Epoch 690 loss: 0.185
Epoch 700 loss: 0.185
Epoch 710 loss: 0.184
Epoch 720 loss: 0.184
Epoch 730 loss: 0.184
Epoch 740 loss: 0.183
Epoch 750 loss: 0.183
Epoch 760 loss: 0.182
Epoch 770 loss: 0.182
Epoch 780 loss: 0.182
Epoch 790 loss: 0.181
Epoch 800 loss: 0.181
Epoch 810 loss: 0.180
Epoch 820 loss: 0.180
Epoch 830 loss: 0.180
Epoch 840 loss: 0.180
Epoch 850 loss: 0.179
Epoch 860 loss: 0.179

Epoch 870 loss: 0.179


Epoch 880 loss: 0.178
Epoch 890 loss: 0.178
Epoch 900 loss: 0.178
Epoch 910 loss: 0.178
Epoch 920 loss: 0.177
Epoch 930 loss: 0.177
Epoch 940 loss: 0.177
Epoch 950 loss: 0.177
Epoch 960 loss: 0.177
Epoch 970 loss: 0.176
Epoch 980 loss: 0.176
Epoch 990 loss: 0.176

[8]: hist = np.array(hist)

[9]: %matplotlib inline

     import matplotlib.pyplot as plt
     plt.style.use('seaborn-whitegrid')
     fig = plt.figure()
     ax = plt.axes()
     x = np.linspace(0, 10, 1000)
     plt.plot(hist[:,0], hist[:,1])

[9]: [<matplotlib.lines.Line2D at 0x2a23ee2f3c8>]

To create and run such a neural network structure in a modern Python library, only a few lines of
code are required. The code block below shows the creation, training, and evaluation of a neural
network consisting of two hidden layers, one with 50 neurons and one with 100.

[1]: import matplotlib.pyplot as plt

     import numpy as np
     import pandas as pd
     from sklearn import datasets
     from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
     from sklearn.model_selection import train_test_split
     from sklearn.preprocessing import StandardScaler
     from sklearn.neural_network import MLPRegressor

There are two types of feed forward neural network implementations in Scikit-Learn: MLPRe-
gressor and MLPClassifier. The MLPRegressor can be used to learn regression problems, while
MLPClassifier is used to learn classification tasks. It is always a good idea to scale the original
data, as the gradient descent algorithm does not work well with features of different scales.

[2]: X, y = datasets.load_diabetes(return_X_y=True)
X_scaled=StandardScaler().fit_transform(X=X)

[3]: X_train, X_test, y_train, y_test = train_test_split(


X_scaled, y, test_size=0.33, random_state=42)

The model can be changed easily by passing different parameters to the MLPRegressor class.
There are a number of different optimizers and activation functions available to experiment with.
The Scikit-Learn package does not support setting the activation function individually per layer, or
using different layer types. Keras is recommended for implementing advanced neural network
structures.

[4]: model = MLPRegressor(max_iter=1000, hidden_layer_sizes=(50,100)).fit(X_train, y_train)

5.4.3 Prediction
After training, the Artificial Neural Network can be used to predict new data. This can be
achieved by propagating the input values through the network. The output of this propagation
is the predicted value.
Example 5.4.4. The prediction mechanism of both neural network implementations will be presented
in this example.

[10]: print(network.execute([0,3]), "\n",
            network.execute([3,5]), "\n",
            network.execute([5,8]))

0.9217925143269068
0.3532494722460966
0.338036614190695
Following up on the previous MLPRegressor example, we'll inspect the prediction method of
the model. In order to predict with the created model, the model's predict() function has to be
used. The function works with a matrix of input values by default. In order to predict only one
row of data, the data has to be reshaped.

[6]: model.predict(X_test[0].reshape(1, -1))

[6]: array([165.90817188])

The score() function returns the coefficient of determination (R²) of the model. A score of 1
corresponds to a perfect fit, while a score of 0 means the model does no better than always
predicting the mean of the targets. It can be seen easily that the model did not learn the data very
well. You can experiment with the model by setting different parameters on the MLPRegressor. The
list of available parameters can be found in the MLPRegressor documentation.
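
Continuing the MLPRegressor example above, the short sketch below compares the R² score with the explicit error metrics that were already imported. It assumes the model, X_test and y_test objects created in the previous cells.

[ ]: y_pred = model.predict(X_test)

     print("R^2:", model.score(X_test, y_test))
     print("MSE:", mean_squared_error(y_test, y_pred))
     print("MAE:", mean_absolute_error(y_test, y_pred))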

5.4.4 Overfitting
Increasing the training time and the amount of training data might sound like an intuitive way of
increasing model performance. Although increasing the number of epochs can decrease the loss over
time, the model might not become better at actually predicting the true outcomes of unseen data.
This can be explained by model overfitting. As we have discussed before, when a model is trained with
data, it is actually trying to approximate some hidden function which connects the data points in the N
dimensional space. If we try to learn this data with linear regression, we are trying to fit a straight
line in the N dimensional space that best fits the data points.
This process is the same with all supervised learning models, but instead of a straight line, we are
trying to fit a higher-dimensional function onto the data points in the same space. The data points
might describe a general shape in which the data was recorded and labelled.
It is easy to see that data points outside this general shape can occur. Overfitting
appears when the model is trained rigorously to replicate the exact shape of the input data set.
The approximated function will be very complex, and will mostly describe input values which
lie within the shape of the original data set. However, if an overfit model is given a data point that is an
outlier with respect to the training data set, the model will most probably fail to predict the output
with small error.
Overfitting is a serious issue during model training, as it decreases the model's generalization
ability in the prediction phase. There are methods which decrease overfitting, but it is a problem
that is hard to get rid of. In short, training the model to exactly follow the original data set
decreases its ability to create a general function approximation that is capable of correctly predicting
unseen data.

Dropout
One of the simplest and most effective ways to avoid overfitting is to utilize a special technique
called a dropout layer. A dropout layer is parameterized by a probability p, and is inserted before a
given fully connected layer. Each neuron of the following layer has probability p of dropping out of the
next training epoch. The reason for overfitting can often be attributed to an over-reliance on
certain neurons in the network. Some neurons could accumulate huge amounts of weight, while
others slowly decrease in importance. This could lead to a situation where an input which is
not directly connected to the important neuron generates a high amount of loss.
By dropping neurons in the layer randomly, the chance of such an over-reliant neuron appearing
during training is decreased. It is important to only use dropout during training, as predicting with
a sparse network can lead to bad prediction values.
To combat the anomalies that might occur because of a dropout layer, the values in the connected
feedforward layer are scaled with the keep probability (1 − p) in the prediction phase.
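
As an illustration (not one of the lecture's worked examples), the sketch below shows where dropout layers would typically be inserted in a Keras model. The layer sizes, the relu activations, the ten input features, and the 0.2 dropout probability are arbitrary, hypothetical choices.

[ ]: from tensorflow.keras.models import Sequential
     from tensorflow.keras.layers import Dense, Dropout

     model = Sequential([
         Dense(50, activation="relu", input_shape=(10,)),
         Dropout(0.2),   # during training, each output of the previous layer is dropped with probability 0.2
         Dense(100, activation="relu"),
         Dropout(0.2),
         Dense(1),       # single regression output
     ])
     model.compile(optimizer="sgd", loss="mse")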

Regularization
Another useful technique is called regularization. Regularization uses mathematical functions
applied regularly in the subject of statistical analysis. The idea behind regularization is to
discourage the learning of complex inferences in the model by introducing additional penalty terms
during the optimization.

Lasso Regularization
There are two types of regularization we'll discuss: L1 (Lasso) and L2 (Ridge) regularization. L1
regularization takes the loss function and adds a λ||w|| value to it, which is the regularization
term: the sum of the absolute values of the weights is multiplied with a λ regularization parameter.
This parameter serves as a static value, set at the start of the training. L1 regularization often serves
as a term that decreases the values of less important features. L1 is capable of reducing weights to 0,
which effectively deactivates certain neurons.

Ridge Regularization
L2 regularization effectively uses the same formula as L1, but takes the squares of the weight values
instead of their absolute values. The formula for L2 regularization is λ||w||², where ||w||² is the square
of the L2 (Euclidean) norm of the weights, multiplied with the regularization parameter λ. L2
regularization achieves a similar result to L1 regularization, but cannot reduce neuron connections to
exactly 0 weights. For this reason, L2 regularization is used more often, as the deactivation of neurons
is not a goal of regularization in most cases.

Elastic Net Regularization


The term Elastic Net is used for the combination of L1 and L2 regularization. This is achieved by
choosing an α parameter, which is used in the following formula to determine which regularization to
use and in what magnitude:

L_error + α ∗ L1_term + (1 − α) ∗ L2_term
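
To make the three penalty terms concrete, the short sketch below computes them with numpy for an arbitrary, hypothetical weight vector; the λ and α values are likewise arbitrary illustrations.

[ ]: import numpy as np

     w = np.array([0.5, -1.2, 0.0, 2.0])   # hypothetical weight vector
     lam, alpha = 0.01, 0.5                # hypothetical regularization parameters

     l1_term = lam * np.sum(np.abs(w))     # Lasso term: lambda * sum of |w|
     l2_term = lam * np.sum(w ** 2)        # Ridge term: lambda * sum of w^2
     elastic = alpha * l1_term + (1 - alpha) * l2_term
     print(l1_term, l2_term, elastic)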

Example 5.4.5. In Scikit-Learn's neural network implementation, only the L2 regularization term
can be used during training. Dropout and other regularization techniques are not supported as of
the writing of these lecture notes.
The regularization term can be changed by passing a different value to the alpha parameter of the
MLPRegressor.

[ ]: import matplotlib.pyplot as plt

     import numpy as np
     import pandas as pd
     from sklearn import datasets
     from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
     from sklearn.model_selection import train_test_split
     from sklearn.preprocessing import StandardScaler
     from sklearn.neural_network import MLPRegressor

[2]: X, y = datasets.load_diabetes(return_X_y=True)
X_scaled=StandardScaler().fit_transform(X=X)

[3]: X_train, X_test, y_train, y_test = train_test_split(


X_scaled, y, test_size=0.33, random_state=42)

[4]: model = MLPRegressor(max_iter=1000, hidden_layer_sizes=(50,100), alpha=0.25).fit(X_train, y_train)

E:\Anaconda\lib\site-
packages\sklearn\neural_network\_multilayer_perceptron.py:571:
ConvergenceWarning: Stochastic Optimizer: Maximum iterations (1000) reached and the
optimization hasn't converged yet.
% self.max_iter, ConvergenceWarning)

[5]: model.score(X_test, y_test)

[5]: 0.4552609095081024

[6]: model.predict(X_test[0].reshape(1, -1))

[6]: array([151.56398853])

5.5 Unsupervised Learning


The supervised learning algorithms discussed above are capable of learning the mapping between a given
input and output combination. They are called supervised because for every input, they also receive
an associated output value.
Unsupervised learning algorithms [3] are different, as they receive no associated output values.
Unsupervised learning methods process a set of input values in order to create an associated output
for the data set. This could be new classes for data points based on internal attributes, an ordering
between elements, or a hierarchy. Unsupervised learning algorithms are also capable of
transforming the input data into a slightly transformed latent representation.
These lecture notes will not discuss unsupervised learning methods in detail.

5.5.1 Clustering
Clustering is the process of selecting certain elements in the data set and assigning classes to them
based on a clustering function. The classes are not set on the elements beforehand. The requirement
on clustering algorithms is that for a given cluster, all elements must be more similar to
each other in some way than to the elements of other clusters. Clustering is not a single algorithm,
but rather a collection of methods which are capable of calculating similarity and assigning
classes to similar elements.
122 CHAPTER 5. MACHINE LEARNING

Clustering algorithms often use the density function of features in order to determine different
output classes. Another way of calculating similarity is using distance functions. Distance in this
context can be any measurement that can be used to quantify similarity.

5.5.2 Hierarchical Clustering


Hierarchical clustering is one subcategory of clustering, and a collection of algorithms. It uses a
class hierarchy to represent different layers of classes. Hierarchical methods use simple distance
functions in order to determine which label a particular data point receives. The idea
behind this category is that elements which are close to one another often belong in the same
class.

Dendrogram
Dendrograms are tree representations which are often used in hierarchical clustering. A dendrogram
can visualize how the classes are formed in hierarchical clustering.

5.5.3 Centroid-based Clustering


Centroid-based clustering techniques use algorithms to determine center points for a given
number of labels. When given a parameter k, which is the number of desired clusters, centroid-based
clustering algorithms aim to find k clusters whose data points are assigned to their nearest
central point. The most popular centroid-based clustering method is called K-means clustering.
K-means clustering does just that: it starts off from k data points that represent the
clusters. At every iteration, the nearest data point to a cluster center is added to that cluster, and the
center is recalculated.
The problem with this algorithm is that it requires the user to specify how many clusters are
to be created. This knowledge is often missing, and therefore the outcomes can be based on
preference.
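
A minimal sketch of centroid-based clustering with Scikit-Learn's KMeans class is shown below (not one of the lecture's worked examples); the synthetic blob data and the choice of k = 3 are arbitrary illustrations.

[ ]: from sklearn.cluster import KMeans
     from sklearn.datasets import make_blobs

     X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # synthetic data, labels ignored

     kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
     print(kmeans.cluster_centers_)   # the k learned cluster centers
     print(kmeans.labels_[:10])       # the cluster assigned to the first ten points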

5.5.4 Density-based clustering


Density-based clustering methods measure the density of data points inside the N dimensional input
space. Areas of higher data point density can form a separate cluster, while outlier data can be
detected easily, as it will not belong to any cluster because of the lack of nearby data points. One
of the most often used density-based clustering algorithms is DBSCAN.
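
A minimal sketch of density-based clustering with Scikit-Learn's DBSCAN class follows (again not one of the lecture's examples); the synthetic two-moons data and the eps and min_samples values are arbitrary illustrations.

[ ]: from sklearn.cluster import DBSCAN
     from sklearn.datasets import make_moons

     X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)   # synthetic data, labels ignored

     db = DBSCAN(eps=0.2, min_samples=5).fit(X)
     print(set(db.labels_))   # cluster ids; the label -1 marks points treated as noise/outliers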

Question 93.
What are the two main categories of machine learning algorithms?
Question 94.
Can machine learning algorithms make intelligent decisions?

Question 95.
What are the two main categories of machine learning algorithms?

Question 96.
What are the two main requirements of modern machine learning algorithms?

Question 97.
What is the goal of supervised learning?

Question 98.
What is the difference between supervised learning and unsupervised learning algorithms?

Question 99.
What is the biological equivalent of supervised learning?

Question 100.
What are the requirements of data for a supervised learning problem?

Question 101.
How does a machine learning algorithm quantify error?

Question 102.
What is the process of learning in a machine learning algorithm?

Question 103.
What is the loss value if the original function, and the approximated function are identical?

Question 104.
Can you easily explain the processes behind a complex machine learning system?

Question 105.
Why do you need to split the data before learning with machine learning algorithms?

Question 106.
What is the purpose of the training, testing and validation sets?

Question 107.
Why does the data have to be shuffled before splitting?

Question 108.
Why does the data have to be shuffled before splitting?

Question 109.
Why is the splitting size important for the machine learning algorithm?

Question 110.
What is linear in the linear regression algorithm?

Question 111.
What kinds of problems can the linear regression problem solve?

Question 112.
What are the most common loss functions used for regression problems?

Question 113.
How can you tell that the linear regression algorithm is not a good model for the data?

Question 114.
What category of problems can the logistic regression solve?

Question 115.
What is the logistic function?

Question 116.
What variables are trained during logistic regression training?

Question 117.
What type of logistic regression has two types of target values?

Question 118.
What is the difference between a perceptron and an artificial neuron?

Question 119.
How does the perceptron calculate its output?

Question 120.
What kinds of activation functions can be used in neurons?

Question 121.
How are artificial neural networks built?

Question 122.
What is a neural network layer?

Question 123.
What types of layers are there in multilayer neural networks?

Question 124.
How many neurons can a layer contain?

Question 125.
How many hidden layers can a neural network contain?

Question 126.
How are weights calculated from neural network prediction loss?

Question 127.
What types of gradient descent are there?

Question 128.
What is the learning rate of the gradient descent algorithm?

Question 129.
What types of learning rate are there?

Question 130.
Which class can be used to create multilayer artificial neural network in Scikit-learn’s library?

Question 131.
What are the dangers of overfitting?

Question 132.
What techniques can you use to avoid overfitting?
Bibliography

[1] Martín Abadi et al. “Tensorflow: A system for large-scale machine learning”. In: 12th
{USENIX} symposium on operating systems design and implementation ({OSDI} 16). 2016, pp. 265-
283.
[2] Hervé Abdi and Lynne J Williams. “Principal component analysis”. In: Wiley interdisciplinary
reviews: computational statistics 2.4 (2010), pp. 433-459.
[3] Horace B Barlow. “Unsupervised learning”. In: Neural computation 1.3 (1989), pp. 295-311.
[4] Rich Caruana and Alexandru Niculescu-Mizil. “An empirical comparison of supervised
learning algorithms”. In: Proceedings of the 23rd international conference on Machine learning.
2006, pp. 161-168.
[5] Francois Chollet. Deep Learning with Python. 2017.
[6] Gene H Golub and Christian Reinsch. “Singular value decomposition and least squares
solutions”. In: Linear Algebra. Springer, 1971, pp. 134-151.
[7] Jiawei Han, Jian Pei, and Micheline Kamber. Data mining: concepts and techniques. Elsevier,
2011.
[8] David J Hand and Niall M Adams. “Data Mining”. In: Wiley StatsRef: Statistics Reference
Online (2014), pp. 1-7.
[9] Robert Hecht-Nielsen. “Theory of the backpropagation neural network”. In: Neural networks
for perception. Elsevier, 1992, pp. 65-93.
[10] Surya R Kalidindi et al. “Application of data science tools to quantify and distinguish
between structures and models in molecular dynamics datasets”. In: Nanotechnology 26.34 (2015),
p. 344006.
[11] Huan Liu and Hiroshi Motoda. Feature extraction, construction and selection: A data mining
perspective. Vol. 453. Springer Science & Business Media, 1998.
[12] Wes McKinney et al. “pandas: a foundational Python library for data analysis and statistics”.
In: Python for High Performance and Scientific Computing 14.9 (2011).
[13] Ben Noble, James W Daniel, et al. Applied linear algebra. Vol. 3. Prentice-Hall Englewood
Cliffs, NJ, 1977.
[14] Travis E Oliphant. A guide to NumPy. Vol. 1. Trelgol Publishing USA, 2006.
[15] Sankar K Pal and Sushmita Mitra. “Multilayer perceptron, fuzzy sets, classification”. In:
(1992).
[16] Fabian Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: The Journal of Machine
Learning Research 12 (2011), pp. 2825-2830.
[17] George AF Seber and Alan J Lee. Linear regression analysis. Vol. 329. John Wiley & Sons,
2012.
[18] Donald F Specht et al. “A general regression neural network”. In: IEEE transactions on
neural networks 2.6 (1991), pp. 568-576.
[19] Sandro Tosi. Matplotlib for Python developers. Packt Publishing Ltd, 2009.


[20] Lloyd N Trefethen and David Bau III. Numerical linear algebra. Vol. 50. Siam, 1997.
[21] Wil Van Der Aalst. “Data science in action”. In: Process mining. Springer, 2016, pp. 3-23.
[22] Jake VanderPlas. Python data science handbook: Essential tools for working with data. O'Reilly
Media, Inc., 2016.
[23] Svante Wold, Kim Esbensen, and Paul Geladi. “Principal component analysis”. In: Chemometrics
and intelligent laboratory systems 2.1-3 (1987), pp. 37-52.
[24] Raymond E Wright. “Logistic regression.” In: (1995).
