scikit-learn
▪ introduction
▪ installation/distribution
▪ essential/auxiliary libraries
▪ usage
scikit-learn
introduction---
scikit-learn (also known as sklearn) is a free software machine learning library for the Python programming language.
▪ free
▪ open-source
▪ constantly being developed and improved
▪ an active user community
▪ state-of-the-art machine learning algorithms
▪ provides nice documentation
▪ widely used in industry and academia
▪ a wealth of tutorials and code snippets are available online
▪ works well with many scientific Python tools
scikit-learn
dependencies---
scikit-learn relies heavily on NumPy and SciPy for its functions; moreover, it can be used more effectively with other auxiliary packages:
▪ for scientific computation: NumPy, SciPy
▪ for plotting: matplotlib
▪ for interactive development: IPython, Jupyter Notebook
scikit-learn
installation---
▪ can be installed independently
▪ (recommended) can be installed via a number of Python distributions:
  1. Anaconda (free; recommended)
  2. Enthought Canopy (not free)
  3. Python(x,y) (free)
⇛ if you install any of these Python distributions, scikit-learn comes packaged with it.
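A quick way to confirm that scikit-learn is available, whichever route you installed it by (a minimal sketch, run in any Python shell):

import sklearn

# print the installed version; an ImportError here means
# scikit-learn is not installed in the active environment
print(sklearn.__version__)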
scikit-learn
anaconda---
a Python distribution for large-scale data processing, predictive analytics, and scientific computing
⇛ comes with:
▪ NumPy
▪ SciPy
▪ matplotlib
▪ pandas
▪ IPython
▪ Jupyter Notebook
▪ scikit-learn
⇛ available on:
▪ Mac OS
▪ Windows
scikit-learn
libraries---
essentially required, or increase the effectiveness of scikit-learn:
⇛ NumPy
⇛ SciPy
⇛ Jupyter Notebook
⇛ matplotlib
⇛ Pandas

Jupyter Notebook
• provides an interactive environment
• runs code in the browser
• a great tool for exploratory data analysis
• widely used by data scientists
• supports many programming languages

NumPy
• the fundamental package for scientific computing in Python
• provides functionality for:
  • multidimensional arrays
  • high-level mathematical functions, e.g.,
    • linear algebra operations
    • Fourier transforms
    • pseudorandom number generators
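A small sketch of the NumPy features listed above (arrays, linear algebra, Fourier transforms, random numbers):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])  # a 2x2 multidimensional array

print(a @ a)                     # linear algebra: matrix multiplication
print(np.linalg.inv(a))          # linear algebra: matrix inverse
print(np.fft.fft([1, 0, 1, 0]))  # Fourier transform of a small signal

rng = np.random.default_rng(seed=0)  # pseudorandom number generator
print(rng.normal(size=3))        # three samples from a standard normal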
scikit-learn
NumPy, SciPy :: strengths ::
scikit-learn
libraries--- SciPy
• a collection of functions for scientific computing
• provides, among other functionality:
  • advanced linear algebra routines
  • mathematical function optimization
  • signal processing
  • special mathematical functions
  • statistical distributions
• scikit-learn draws from SciPy's collection of functions for implementing its algorithms.
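A small sketch of the SciPy functionality listed above; scipy.sparse in particular matters for scikit-learn, which accepts sparse matrices as input:

import numpy as np
from scipy import sparse, optimize, stats

# a sparse matrix in CSR format (mostly zeros, stored compactly)
eye = sparse.csr_matrix(np.eye(4))
print(eye)

# mathematical function optimization: minimize (x - 3)^2
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)             # ~3.0

# statistical distributions: P(X <= 0) for a standard normal
print(stats.norm.cdf(0.0))  # 0.5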
scikit-learn
libraries--- matplotlib
• the primary scientific plotting library in Python
• provides functions for making publication-quality visualizations:
  • line charts
  • histograms
  • scatter plots
  • and so on
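A minimal sketch of the plot types named above:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
rng = np.random.default_rng(0)

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes[0].plot(x, np.sin(x))                                # line chart
axes[1].hist(rng.normal(size=500), bins=20)               # histogram
axes[2].scatter(x, np.sin(x) + 0.1 * rng.normal(size=100))  # scatter plot
plt.show()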
scikit-learn
libraries--- pandas
• a Python library for data wrangling and analysis
• built around a data structure called the DataFrame
  • a DataFrame is a table
  • has methods for manipulating this table, e.g., it allows SQL-like queries and joins on such tables
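A small sketch of a DataFrame and a SQL-like query on it:

import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cal"],
    "age":  [34, 28, 45],
})

print(df[df["age"] > 30])     # SQL-like filter: SELECT * WHERE age > 30
print(df.sort_values("age"))  # one of many table-manipulation methods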
Fitting the Linear Regression Model
▪ training set: $\tau = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, with $x^{(i)} \in \mathbb{R}^n$, $y^{(i)} \in \mathbb{R}$
▪ model: $\hat{y}^{(i)} = w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_n x_n^{(i)}$
▪ model parameters: $w_0, w_1, w_2, \ldots, w_n$
▪ intercept: $w_0$
▪ coefficients: $w_1, w_2, \ldots, w_n$
▪ dataset: the Boston data
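The model equation written out in code (a sketch; w0 is the intercept and w the coefficient vector):

import numpy as np

def predict(x, w0, w):
    # y_hat = w0 + w1*x1 + w2*x2 + ... + wn*xn  for one example x in R^n
    return w0 + np.dot(w, x)

After fitting, scikit-learn's LinearRegression exposes $w_0$ as lr.intercept_ and $w_1, \ldots, w_n$ as lr.coef_.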
the Boston data
• the Boston house-price data of Harrison, D. and Rubinfeld, D. L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978.
▪ also used in 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity'
▪ the question: what influences housing prices in Boston?
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24
0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14 21.6
0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.9 5.33 36.2
0.02985 0 2.18 0 0.458 6.43 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.6 12.43 22.9
0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.9 19.15 27.1
0.21124 12.5 7.87 0 0.524 5.631 100 6.0821 5 311 15.2 386.63 29.93 16.5
the Boston housing example
▪ the columns CRIM through LSTAT are the features $x_1^{(i)}, x_2^{(i)}, \ldots, x_{13}^{(i)}$; the last column, MEDV, is the target $y^{(i)}$, which the model approximates as $\hat{y}^{(i)}$ (the table is the same as on the previous slide)
▪ size: $506 \times (13 + 1)$
▪ $\hat{y}^{(i)} = w_0 + w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_{13} x_{13}^{(i)}$
▪ $f: \mathbb{R}^{13} \to \mathbb{R}$
exploring the data
Steps--- (see the code sketch after this list)
▪ import the dataset loader
▪ create the loader object
▪ explore/understand the data:
  ▪ shape of the data (rows = training examples, columns = features; here 506 rows, 13 features, 1 target)
  ▪ description (DESCR): information about the data
  ▪ feature names/values: the names of the features and their values
  ▪ target names/values
  ▪ file path
  ▪ etc.
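A sketch of these steps in code. Note: this deck predates scikit-learn 1.2, which removed load_boston; on current versions, fetch the data from OpenML instead (e.g., fetch_openml(name="boston", version=1)).

from sklearn.datasets import load_boston  # import the dataset loader

boston = load_boston()                     # create the loader object

print(boston.data.shape)     # (506, 13): rows/examples x columns/features
print(boston.target.shape)   # (506,): one target value (MEDV) per row
print(boston.feature_names)  # names of the 13 features
print(boston.DESCR[:500])    # description: information about the data
print(boston.filename)       # file path of the underlying data file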
training the algorithm
training--- (see the code sketch after this list)
▪ split the data into training (75%) and test (25%) sets
▪ import the model
▪ fit the model to the data
▪ test the model
▪ predict
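A minimal sketch of these training steps on the Boston data, continuing from the loading code above:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression  # import the model

# split into training (75%) and test (25%) sets; 0.25 is the default test size
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)  # fit: derives w_0 (lr.intercept_) and w_1..w_13 (lr.coef_)

print(lr.score(X_test, y_test))  # test the model: R^2 on held-out data
print(lr.predict(X_test[:3]))    # predict MEDV for three unseen examples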
the iris data
Iris Flower---
Data about 150 iris flowers to be classified into 3 varieties: setosa, versicolor, virginica.

Sepal length  Sepal width  Petal length  Petal width  species
5.1           3.3          1.7           0.5          setosa
4.9           3.0          1.4           0.2          versicolor
5.4           3.6          1.4           0.2          setosa
6.0           2.7          5.1           1.5          virginica

size: $150 \times (4 + 1)$
training the algorithm
Steps--- (see the code sketch after this list)
▪ load the data
▪ explore the data
▪ split into training and validation subsets
▪ import the optimizer
▪ fit to the data (derive the model)
▪ check accuracy of the model on the data
▪ predict with the model derived
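A sketch of these steps on the iris data, using LogisticRegression as the classifier (see the documentation link below):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()                         # load the data
print(iris.data.shape, iris.target_names)  # explore: (150, 4), 3 varieties

X_train, X_test, y_train, y_test = train_test_split(  # split into subsets
    iris.data, iris.target, random_state=0)

clf = LogisticRegression(max_iter=1000)  # import/create the optimizer
clf.fit(X_train, y_train)                # fit to the data (derive the model)

print(clf.score(X_test, y_test))         # check accuracy of the model
print(iris.target_names[clf.predict(X_test[:2])])  # predict with the derived model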
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

end