Machine Learning libraries (NumPy, SciPy, matplotlib, scikit-learn, pandas)
A short introduction to the Python libraries widely used for machine learning: NumPy, SciPy,
matplotlib, scikit-learn and pandas
Until now I have written every tutorial without libraries. Now I'm taking our journey to the next level,
where we will use Python libraries for classification, visualization and clustering. This article gives a
short introduction to NumPy, SciPy, matplotlib, scikit-learn and pandas.
NumPy
NumPy provides an n-dimensional array object. It also provides mathematical functions that
operate on these arrays and can be used in many calculations.
Command to install: pip install numpy
import numpy as np
arr = np.array([[1,2,3],[4,5,6]])
print("Numpy array\n{}".format(arr))
Output
Numpy array
[[1 2 3]
 [4 5 6]]
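To illustrate the mathematical functions mentioned above, here is a minimal sketch that reuses the same array. The variable names (doubled, total, column_means, roots) are chosen only for this illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Element-wise arithmetic: every entry is doubled without an explicit loop
doubled = arr * 2

# Reductions: the sum of all entries and the mean of each column
total = arr.sum()
column_means = arr.mean(axis=0)

# Element-wise mathematical function: square root of every entry
roots = np.sqrt(arr)

print("Shape: {}".format(arr.shape))
print("Doubled:\n{}".format(doubled))
print("Total: {}".format(total))
print("Column means: {}".format(column_means))
```

Here axis=0 tells NumPy to average down the rows, which gives one mean per column.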
SciPy
SciPy is a collection of scientific computing functions. It provides advanced linear algebra routines,
mathematical function optimization, signal processing, special mathematical functions, and
statistical distributions.
Command to install: pip install scipy
import numpy as np
from scipy import sparse
# Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(3)
print("NumPy array:\n{}".format(eye))
# Convert the NumPy array to a SciPy sparse matrix in CSR format
# Only the non-zero entries are stored
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n{}".format(sparse_matrix))
Output
NumPy array:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
SciPy sparse CSR matrix:
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
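The sparse matrix above covers only one corner of SciPy. As a rough sketch of two other areas listed in the description, function optimization and statistical distributions, the following may help. The objective function is a hypothetical example made up for this illustration.

```python
from scipy import optimize, stats


# Hypothetical objective for illustration: a parabola whose minimum is at x = 2
def objective(x):
    return (x - 2.0) ** 2 + 1.0


# Search for the value of x that minimizes the one-dimensional function
result = optimize.minimize_scalar(objective)
print("Minimum found near x = {:.3f}".format(result.x))

# Standard normal distribution (mean 0, standard deviation 1)
normal = stats.norm(loc=0.0, scale=1.0)

# Cumulative probability of drawing a value less than or equal to 0
print("P(X <= 0) = {:.2f}".format(normal.cdf(0.0)))
```

The returned result object also carries the function value at the minimum in result.fun.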
matplotlib
matplotlib is a scientific plotting library used to visualize data, and visualization is an important
part of analyzing data. You can plot histograms, scatter plots, line charts and more.
Command to install: pip install matplotlib
import matplotlib.pyplot as plt
x = [1,2,3]
y = [4,5,6]
plt.scatter(x,y)
plt.show()
Output
(Figure: scatter plot of the points (1, 4), (2, 5) and (3, 6).)
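Histograms and line charts, mentioned above, follow the same pattern as the scatter plot. The sketch below draws both side by side. The sample values and the output file name are hypothetical, and the Agg backend is an assumption made so the script also runs on a machine without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt

values = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # hypothetical sample data

# One figure with two plots next to each other
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line chart through the same three points used in the scatter example
ax_line.plot([1, 2, 3], [4, 5, 6], marker="o")
ax_line.set_title("Line")

# Histogram: count how many values fall into each of 5 equal-width bins
counts, bin_edges, _ = ax_hist.hist(values, bins=5)
ax_hist.set_title("Histogram")

fig.tight_layout()
fig.savefig("library_intro_plots.png")  # hypothetical file name
```

In an interactive session you can drop the matplotlib.use("Agg") line and call plt.show() instead of saving the figure.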
scikit-learn
scikit-learn is built on NumPy, SciPy and matplotlib, and provides tools for data analysis and data
mining. It has built-in classification and clustering algorithms and ships with practice datasets such as
the iris dataset, the diabetes dataset and the Boston house prices dataset (removed from the library in version 1.2).
Command to install: pip install scikit-learn
from sklearn import datasets
iris_data = datasets.load_iris()
sample = iris_data['data'][:3]
print("iris dataset sample data:\n{}".format(iris_data['feature_names']))
print("{}".format(sample))
Output
iris dataset sample data:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]
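As a first taste of the built-in classification and clustering algorithms, here is a minimal sketch on the same iris data. The choice of k-nearest neighbors and k-means, the split size and the parameter values are assumptions made for illustration, not recommendations.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_data = datasets.load_iris()
X, y = iris_data["data"], iris_data["target"]

# Classification: hold out 25% of the samples to check the trained model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)
print("Test accuracy: {:.2f}".format(accuracy))

# Clustering: group the samples into 3 clusters without using the labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)
print("First ten cluster labels: {}".format(cluster_labels[:10]))
```

Note that the cluster numbers are arbitrary, so cluster 0 does not have to correspond to class 0 of the dataset.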
pandas
pandas is used for data analysis. It can take multi-dimensional arrays as input and produce
charts and graphs. A pandas table (a DataFrame) can hold columns of different data types, and pandas
can ingest data from a variety of files and databases, such as SQL, Excel and CSV.
Command to install: pip install pandas
import pandas as pd
age = {'age': [4, 6, 8, 34, 5, 30, 41]}
dataframe = pd.DataFrame(age)
print("all age:\n{}".format(dataframe))
# Keep only the rows where age is greater than 20
filtered = dataframe[dataframe.age > 20]
print("age above 20:\n{}".format(filtered))
Output
all age:
   age
0    4
1    6
2    8
3   34
4    5
5   30
6   41
age above 20:
   age
3   34
5   30
6   41
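The description above also mentions columns of different data types and reading from files such as CSV. The sketch below shows both. The CSV content and column names are hypothetical, and io.StringIO stands in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO lets us read it like a file
csv_text = """name,age,member
Asha,34,True
Ben,6,False
Chen,41,True
"""

people = pd.read_csv(io.StringIO(csv_text))

# Each column gets its own data type (text, integer, boolean)
print(people.dtypes)

# Same filtering idea as before, then a summary of the filtered rows
adults = people[people["age"] > 20]
print("Mean age of people above 20: {}".format(adults["age"].mean()))
```

With a real file you would pass its path to pd.read_csv instead of the StringIO object.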
                                       | Requests                      | Beautiful Soup                | lxml            | Selenium
Purpose                                | Simplify making HTTP requests | Parsing                       | Parsing         | Simplify making HTTP requests
Ease-of-use                            | High                          | High                          | Medium          | Medium
Speed                                  | Fast                          | Fast                          | Very fast       | Slow
Learning Curve                         | Very easy (beginner-friendly) | Very easy (beginner-friendly) | Easy            | Easy
Documentation                          | Excellent                     | Excellent                     | Good            | Good
JavaScript Support                     | None                          | None                          | None            | Yes
CPU and Memory Usage                   | Low                           | Low                           | Low             | High
Size of Web Scraping Project Supported | Large and small               | Large and small               | Large and small | Small
Web scraping Python libraries compared