Machine Learning With Data Science
Chapter 1
An Exploration of Python Libraries in Machine Learning Models for Data Science
Jawahar Sundaram
https://orcid.org/0000-0002-8101-8725
CHRIST University (Deemed), India
Sujith Jayaprakash
https://orcid.org/0000-0003-1933-6922
BlueCrest University College, Ghana
S. Gokuldev
https://orcid.org/0000-0001-8393-4674
Rathinam College of Arts and Science, India
M. Thenmozhi
https://orcid.org/0009-0002-0846-2325
Sri Eshwar College of Engineering, India
ABSTRACT
This chapter uses Python libraries to create data science models. Data science
involves constructing models that can predict and act on data, and it draws heavily
on machine learning. Because of the exponential growth of data, data science has
become an essential component of many fields. Python is a popular programming
language for implementing machine learning models. The chapter discusses machine
learning's role in data science and Python's role in this field, and shows how
Python can be applied in practice. A breast cancer dataset serves as the data source
for building machine learning models with Python libraries. The libraries discussed
include pandas, NumPy, Matplotlib, seaborn, scikit-learn, and TensorFlow, along with
data preprocessing methods. Several machine learning models are applied to the
breast cancer dataset using these libraries. The chapter concludes with a discussion
of machine learning's future in data science. Python libraries for machine learning
are valuable tools for data scientists and researchers in general.
DOI: 10.4018/978-1-6684-8696-2.ch001
Copyright © 2023, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
1. INTRODUCTION
Data mining (DM) involves preparing data from different sources, such as databases,
text files, and streams, and then modeling that data with techniques chosen
according to the goal, such as classification, clustering, regression, or
association rule mining. Applying machine learning (ML) techniques in DM enables an
organization to discover new knowledge. Data preparation is the part of data
analysis that covers preprocessing and manipulating the data. Data preprocessing
includes cleaning, integrating, transforming, and reducing raw data to make it more
suitable for analysis. Data wrangling then takes the preprocessed data and changes
its format so that it can be modelled easily.
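The preprocessing and wrangling steps described above can be sketched with pandas. This is only a minimal illustration on a made-up table; the column names and values are assumptions and are not data from the chapter.

```python
import pandas as pd

# Hypothetical raw records with a duplicate row and a missing value
raw = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "radius": [14.2, None, None, 20.1],
    "diagnosis": ["benign", "malignant", "malignant", "malignant"],
})

# Cleaning: drop exact duplicate rows, then fill missing numbers with the column mean
clean = raw.drop_duplicates().copy()
clean["radius"] = clean["radius"].fillna(clean["radius"].mean())

# Transforming: rescale the numeric column to the range [0, 1]
r = clean["radius"]
clean["radius_scaled"] = (r - r.min()) / (r.max() - r.min())

# Wrangling: encode the text label as an integer so a model can consume it
clean["label"] = clean["diagnosis"].map({"benign": 0, "malignant": 1})

print(clean)
```

Filling with the column mean and min-max scaling are just two common choices; other imputation or scaling strategies may suit a given dataset better.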
Machine learning has grown rapidly in the past few years, and today there are
many types and subtypes of machine learning. Machine learning is the study of
what makes computers capable of learning on their own without being explicitly
programmed. This approach makes it possible to tackle problems that are difficult
to solve with conventional, explicitly programmed methods. Depending on their
purpose, machine learning models can be grouped into classification, clustering,
and regression models. A linear regression model, for example, is used to
understand the relationship between numerical inputs and a numerical output.
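As a small illustration of the linear regression idea, the sketch below fits a straight line to a handful of points with NumPy. The data are invented for the example (generated from y = 2x + 1), and np.polyfit is only one of several ways to fit such a model.

```python
import numpy as np

# Hypothetical numerical inputs (x) and outputs (y), generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Fit a degree-1 polynomial, i.e. a straight line, by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict the output for a new input
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```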
Classification models can be used, for example, to identify the sentiment of a
post: a review can be classified as positive or negative based on the words used
in it. Similar models can classify emails as spam or not spam based on their
contents. A clustering model finds objects whose characteristics are similar to
one another. ML algorithms are used in many interesting ways around the world.
Machine learning is a powerful tool that lets data scientists analyze large amounts
of data and develop models that make predictions and support decisions. By
automating the search for patterns and correlations that would otherwise be
difficult to find, it improves the efficiency and accuracy of decision-making.
Machine learning can also be used to identify new opportunities, reduce costs, and
create new products and services that are more closely tailored to customer needs.
Additionally, it can automate mundane tasks, such as routine customer service,
which can improve the customer experience.
The objectives of this chapter are: 1. to provide a basic understanding of the
Python libraries used in data science for the machine learning process; 2. to show
how these libraries are used for analysing biological sequences; and 3. to identify
differentially expressed genes in the sequence for different predictions.
Table 1.
PyTorch | Adam Paszke et al. | Advances in Neural Information Processing Systems | PyTorch is a widely used deep learning library known for its dynamic computational graph and efficient GPU acceleration. It offers a seamless development experience with support for automatic differentiation and a flexible design that enables researchers to implement complex deep learning models with ease. | 2019
XGBoost | Tianqi Chen et al. | Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining | XGBoost is an optimized gradient boosting library that excels in solving structured data problems. It provides an implementation of the gradient boosting algorithm with enhanced performance and scalability. XGBoost has won numerous data science competitions and is widely used in various domains. | 2016
LightGBM | Guolin Ke et al. | Neural Information Processing Systems | LightGBM is a gradient boosting framework that focuses on efficiency and accuracy. It employs a novel tree-based learning algorithm and offers faster training speeds compared to other gradient boosting implementations. LightGBM is particularly suitable for large-scale datasets and has gained popularity for its competitive performance. | 2017
Python libraries fall into two categories: built-in libraries and external
libraries. Built-in libraries ship with Python as part of its standard library,
so they do not need to be installed separately. In contrast, external libraries
are created by third-party developers and must be installed before they can be
used (Abadi et al., 2016).
In Python scripts, import statements are used to include libraries. Many libraries
are distributed as .py source files that Python loads when the import statement
runs. Once a library is imported, its classes, functions, and variables are
available to the script (Chollet, 2017).
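The common forms of the import statement can be sketched as follows. The example uses only built-in (standard library) modules, so nothing needs to be installed; the specific modules were chosen for illustration.

```python
# Import a whole built-in library and refer to its contents by prefix
import math

# Import a library under a shorter alias
import statistics as stats

# Import a single name from a library
from collections import Counter

area = math.pi * 2 ** 2           # a variable (pi) from math
average = stats.mean([1, 2, 3])   # a function from statistics
counts = Counter("banana")        # a class from collections

print(area, average, counts.most_common(1))
```

External libraries are imported in the same way once they have been installed, for example with pip.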
The Python programming language supports a variety of libraries that are useful
for data analysis and visualization, web development, machine learning, and scientific
computing. Some of the popular Python libraries are represented in Figure 2.
Table 2.
Python is one of the most popular languages for data science, and many libraries
are available for machine learning, data analysis, and related tasks. Here are some
of the most commonly used libraries for data science in Python:
NumPy
NumPy is one of the most popular Python libraries for scientific computing. It
provides functions for performing mathematical operations on arrays and matrices.
Example Code:
import numpy as np
# Create a NumPy array of integers
a = np.array([1, 2, 3, 4, 5])
# Create a NumPy array of floating-point numbers
b = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
# Perform some numerical operations on the arrays
c = a + b
d = a * b
e = np.sqrt(a)
# Print the arrays and the results of the operations
print('a:', a)
print('b:', b)
print('a + b:', c)
print('a * b:', d)
print('sqrt(a):', e)
Pandas
Pandas is a Python library for manipulating and analyzing data. It provides data
structures such as DataFrames and Series, along with functions for cleaning,
transforming, and analyzing data.
Example Code:
import pandas as pd
# Create a Pandas DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 32, 18, 47],
    'Gender': ['F', 'M', 'M', 'M']})
# Print the DataFrame
print(df)
# Filter the DataFrame to only show rows where Age is greater than 30
df_filtered = df[df['Age'] > 30]
# Print the filtered DataFrame
print(df_filtered)
# Group the DataFrame by Gender and calculate the mean Age for each group
df_grouped = df.groupby('Gender').agg({'Age': 'mean'})
# Print the grouped DataFrame
print(df_grouped)
Matplotlib
Scikit-learn
TensorFlow
TensorFlow is a Python library that supports a variety of machine learning and deep
learning algorithms. It includes tools for creating and training neural networks,
as well as tools for evaluating and deploying models.
Example Code:
import tensorflow as tf
from tensorflow.keras.datasets import mnist
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Normalize the pixel values to the range [0, 1]
x_train = x_train / 255.0
x_test = x_test / 255.0
# Create a simple neural network model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)])
# Compile the model; the last layer outputs logits, hence from_logits=True
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
# Train the model
model.fit(x_train, y_train, epochs=5)
# Evaluate the model on the testing data
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")
Keras
Example Code:
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
# Create a toy dataset (the XOR problem)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
# Create a simple neural network model
model = Sequential()
model.add(Dense(8, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
# Train the model
model.fit(X, y, epochs=1000, verbose=0)
# Evaluate the model on the training data; scores holds [loss, accuracy]
scores = model.evaluate(X, y)
print(f"accuracy: {scores[1] * 100:.2f}%")
Statsmodels
import numpy as np
import statsmodels.api as sm
# Create a toy dataset
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([1, 3, 2, 5, 7, 8])
# Add a constant to the data for the intercept term
x = sm.add_constant(x)
# Fit a simple linear regression model
model = sm.OLS(y, x).fit()
# Print the estimated intercept and slope
print(model.params)
Many Python libraries are available for data science. Each has its own strengths
and weaknesses, so you should choose the one best suited to your specific needs
(Hunter, 2007).
Table 3.
Python is a popular programming language for machine learning, and a number of
libraries are available for it. Here are some of the most commonly used libraries
for machine learning in Python:
Table 4.
Library | Description | Main Use Cases | Programming Language
Scikit-learn | Machine learning library for traditional models | Classification, regression, clustering, etc. | Python
TensorFlow | Deep learning framework | Neural networks, natural language processing | Python, C++, Java
Keras | High-level API for deep learning | Rapid prototyping, easy model building | Python
PyTorch | Deep learning framework | Neural networks, natural language processing | Python
OpenCV | Computer vision library | Image and video processing | C++, Python, Java
Django
Flask
Flask is a flexible web framework that lets developers build web applications
quickly. It is designed to be lightweight and easy to use, which makes it a good
choice for beginners. Even though it lacks some built-in features, it offers a wide
range of extensions that can be installed to add the functionality you need.
Example Code:
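A minimal sketch of a Flask application, assuming Flask is installed. The route, the function name, and the returned text are illustrative and are not taken from the chapter; the built-in test client is used so that the example runs without starting a server.

```python
from flask import Flask

app = Flask(__name__)

# Map the root URL to a view function (route and name are illustrative)
@app.route("/")
def index():
    return "Hello, world!"

# Exercise the route with Flask's built-in test client instead of starting a server
with app.test_client() as client:
    response = client.get("/")
    print(response.status_code, response.get_data(as_text=True))
```

During development, the application would typically be served with app.run() or the flask command-line tool rather than the test client.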
Pyramid
Pyramid is an open source framework that supports web applications ranging from
small to large. It is highly customizable and adheres to the "Don't Repeat
Yourself" (DRY) principle. Its features include URL routing, templating, and
security.
Example Code:
Bottle
Bottle is a micro web framework that is simple to use. It is lightweight and fast,
and it provides a set of basic features such as routing, templating, and data
integration.
Example Code:
Web2py
Web2py includes many built-in features, such as an ORM, an admin interface, and a
web-based development environment, which make it easier to use. It is based on the
MVC architectural pattern and is designed to be both scalable and secure.
Example Code:
def index():
    return 'Hello, world!'
def user():
    return 'User page'
Python offers a wide variety of free libraries and frameworks for web development.
The best choice depends on the specific requirements of your project.
Table 5.
Python offers several data visualization libraries that are popular with data
scientists and analysts (Reback & McKinney, 2020; Satyanarayan et al., 2017;
Wickham, 2009).
Matplotlib
Matplotlib is a data visualization library that is easy to use and highly
customizable. It can create a wide range of charts, including line, bar, scatter,
and histogram charts, to showcase your data.
Example Code:
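A minimal sketch of the four chart types named above, assuming Matplotlib is installed. The sample data and titles are made up for illustration, and the non-interactive Agg backend is selected so that the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Draw one chart type in each panel of a 2 x 2 grid
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(x, y)
axes[0, 0].set_title("Line")
axes[0, 1].bar(x, y)
axes[0, 1].set_title("Bar")
axes[1, 0].scatter(x, y)
axes[1, 0].set_title("Scatter")
axes[1, 1].hist(y, bins=5)
axes[1, 1].set_title("Histogram")

fig.tight_layout()
```

To view the result, the figure can be written to a file with fig.savefig(...), or the backend line can be removed and plt.show() called in an interactive session.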
Seaborn
Seaborn is built on top of Matplotlib and provides a higher-level interface for
creating more advanced visualizations. It can produce heatmaps, violin plots,
regression plots, and other charts, and it is well suited to exploratory data
analysis.
Example Code:
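A minimal sketch of a Seaborn heatmap, assuming seaborn, pandas, and Matplotlib are installed. The measurements below are invented for the example; the heatmap shows the pairwise correlations between the columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical measurements
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 80, 92],
    "age": [31, 25, 40, 22, 35],
})

# Heatmap of pairwise correlations, with the coefficient written in each cell
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
fig.tight_layout()
```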
Plotly
Plotly is a data visualization library for creating interactive, web-based charts
and graphs. It can create scatter plots, line charts, bar charts, and
three-dimensional visualizations.
Example Code:
import plotly.graph_objs as go
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create a plotly figure
fig = go.Figure(data=go.Scatter(x=x, y=y, mode='markers'))
# Add labels and title
fig.update_layout(title='Sample plot', xaxis_title='X-axis',
                  yaxis_title='Y-axis')
# Show the plot
fig.show()
ggplot
Altair
Table 6.
Python supports several libraries for retrieving data from websites, a task known
as web scraping (Cock et al., 2009; Jarvis et al., 2006). Here are some of the most
commonly used ones:
Beautiful Soup
Beautiful Soup is a Python library for extracting information from HTML or XML
documents. Its simple and intuitive interface makes it easy to parse HTML and
extract data from its contents.
Example Code:
import requests
from bs4 import BeautifulSoup
# Make a request to a web page
page = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
# Create a Beautiful Soup object
soup = BeautifulSoup(page.content, 'html.parser')
# Find the page title
title = soup.title.string
# Find all the paragraph tags on the page
paragraphs = soup.find_all('p')
# Print the page title and the first paragraph
print(title)
print(paragraphs[0])
Scrapy
import scrapy
class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://en.wikipedia.org/wiki/Python_(programming_language)']
    def parse(self, response):
        # Find the page title
        title = response.css('title::text').get()
        # Find the text of all the paragraph tags on the page
        paragraphs = response.css('p::text').getall()
        # Print the page title and the first paragraph
        print(title)
        print(paragraphs[0])
Selenium
Selenium is a library for automating web browsers, primarily for testing web
applications. Because it can automate interactions with web pages, such as clicking
buttons and completing forms, it can also be used for web scraping.
Example Code:
Requests
Requests is a Python library for making HTTP requests. A website is scraped by
sending HTTP requests to it and parsing the HTML or JSON response that is returned.
Example Code:
import requests
# Make a GET request to a web page
response = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
# Print the response status code
print(response.status_code)
# Print the response content
print(response.content)
PyQuery
PyQuery is a Python library that is similar to jQuery in many ways. Its simple and
intuitive interface makes it easy to parse HTML and XML documents and extract data
from them.
Example Code:
LXML
LXML is a Python library for parsing XML and HTML documents. It is built on the C
libraries libxml2 and libxslt, and it provides many features for working with XML
and HTML data in a variety of ways.
Example Code:
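A minimal sketch of parsing HTML with lxml, assuming the lxml package is installed. The HTML document is made up for illustration, and the XPath expressions are just examples of how elements might be selected.

```python
from lxml import html

# A small HTML document, made up for illustration
document = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# Parse the string into an element tree
tree = html.fromstring(document)

# Find the page title, all paragraph texts, and the paragraph with class "intro"
title = tree.findtext(".//title")
paragraphs = [p.text_content() for p in tree.xpath("//p")]
intro = tree.xpath('//p[@class="intro"]/text()')

print(title)
print(paragraphs)
print(intro)
```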
Table 7.
Biopython
Biopython is a Python library that provides an extensive set of tools for
bioinformatics. Its uses include the analysis of sequences, the analysis of protein
structures, and the parsing of most common file formats.
SciPy
SciPy is a Python library for scientific computing. It supports various
bioinformatics tasks, including statistical analysis, optimization, and signal
processing.
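Two of the tasks mentioned above, statistical analysis and optimization, can be sketched with SciPy as follows. The expression values are invented for the example, and the t-test and the quadratic objective are illustrative choices rather than the chapter's own analysis.

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical expression levels of one gene in two groups of samples
control = np.array([5.1, 4.9, 5.0, 5.2, 4.8])
treated = np.array([6.0, 6.2, 5.9, 6.1, 6.3])

# Statistical analysis: two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(control, treated)
print(t_stat, p_value)

# Optimization: find the minimum of a simple quadratic function
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)
print(result.x, result.fun)
```

A small p-value suggests that the two group means differ; in practice the choice of test depends on the data and the study design.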
Table 8.
Machine learning models are used to analyze large datasets to identify patterns
and trends in the data that can be used to make predictions about the risk of breast
cancer. The models are trained on the data to recognize characteristics associated
with the disease. These predictions can help doctors make more accurate diagnoses
and improve treatment decisions. Additionally, machine learning models can identify
risk factors that may not have been previously known. Machine learning models can
also be used to monitor a patient’s condition over time, allowing doctors to detect
changes that may indicate a worsening of the disease. This can be used to adjust
treatment plans accordingly and improve patient outcomes (McKinney, 2012; Reback
et al., 2020; Rossant, 2014).
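As a rough illustration of such a model, the sketch below trains a classifier on the breast cancer dataset bundled with scikit-learn. Logistic regression, the 80/20 split, and the random seed are assumptions made for this example and are not necessarily the choices used elsewhere in the chapter.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the breast cancer dataset bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features, then fit a logistic regression classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on the held-out samples
y_predict = model.predict(X_test)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)
print(classification_report(y_test, y_predict))
```

Putting the scaler inside the pipeline means it is fitted on the training data only, which avoids leaking information from the test set.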
Table 9.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Figure 3.
plt.figure(figsize=(20,10))
sns.heatmap(df_cancer.corr(), annot=True)
Figure 4.
Figure 5.
print(classification_report(y_test, y_predict))
Table 6.
In this code, we first load the breast cancer dataset and standardize the data
using the StandardScaler class from scikit-learn. We then perform principal
component analysis (PCA) to reduce the dimensionality of the data to 2, and filter
for differentially expressed genes, approximated here by the features whose mean
lies above the 75th percentile of all feature means, using the mean and quantile
methods from pandas. Finally, we perform clustering analysis using the KMeans class
from scikit-learn and visualize the results using seaborn.
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# Load breast cancer dataset
cancer = load_breast_cancer()
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
# Standardize data (zero mean, unit variance per feature)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
# Perform PCA to reduce dimensions
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)
# Filter for differentially expressed genes
# (features whose mean exceeds the 75th percentile of all feature means)
diff_exp_genes = df.columns[df.mean() > df.mean().quantile(0.75)]
print(diff_exp_genes.tolist())
# Perform clustering analysis (fixed seed so the clusters are reproducible)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(pca_data)
# Visualize results
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(x=pca_data[:, 0], y=pca_data[:, 1],
                hue=cancer['target'], palette='coolwarm')
plt.title('Actual Labels')
plt.subplot(1, 2, 2)
sns.scatterplot(x=pca_data[:, 0], y=pca_data[:, 1],
                hue=kmeans.labels_, palette='coolwarm')
plt.title('K-Means Clustering')
plt.show()
Figure 6.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J. M., Devin, M.,
Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore,
S., Murray, D. G., Steiner, B., Tucker, P. A., Vasudevan, V., Warden, P., . . . Zheng, X.
(2016). TensorFlow: A system for large-scale machine learning. Operating Systems
Design and Implementation, 265–283. doi:10.5555/3026877.3026899
Chollet, F. (2017). Deep Learning with Python. http://cds.cern.ch/record/2301910
Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B., Cox, C. J., Dalke, A., Friedberg,
I., Hamelryck, T., Kauff, F., Wilczyński, B., & De Hoon, M. (2009). Biopython:
Freely available Python tools for computational molecular biology and bioinformatics.
Bioinformatics (Oxford, England), 25(11), 1422–1423.
doi:10.1093/bioinformatics/btp163 PMID:19304878
Hunter, J. (2007). Matplotlib: A 2D graphics environment. Computing in Science
& Engineering, 9(3), 90–95. doi:10.1109/MCSE.2007.55
Jarvis, R. M., Broadhurst, D., Johnson, H. E., O’Boyle, N. M., & Goodacre, R. (2006).
PYCHEM: A multivariate analysis package for python. Bioinformatics (Oxford,
England), 22(20), 2565–2566. doi:10.1093/bioinformatics/btl416 PMID:16882648