
Chapter 1
An Exploration of Python
Libraries in Machine Learning
Models for Data Science
Jawahar Sundaram (https://orcid.org/0000-0002-8101-8725), CHRIST University (Deemed), India
Sujith Jayaprakash (https://orcid.org/0000-0003-1933-6922), BlueCrest University College, Ghana
K. Gowri, Sri Ramakrishna College of Arts and Science, India
Harishchander Anandaram (https://orcid.org/0000-0003-2993-5304), Amrita Vishwa Vidyapeetham, India
S. Devaraju (https://orcid.org/0000-0003-3116-4772), VIT Bhopal University, India
C. Manivasagan, Rathnavel Subramaniam College of Arts and Science, India
S. Gokuldev (https://orcid.org/0000-0001-8393-4674), Rathinam College of Arts and Science, India
M. Thenmozhi (https://orcid.org/0009-0002-0846-2325), Sri Eshwar College of Engineering, India

DOI: 10.4018/978-1-6684-8696-2.ch001

ABSTRACT
Python libraries are used in this chapter to build data science models. Constructing models that can predict and act on data is a core activity of data science and draws heavily on machine learning. Because of the exponential growth of data, data science has become an essential component of a number of fields, and Python is a popular programming language for implementing machine learning models. The chapter discusses machine learning's role in data science, Python's role in the field, and how Python can be used in practice. A breast cancer dataset serves as the data source for building machine learning models with Python libraries. Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, and TensorFlow are among the Python libraries discussed in this chapter, in addition to data preprocessing methods. A number of machine learning models for breast cancer prediction are built using this dataset and these libraries, and a discussion of machine learning's future in data science is provided at the conclusion of the chapter. Python libraries for machine learning are very useful for data scientists and researchers in general.

1. INTRODUCTION

The process of data mining (DM) involves preparing data from different sources, such as databases, text files, and streams, and then modeling that data using a variety of techniques depending on the goal, such as classification, clustering, regression, or association rule mining. Applying machine learning (ML) techniques in DM enables the discovery of new knowledge in an organization. Data preparation is the part of data analysis that covers preprocessing and manipulating the data. Preprocessing includes cleaning, integrating, transforming, and reducing raw data to make it more suitable for analysis, while data wrangling takes the preprocessed data and changes its format so that it can be modeled more easily.
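As a brief illustration of these preparation steps, the following sketch shows a few common cleaning and transformation operations with Pandas; the column names and values are made up purely for illustration.

import pandas as pd
import numpy as np

# Hypothetical raw data with missing values and inconsistent types
raw = pd.DataFrame({
    'age': [25, np.nan, 47, 32],
    'income': ['50000', '64000', None, '58000'],
    'city': ['Pune', 'pune', 'Chennai', 'Delhi']})

clean = (raw
         .assign(income=pd.to_numeric(raw['income']))    # transform: strings to numbers
         .fillna({'age': raw['age'].mean()})             # clean: impute missing ages
         .dropna(subset=['income'])                      # clean: drop rows still missing income
         .assign(city=lambda d: d['city'].str.title()))  # normalize inconsistent categories
print(clean)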
Machine learning has grown rapidly in the past few years, and today there are many types and subtypes of machine learning. The field studies what makes computers capable of learning on their own without being explicitly programmed, which makes it possible to solve problems that are difficult to solve with hand-written rules. Machine learning models perform classification, clustering, or regression depending on the purpose they are intended to serve. A linear regression model, for example, is used to understand the relationship between numerical inputs and outputs.
Classification models can be used to identify the sentiment of a post: an individual review can be classified as positive or negative based on the words it contains, and emails can be classified as spam or not based on their contents. A clustering model finds objects whose characteristics are similar to each other. ML algorithms are applied in many different domains in interesting ways.
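As a minimal sketch of the classification idea described above, a bag-of-words representation can be paired with a simple scikit-learn classifier; the tiny example reviews and labels below are made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training reviews and labels (1 = positive, 0 = negative)
texts = ["great product, loved it", "terrible quality, broke fast",
         "excellent value", "awful experience, do not buy"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()           # turn words into count features
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)     # train a Naive Bayes classifier

# Predict the sentiment of a new, unseen review
print(clf.predict(vectorizer.transform(["loved the excellent quality"])))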


Machine learning is a powerful tool that allows data scientists to analyze large amounts of data and to develop models that make predictions and decisions. It automates the process of finding patterns and insights in data, which improves the efficiency and accuracy of decision-making. Machine learning can also be used to create new products and services, automate routine tasks such as customer service, identify new opportunities, improve customer experience, and reduce costs. By using machine learning algorithms, data scientists can quickly process large amounts of data and find patterns and correlations that would otherwise be difficult to discover, enabling more informed decisions and products and services that are better tailored to customer needs.
The objectives of this chapter are (1) to provide a basic understanding of the Python libraries used in data science for machine learning, (2) to show how these libraries are used for analysing biological sequences, and (3) to identify differentially expressed genes in a sequence dataset for prediction.

2. MACHINE LEARNING FRAMEWORK

A machine learning system is software that is programmed to learn from past experience and improve itself based on what it has learned. The following section discusses how to apply machine learning to solve a problem using the architecture components shown in Figure 1, which gives a graphical representation of the typical steps required to construct a machine learning model. With this framework, one can develop predictive models for machine learning, data science, and other areas that require predictive modelling.
The training step of the learning process is where the algorithms are tuned on the collected data. Two kinds of data are used when training an algorithm: the training set and the test set. Learning itself is the process of continually improving the software or machine as it acquires new knowledge and capabilities.
A data preparation library is a crucial element of data science, since the field is heavily reliant on data. Python's pandas library is currently among the most popular and capable libraries in the field (McKinney, 2011). Pandas supports many formats for input and output data, including Excel, CSV, HTML, and SQL.


Figure 1. Machine learning framework

The Pandas package also offers powerful querying capabilities, statistical calculations, and basic visualization tools. It is well documented, though its sometimes-confusing syntax is arguably its most significant drawback.

3. PYTHON LIBRARIES OVERVIEW

A Python library is a collection of pre-written code that developers can use to perform standard tasks instead of writing them from scratch. Python libraries are installed with a package manager such as pip or conda (Vanderplas, 2016).

Table 1. Key publications for widely used Python libraries

Library | Author(s) | Journal Title | Year of Publication | Abstract Explanation
NumPy | Travis Oliphant | Proceedings of the IEEE | 2006 | NumPy is a fundamental library for scientific computing in Python. It provides powerful data structures and efficient functions for numerical operations, making it an essential tool for data analysis and machine learning tasks.
Pandas | Wes McKinney | arXiv | 2010 | Pandas is a popular library for data manipulation and analysis. It provides data structures like DataFrames, which allow easy handling of structured data. Pandas also offers a wide range of functions for data cleaning, transformation, and exploration, making it a valuable tool for data scientists.
Scikit-learn | Fabian Pedregosa et al. | Journal of Machine Learning Research | 2011 | Scikit-learn is a versatile machine learning library in Python. It provides a unified interface for various machine learning algorithms and tools for data preprocessing, model selection, and evaluation. With a large collection of algorithms and extensive documentation, Scikit-learn is widely used for building machine learning models.
TensorFlow | Martín Abadi et al. | arXiv | 2015 | TensorFlow is an open-source library developed by Google for numerical computation and machine learning. It offers a flexible framework for building and deploying deep learning models, with support for distributed computing and deployment on various platforms. TensorFlow has gained popularity for its scalability and extensive ecosystem of pre-trained models.
Keras | François Chollet | Journal of Machine Learning Research | 2015 | Keras is a high-level neural networks API written in Python. It provides a user-friendly interface for building and training deep learning models. Keras allows quick prototyping and experimentation with different architectures, making it a popular choice for both beginners and experienced deep learning practitioners.
PyTorch | Adam Paszke et al. | Advances in Neural Information Processing Systems | 2019 | PyTorch is a widely used deep learning library known for its dynamic computational graph and efficient GPU acceleration. It offers a seamless development experience with support for automatic differentiation and a flexible design that enables researchers to implement complex deep learning models with ease.
XGBoost | Tianqi Chen et al. | Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining | 2016 | XGBoost is an optimized gradient boosting library that excels in solving structured data problems. It provides an implementation of the gradient boosting algorithm with enhanced performance and scalability. XGBoost has won numerous data science competitions and is widely used in various domains.
LightGBM | Guolin Ke et al. | Neural Information Processing Systems | 2017 | LightGBM is a gradient boosting framework that focuses on efficiency and accuracy. It employs a novel tree-based learning algorithm and offers faster training speeds compared to other gradient boosting implementations. LightGBM is particularly suitable for large-scale datasets and has gained popularity for its competitive performance.


Python libraries can be classified into two categories: built-in libraries and external libraries. Built-in libraries ship with the language and do not need to be installed separately, whereas third-party libraries are created by external developers and must be installed before they can be used (Abadi et al., 2016).
In Python scripts, import statements are used to include libraries, which are distributed as .py files (or compiled modules). Whenever a library is imported, its classes, functions, and variables become available to the script (Chollet, 2017).
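As a small illustration of the import mechanism (NumPy is used here purely as an example of an installed library), a module can be brought into a script in several common ways:

import numpy                  # import the whole library
import numpy as np            # import it under a short alias
from numpy import sqrt        # import a single name from it

print(numpy.mean([1, 2, 3]))  # call a function through the library name
print(np.mean([1, 2, 3]))     # call it through the alias
print(sqrt(16.0))             # call the directly imported function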
The Python programming language supports a variety of libraries that are useful
for data analysis and visualization, web development, machine learning, and scientific
computing. Some of the popular Python libraries are represented in Figure 2.

Figure 2. Different Python libraries

As shown in Figure 2, the libraries can be grouped graphically by the different ways they are used. Many Python libraries are available; the figure gives a brief overview of some of the most common ones.


Table 2. Python libraries and representative real-world uses

Library | Real-World Example
NumPy | Scientific computing, numerical operations
Pandas | Data manipulation, analysis, and cleaning
Matplotlib | Data visualization, plotting
TensorFlow | Machine learning, deep learning
Keras | Neural network models, deep learning
Scikit-learn | Machine learning algorithms, model training
NLTK | Natural language processing, text mining
BeautifulSoup | Web scraping, parsing HTML and XML
Django | Web development framework, building web applications
OpenCV | Computer vision, image and video processing
PyTorch | Deep learning, neural networks
PySpark | Distributed computing, big data processing

4. DIFFERENT PYTHON LIBRARIES

Python is one of the most powerful programming languages available today and is used for a wide range of tasks, such as web development, data analysis, and machine learning (Pedregosa et al., 2011). Over the years, the Python community has developed a broad range of libraries, as a result of which the language can perform a wide range of functions far more easily and is more capable than it has ever been.

4.1 Python Libraries Used for Data Science

Python is one of the most popular languages for data science, and many libraries are available for machine learning, data analysis, and related tasks. Here are some of the most commonly used libraries for data science in Python:

NumPy

Python's NumPy library is one of the most popular libraries for scientific computation. Its functions make it possible to perform mathematical operations on arrays and matrices.


Example Code:

import numpy as np
# Create a NumPy array of integers
a = np.array([1, 2, 3, 4, 5])
# Create a NumPy array of floating-point numbers
b = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
# Perform some numerical operations on the arrays
c = a + b
d = a * b
e = np.sqrt(a)
# Print the arrays and the results of the operations
print('a:', a)
print('b:', b)
print('a + b:', c)
print('a * b:', d)
print('sqrt(a):', e)

Pandas

Pandas is a Python library for manipulating and analyzing data in a variety of ways. It provides data structures such as DataFrames and Series, along with functions for cleaning, transforming, and analyzing data.
Example Code:

import pandas as pd
# Create a Pandas DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 32, 18, 47],
    'Gender': ['F', 'M', 'M', 'M']})
# Print the DataFrame
print(df)
# Filter the DataFrame to only show rows where Age is greater than 30
df_filtered = df[df['Age'] > 30]
# Print the filtered DataFrame
print(df_filtered)
# Group the DataFrame by Gender and calculate the mean Age for each group
df_grouped = df.groupby('Gender').agg({'Age': 'mean'})
# Print the grouped DataFrame
print(df_grouped)


Matplotlib

The Matplotlib library is used to create static, animated, and interactive visualizations in Python. It offers a wide range of customizable chart types and supports many output formats.
Example Code:

import matplotlib.pyplot as plt

# Create some sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create a line chart
plt.plot(x, y)
# Add some labels and a title
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Sample line chart')
# Display the chart
plt.show()

Scikit-learn

scikit-learn is a machine learning library for Python. It supports machine learning algorithms such as classification, regression, and clustering, as well as tools for selecting and evaluating models.
Example Code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Load the iris dataset
iris = load_iris()
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
# Train a decision tree classifier on the training data
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
# Test the classifier on the testing data
accuracy = clf.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

TensorFlow

TensorFlow provides support for a variety of machine learning and deep learning algorithms in Python. It includes tools for creating and training neural networks, as well as tools for evaluating and deploying models.
Example Code:

import tensorflow as tf
from tensorflow.keras.datasets import mnist
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Normalize the data
x_train = x_train / 255.0
x_test = x_test / 255.0
# Create a simple neural network model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)])
# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
# Train the model
model.fit(x_train, y_train, epochs=5)
# Evaluate the model on the testing data
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")

Keras

Keras is a Python-based high-level neural network library that runs on top of backends such as TensorFlow (and, historically, Theano and CNTK). Its easy-to-use interface makes building deep learning models straightforward.


Example Code:

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
# Create a toy dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
# Create a simple neural network model
model = Sequential()
model.add(Dense(8, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model
model.fit(X, y, epochs=1000, verbose=0)
# Evaluate the model on the training data
scores = model.evaluate(X, y)
print(f"{model.metrics_names[1]}: {scores[1] * 100}")

Statsmodels

Statsmodels is a Python library for statistical modeling and analysis. The package includes tools for data exploration and visualization, along with support for a variety of statistical tests and models for examining data.
Example Code:

import numpy as np
import statsmodels.api as sm
# Create a toy dataset
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([1, 3, 2, 5, 7, 8])
# Add a constant to the data for the intercept term
x = sm.add_constant(x)
# Fit a simple linear regression model
model = sm.OLS(y, x).fit()
# Print the model summary
print(model.summary())

A number of Python libraries are available for data science. Each has strengths and weaknesses, so the best choice depends on your specific needs (Hunter, 2007).

Table 3. Data science libraries: descriptions and key features

Library | Description | Key Features
NumPy | Numerical computing library supporting multi-dimensional arrays and matrix operations. | Arrays, mathematical operations, random number generation, linear algebra, and Fourier analysis.
Pandas | Library for data manipulation and analysis; its DataFrame is one of the most flexible and powerful structures for handling tabular data. | Preprocessing and cleaning data, exploring and analyzing data, indexing and selecting data, and merging and combining data sets.
Matplotlib | Provides a wide variety of plots and charts, including line plots, scatter plots, bar plots, and histograms, as well as the ability to define your own. | Customizable plots and subplots, 3D plots, animations, and interactive plots.
Scikit-learn | Machine learning library providing algorithms for supervised and unsupervised learning. | Preprocessing, clustering, classification, regression, dimensionality reduction, and model selection procedures.
TensorFlow | Google's open-source platform for building and training machine learning models. | Multi-language support, high-level APIs for model development, and automatic differentiation.
Keras | TensorFlow-based neural network API for building and training deep learning models at a high level. | Pre-trained models, easy-to-use APIs, and support for convolutional, recurrent, and dense networks.
Statsmodels | Collection of statistical models and tests for a variety of applications. | Linear and generalized linear models, mixed-effects models, time series analysis, and survival analysis.

4.2 Python Libraries Used for Machine Learning

A number of Python libraries are available for machine learning, and Python is a
popular programming language for machine learning. Here are some of the most
commonly used libraries for machine learning in Python:


Table 4. Commonly used machine learning libraries

Library | Description | Main Use Cases | Programming Language
Scikit-learn | Machine learning library for traditional models | Classification, regression, clustering, etc. | Python
TensorFlow | Deep learning framework | Neural networks, natural language processing | Python, C++, Java
Keras | High-level API for deep learning | Rapid prototyping, easy model building | Python
PyTorch | Deep learning framework | Neural networks, natural language processing | Python
OpenCV | Computer vision library | Image and video processing | C++, Python, Java
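Of the libraries in Table 4, PyTorch is the only deep learning framework not illustrated elsewhere in this chapter, so a minimal sketch follows. The tiny network and the XOR-style data are made up purely for illustration.

import torch
import torch.nn as nn

# Made-up data: 4 samples with 2 features each, and binary targets
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# A small feed-forward network
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Short training loop
for epoch in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(model(X).detach().round())  # predictions after training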

4.3 Python Libraries Used for Web Development

Python's extensive ecosystem of libraries and frameworks makes developing websites an easy and enjoyable experience. Some of the most popular ones are:

Django

Django is a high-level web framework built around the widely used model-view-controller (MVC) architectural pattern (Waskom, 2021). It includes an object-relational mapper (ORM) for working with databases, along with a built-in admin interface for managing the contents of a site. It is designed to be fast, secure, and scalable.
Example Code:

from django.http import HttpResponse

def index(request):
    return HttpResponse("Hello, world!")

Flask

Developers can build web applications quickly with Flask, a flexible framework that lets them write code quickly. It is designed to be lightweight and easy to use, which makes it a good choice for beginners. Although it lacks some built-in features, a wide range of extensions can be installed to customize it to your needs.


Example Code:

from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'Hello, world!'

if __name__ == '__main__':
    app.run()

Pyramid

Pyramid is an open-source framework that supports web applications ranging from small to large. It is highly customizable and adheres to the "Don't Repeat Yourself" (DRY) principle, which makes it extremely flexible. Its features include URL routing, templating, and security.
Example Code:

from pyramid.response import Response
from pyramid.view import view_config

@view_config(route_name='home')
def home(request):
    return Response('Hello, world!')

Bottle

Bottle is a micro web framework that is very simple and easy to use. It is lightweight and fast, and it provides a set of basic features, such as routing, templating, and data integration, that are extremely useful.
Example Code:

from bottle import route, run

@route('/')
def index():
    return 'Hello, world!'

if __name__ == '__main__':
    run()


Web2py

Web2py is a framework based on the MVC architectural pattern and is designed to be both scalable and secure. It includes many built-in features, such as an ORM, an admin interface, and a web-based development environment, all of which make it much easier to use.
Example Code:

def index():
    return 'Hello, world!'

def user():
    return 'User page'

The Python language offers a wide variety of free libraries and frameworks for web development. The right library or framework depends on the specific requirements of your project.

Table 5. Pros and cons of Python web frameworks

Framework | Pros | Cons
Django | Powerful built-in features, such as authentication, URL routing, and object-relational mapping (ORM); extensive documentation; and a large community of developers. | Can be overkill for smaller projects, can be difficult to customize, and can have a steep learning curve for beginners.
Flask | Easy to set up and get started, minimal boilerplate code, and a large ecosystem of extensions and plugins. | Limited built-in features, which can lead to more code and setup for larger projects, and requires more manual configuration.
Pyramid | Offers a balance between simplicity and flexibility, provides a wide range of built-in features, and has excellent documentation. | Can be less intuitive to use than other frameworks, and requires more manual configuration for certain tasks.
Bottle | Simple and easy to get started with, requires minimal setup and configuration, and has a small footprint. | Limited built-in features, which can lead to more code for larger projects, and a smaller community of developers.
Web2py | Offers a comprehensive set of built-in features, such as authentication, database abstraction, and a web-based IDE; has a simple and intuitive syntax; and is designed to be easy to learn and use. | Can be less flexible than other frameworks, and can have performance issues with large-scale applications.


4.4 Data Visualization

Several data visualization libraries that are popular with data scientists and analysts are available in Python (Reback & McKinney, 2020; Satyanarayan et al., 2017; Wickham, 2009).

Matplotlib

Matplotlib is an easy-to-use and highly customizable data visualization library. It supports a wide range of chart types for showcasing data, including line, bar, scatter, and histogram charts.
Example Code:

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create a plot
plt.plot(x, y)
# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sample plot')
# Show the plot
plt.show()

Seaborn

Seaborn is a data visualization library built on top of Matplotlib that provides an interface for creating more advanced visualizations. It supports heatmaps, violin plots, regression plots, and other chart types, which makes it well suited to exploratory data analysis.
Example Code:

import seaborn as sns
import matplotlib.pyplot as plt
# Load the iris dataset
iris = sns.load_dataset("iris")
# Create a scatter plot of the iris data
sns.scatterplot(data=iris, x="sepal_length", y="sepal_width", hue="species")
# Show the plot
plt.show()

Plotly

Plotly is a data visualization library for creating interactive, web-based charts and graphs in minutes. It can be used to create scatter plots, line charts, bar charts, and three-dimensional visualizations of data.
Example Code:

import plotly.graph_objs as go
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create a plotly figure
fig = go.Figure(data=go.Scatter(x=x, y=y, mode='markers'))
# Add labels and title
fig.update_layout(title='Sample plot', xaxis_title='X-axis', yaxis_title='Y-axis')
# Show the plot
fig.show()

ggplot

ggplot is a Python implementation of ggplot2, one of the most popular R libraries (provided in Python by the plotnine package). It creates visualizations using a grammar-of-graphics interface, is particularly useful for plots with multiple layers and facets, and is easy to use.
Example Code:

import pandas as pd
from plotnine import ggplot, aes, geom_point, labs

# Sample data
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 6, 8, 10]})
# Create a plotnine plot with labels and a title
plot = (ggplot(df, aes(x='x', y='y'))
        + geom_point()
        + labs(title='Sample plot', x='X-axis', y='Y-axis'))
# Show the plot (printing a plotnine plot renders it)
print(plot)

Altair

Altair is a declarative visualization library for creating interactive visualizations of data. A wide variety of chart types can be used, such as scatter plots, line charts, and heatmaps, and they can be combined to provide dynamic and engaging views of the data.
Example Code:

import altair as alt
from vega_datasets import data
# Load the iris dataset
iris = data.iris()
# Create a scatter plot of the iris data
scatter_plot = alt.Chart(iris).mark_point().encode(
    x='sepalLength',
    y='sepalWidth',
    color='species')
# Show the plot
scatter_plot.show()

Table 6. Visualization libraries and popular chart types

Library | Popular Charts
Matplotlib | Line charts, bar charts, scatter plots, histograms
Seaborn | Violin plots, box plots, count plots
Plotly | Interactive line charts, scatter plots, bar charts
ggplot | Line charts, bar charts, scatter plots, histograms
Altair | Line charts, bar charts, scatter plots, histograms


4.5 Python Libraries Used for Web Scraping

Python supports several libraries for retrieving data from websites, a task known as web scraping (Cock et al., 2009; Jarvis et al., 2006). Here are some of the most commonly used ones:

Beautiful Soup

Beautiful Soup is a Python library for extracting information from HTML and XML documents. Its simple and intuitive interface makes it easy to parse HTML and extract data from its contents.
Example Code:

import requests
from bs4 import BeautifulSoup
# Make a request to a web page
page = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
# Create a Beautiful Soup object
soup = BeautifulSoup(page.content, 'html.parser')
# Find the page title
title = soup.title.string
# Find all the paragraph tags on the page
paragraphs = soup.find_all('p')
# Print the page title and the first paragraph
print(title)
print(paragraphs[0])

Scrapy

Scrapy is a Python framework that provides users with a comprehensive set of tools for scraping the web, including URL management, data extraction, and spider middleware. It is often used for large-scale web scraping projects.
Example Code:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://en.wikipedia.org/wiki/Python_(programming_language)']

    def parse(self, response):
        # Find the page title
        title = response.css('title::text').get()
        # Find all the paragraph tags on the page
        paragraphs = response.css('p::text').getall()
        # Print the page title and the first paragraph
        print(title)
        print(paragraphs[0])

Selenium

Selenium is a Python library for automating web browsers, commonly used for automated testing. Because it automates interactions with web pages, such as clicking buttons and completing forms, it can also be used for web scraping.
Example Code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Create a new Chrome browser instance
browser = webdriver.Chrome(service=Service('/path/to/chromedriver'))
# Navigate to a web page
browser.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
# Find the page title
title = browser.title
# Find all the paragraph tags on the page
paragraphs = [elem.text for elem in browser.find_elements(By.TAG_NAME, 'p')]
# Print the page title and the first paragraph
print(title)
print(paragraphs[0])
# Close the browser
browser.quit()

Requests

Requests is a Python library for making HTTP requests. Websites are scraped by sending HTTP requests to a site and parsing the HTML or JSON response that is returned.


Example Code:

import requests
# Make a GET request to a web page
response = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
# Print the response status code
print(response.status_code)
# Print the response content
print(response.content)

PyQuery

PyQuery is a Python library with a jQuery-like interface. Its simple and intuitive design makes it easy to parse and extract data from HTML and XML documents.
Example Code:

from pyquery import PyQuery as pq

# Parse an HTML string
html = '''
<html>
  <head>
    <title>PyQuery Example</title>
  </head>
  <body>
    <h1>PyQuery Example</h1>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
'''
doc = pq(html)
# Find the page title
title = doc('title').text()
# Find all the list items on the page
items = [elem.text for elem in doc('li')]
# Print the page title and the list items
print(title)
print(items)

LXML

LXML is a Python library for parsing XML and HTML documents. It provides many features for working with XML and HTML data in a variety of ways.
Example Code:

from lxml import etree

# Parse an XML string
xml = '''
<root>
  <title>LXML Example</title>
  <items>
    <item>Item 1</item>
    <item>Item 2</item>
    <item>Item 3</item>
  </items>
</root>
'''
doc = etree.fromstring(xml)
# Find the page title
title = doc.find('title').text
# Find all the list items on the page
items = [elem.text for elem in doc.findall('.//item')]
# Print the page title and the list items
print(title)
print(items)

Table 7. Web scraping libraries and popular use cases

Library | Popular Use Cases
Beautiful Soup | Scraping HTML and XML documents
Selenium | Scraping dynamic web pages, testing web applications
Requests | Scraping JSON and XML APIs
PyQuery | Scraping HTML and XML documents
LXML | Scraping HTML and XML documents


4.6 Python Libraries Used for Bioinformatics

Biopython

Biopython is a Python library that provides an extensive set of tools for bioinformatics tasks, including sequence analysis, protein structure analysis, and parsing of the most common file formats.
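No example was given here in the original text, so a minimal sketch of basic sequence handling with Biopython follows; the short DNA string is made up for illustration.

from Bio.Seq import Seq

# A made-up DNA sequence
dna = Seq("ATGGCCATTGTAATGGGCCGC")
print(dna.complement())          # complementary strand
print(dna.reverse_complement())  # reverse complement
print(dna.transcribe())          # DNA -> mRNA
print(dna.translate())           # codons -> protein sequence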

SciPy

The SciPy library can be used for scientific computation in Python. It supports various bioinformatics-related tasks, including statistical analysis, optimization, and signal processing.
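As a brief sketch of the statistical-analysis use just mentioned, the following compares two made-up groups of expression measurements with a two-sample t-test from SciPy's stats module.

import numpy as np
from scipy import stats

# Made-up expression values for two groups of samples
group_a = np.array([5.1, 4.8, 5.5, 5.0, 5.2])
group_b = np.array([6.0, 6.3, 5.9, 6.1, 6.4])

# Two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")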

Table 8. Bioinformatics-related libraries and popular use cases

Library | Popular Use Cases
Biopython | DNA and protein sequence analysis, structure prediction and analysis, phylogenetics
SciPy | Statistical analysis, machine learning, data visualization
scikit-learn | Predictive modeling, data analysis, pattern recognition
Matplotlib | Data visualization, publication-quality figures

5. CANCER DNA SEQUENCE USING PYTHON LIBRARIES

Machine learning models are used to analyze large datasets to identify patterns
and trends in the data that can be used to make predictions about the risk of breast
cancer. The models are trained on the data to recognize characteristics associated
with the disease. These predictions can help doctors make more accurate diagnoses
and improve treatment decisions. Additionally, machine learning models can identify
risk factors that may not have been previously known. Machine learning models can
also be used to monitor a patient’s condition over time, allowing doctors to detect
changes that may indicate a worsening of the disease. This can be used to adjust
treatment plans accordingly and improve patient outcomes (McKinney, 2012; Reback
et al., 2020; Rossant, 2014).


As this example is implemented in a Jupyter Notebook, we first import the necessary libraries and then build the breast cancer example from the dataset. The Breast Cancer Wisconsin (Diagnostic) Data Set, which ships with Scikit-Learn, is used throughout.
A comparative analysis of Python libraries, including other sources, is given in Table 9:

Table 9. Comparative analysis of Python libraries

Library | Pros | Cons
NumPy | Efficient numerical operations, multi-dimensional arrays, broadcasting capabilities | Steeper learning curve for beginners, limited support for string manipulation
pandas | Powerful data manipulation and analysis, easy integration with other libraries | Memory-intensive for large datasets, slower performance compared to NumPy for numerical computations
Matplotlib | Flexible data visualization options, extensive customization capabilities | Complex syntax for certain plot types, limited interactivity compared to other libraries
seaborn | Simplified syntax for statistical visualization, aesthetically pleasing default styles | Limited customization options, may require additional libraries for advanced plotting features
scikit-learn | Comprehensive machine learning algorithms and tools, well-documented API | Limited support for deep learning, may require additional libraries for advanced techniques
TensorFlow | Powerful deep learning framework, support for neural networks, distributed computing | Steeper learning curve, verbosity in certain operations, limited support for non-neural network models
Keras | User-friendly deep learning library, easy model prototyping, seamless integration with TensorFlow | Less flexibility for complex model architectures, slower performance compared to TensorFlow
PyTorch | Dynamic computational graph, support for neural networks, strong community support | Limited deployment options compared to TensorFlow, slower performance on certain tasks
SQLAlchemy | Database abstraction layer, supports multiple database systems, ORM capabilities | Steeper learning curve for complex queries, some performance overhead compared to raw SQL
Django | Full-featured web framework, built-in authentication and admin interface, scalable | Relatively heavy, may not be suitable for smaller projects or microservices

Code to import the necessary libraries and load the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

# Load the dataset into a DataFrame with the target as an extra column
cancer = load_breast_cancer()
df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                         columns=np.append(cancer['feature_names'], 'target'))
df_cancer.shape
# Plot the class distribution of the target
sns.countplot(x=df_cancer['target'])

Figure 3.

plt.figure(figsize=(20,10))
sns.heatmap(df_cancer.corr(), annot=True)


Figure 4.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X = df_cancer.drop(['target'], axis=1)
y = df_cancer['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
svc_model = SVC()
svc_model.fit(X_train, y_train)
from sklearn.metrics import classification_report, confusion_matrix
y_predict = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True)


Figure 5.

print(classification_report(y_test, y_predict))

Table 10. Classification report for the SVC model on the test set

Class | Precision | Recall | F1-Score | Support
0.0 | 1.00 | 0.85 | 0.92 | 48
1.0 | 0.90 | 1.00 | 0.95 | 66
Accuracy | | | 0.94 | 114
Macro avg | 0.95 | 0.93 | 0.94 | 114
Weighted avg | 0.94 | 0.94 | 0.94 | 114

Breast cancer dataset differential gene expression analysis:


In this code, we first load the breast cancer dataset and normalize the data
using the StandardScaler function from scikit-learn. We then perform principal
component analysis (PCA) to reduce the dimensionality of the data to 2, and filter
for differentially expressed genes using the mean and quantile functions from pandas.
Finally, we perform clustering analysis using the KMeans function from scikit-learn
and visualize the results using seaborn.

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# Load breast cancer dataset
cancer = load_breast_cancer()
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
# Normalize data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
# Perform PCA to reduce dimensions
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)
# Filter for differentially expressed genes
diff_exp_genes = df.columns[df.mean() > df.mean().quantile(0.75)]
# Perform clustering analysis
kmeans = KMeans(n_clusters=2)
kmeans.fit(pca_data)
# Visualize results
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(x=pca_data[:, 0], y=pca_data[:, 1], hue=cancer['target'], palette='coolwarm')
plt.title('Actual Labels')
plt.subplot(1, 2, 2)
sns.scatterplot(x=pca_data[:, 0], y=pca_data[:, 1], hue=kmeans.labels_, palette='coolwarm')
plt.title('K-Means Clustering')
plt.show()

Figure 6.

There are, however, challenges and limitations in using Python libraries for machine learning models:
Scalability: When it comes to scaling machine learning applications, Python's Global Interpreter Lock (GIL) can be one of the biggest limitations. Because the GIL ensures that only one thread executes Python bytecode at a time, it can hamper parallel execution and limit the use of multi-core processors. Python may therefore not be the most efficient choice for highly parallelizable tasks, although there are workarounds, such as using multiprocessing or offloading intensive computations to a lower-level language where possible; a sketch of the multiprocessing approach is shown below.
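A minimal sketch of the multiprocessing workaround mentioned above follows; the CPU-bound function and the input sizes are made up for illustration. Each worker is a separate process with its own interpreter and GIL, so the tasks can run on separate CPU cores.

from multiprocessing import Pool

def heavy_computation(n):
    # Made-up CPU-bound work: sum of squares up to n
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    inputs = [1_000_000, 2_000_000, 3_000_000, 4_000_000]
    # Map the work across a pool of four worker processes
    with Pool(processes=4) as pool:
        results = pool.map(heavy_computation, inputs)
    print(results)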
Limited hardware acceleration support: Python libraries such as TensorFlow and PyTorch provide interfaces for hardware acceleration frameworks such as CUDA for GPUs. Certain operations or custom models may, however, lack GPU optimizations, resulting in reduced performance on GPU-accelerated systems. Support for specialized hardware accelerators such as TPUs may also be limited or experimental in some Python libraries; a short check for available accelerators is sketched below.
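As a hedged illustration (assuming both TensorFlow and PyTorch are installed), the following sketch checks which accelerators each framework can see before running GPU-dependent code.

import tensorflow as tf
import torch

# List GPUs visible to TensorFlow
print("TensorFlow GPUs:", tf.config.list_physical_devices('GPU'))

# Check whether PyTorch can use CUDA, and place a tensor on the GPU if so
if torch.cuda.is_available():
    x = torch.ones(3, 3, device='cuda')
    print("PyTorch tensor on:", x.device)
else:
    print("CUDA not available; PyTorch will run on CPU")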


6. CONCLUSION AND FUTURE WORK

In conclusion, this exploration of Python libraries in the context of machine learning models for data science has illustrated how many tools are available to data scientists for building and analyzing models. To solve complex data science problems efficiently and effectively, researchers use libraries such as NumPy and Pandas to manipulate data, Scikit-Learn and TensorFlow to apply machine learning to the data, and Matplotlib and Seaborn to visualize it. A number of other libraries and algorithms can be investigated in future work in addition to those described in this chapter: Scikit-learn and TensorFlow can be compared with PyTorch and OpenCV in terms of their capabilities, and Plotly and Altair can be compared with the visualization libraries Matplotlib and Seaborn. New algorithms and libraries for data science and machine learning continue to be developed and released, and future exploration and experimentation with these tools will likely benefit both researchers and practitioners in the field.

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J. M., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P. A., Vasudevan, V., Warden, P., . . . Zheng, X. (2016). TensorFlow: A system for large-scale machine learning. Operating Systems Design and Implementation, 265–283. doi:10.5555/3026877.3026899

Chollet, F. (2017). Deep Learning with Python. http://cds.cern.ch/record/2301910

Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczyński, B., & De Hoon, M. (2009). Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England), 25(11), 1422–1423. doi:10.1093/bioinformatics/btp163 PMID:19304878

Hunter, J. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. doi:10.1109/MCSE.2007.55

Jarvis, R. M., Broadhurst, D., Johnson, H. E., O'Boyle, N. M., & Goodacre, R. (2006). PYCHEM: A multivariate analysis package for Python. Bioinformatics (Oxford, England), 22(20), 2565–2566. doi:10.1093/bioinformatics/btl416 PMID:16882648

McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the Python in Science Conferences. doi:10.25080/Majora-92bf1922-00a

McKinney, W. (2011). pandas: A foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing. https://www.dlr.de/sc/en/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf

McKinney, W. (2012). Python for Data Analysis. O'Reilly Media. http://ci.nii.ac.jp/ncid/BB11531826

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. HAL (Le Centre Pour La Communication Scientifique Directe). https://hal.inria.fr/hal-00650905

Reback, J., & McKinney, W. (2020). pandas-dev/pandas: Pandas 1.0.5. Zenodo. doi:10.5281/zenodo.3898987

Reback, J., McKinney, W., Van Den Bossche, J., Augspurger, T., Cloud, P., Klein, A., & Seabold, S. (2020). pandas-dev/pandas: Pandas 1.0.5. Zenodo.

Rossant, C. (2014). IPython Interactive Computing and Visualization Cookbook. https://scholarvox.library.inseec-u.com/catalog/book/docid/88851238?_locale=fr

Satyanarayan, A., Moritz, D., Wongsuphasawat, K., & Heer, J. (2017). Vega-Lite: A grammar of interactive graphics. IEEE Transactions on Visualization and Computer Graphics, 23(1), 341–350. doi:10.1109/TVCG.2016.2599030 PMID:27875150

Vanderplas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. http://cds.cern.ch/record/2276771

Waskom, M. (2021). seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. doi:10.21105/joss.03021

Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. http://ndl.ethernet.edu.et/bitstream/123456789/60263/1/107.pdf
