Bioinformatics with Python Cookbook

Computer Specifications and Python Setup

We will start by installing the basic software that is required for most of this book. This will include the Python distribution, some fundamental Python libraries, and our Jupyter Notebook environment. We will also set up our GitHub environment and gain access to the repository for the book. As different users have different requirements, we will cover two different approaches for installing the software. One approach is using the Anaconda Python (http://docs.continuum.io/anaconda/) distribution and the other is via Docker (a server virtualization method based on containers sharing the same operating system kernel; please refer to https://www.docker.com/). The Docker approach still installs Anaconda for you, but inside a container.

If you are using a Windows-based operating system, you are strongly encouraged to consider changing your operating system or using Docker via some of the existing options on Windows. On macOS, you should be able to install most of the software natively, though Docker is also available. Learning using a local distribution (Anaconda or something else) is easier than Docker, but given that package management can be complex in Python, Docker images provide a level of stability.

Many data scientists use a Mac because macOS makes it easy to work in a native Unix-style environment. We recommend using a similar computer for this book. In the Technical requirements section, we provide the specifications of the computer and libraries used to develop this book. In most cases, deviations from such a system should work fine with minimal modifications, but if you have trouble, you can try the Docker container. Another alternative is to use a cloud workstation (some options follow).

In this chapter, we will cover the following recipes:

Installing the required software with Anaconda
Installing the required software with Docker
Introduction to Jupyter Notebook

In this chapter, we will first install some prerequisite software – details of which are given in the Technical requirements section. Each recipe will then take you through the software and the steps that are needed to install it. Each chapter and section might have extra requirements on top of these – we will make those clear as the book progresses. An alternative way to start is to use the Installing the required software with Docker recipe, after which everything will be taken care of for you via a Docker container.

Installing the required basic software with Anaconda

Next, we will begin setting up your required software libraries, including Python itself. If you are already using a different Python distribution, you are strongly encouraged to consider Anaconda, as it has become the de facto standard for data science and bioinformatics. Also, it is the distribution that will allow you to install software from bioconda (https://bioconda.github.io/).

Getting ready

Python can be run on top of different environments. For instance, you can use Python inside the Java Virtual Machine (JVM) via Jython, or with .NET via IronPython. However, here, we are not only concerned with Python but also with the complete software ecology around it. Therefore, we will use the standard (CPython) implementation, since the JVM and .NET versions exist mostly to interact with the native libraries of these platforms.

For our code, we will be using Python 3.12. If you were starting with Python and bioinformatics, any operating system would work. But here, we are mostly concerned with intermediate to advanced usage, and so we will focus on macOS.

If you are on Windows and do not have easy access to macOS or Linux, don’t worry. Modern virtualization software (such as VirtualBox and Docker) will come to your rescue by allowing you to install a virtual OS on your operating system. Another option is to use Windows Subsystem for Linux (WSL2), which allows you to run Linux on Windows; see Microsoft’s official WSL documentation for details.

Another option for you will be to use a cloud workstation (see the Technical requirements section).

Bioinformatics and data science are moving at breakneck speed; this is not just hype, it’s a reality. When installing software libraries, choosing a version might be tricky. Depending on the code that you have, it might not work with some old versions or perhaps not even work with a newer version. Hopefully, any code that you use will indicate the correct dependencies, though this is not guaranteed. In this book, we will fix the precise versions of software packages, provide you with a minimal version, or specify one in the associated chapter YAML file, as appropriate, and we will make sure that the code works with them. Check your chapter’s README.md file or the Updates section of each notebook for more information. It is quite natural that the code might need tweaking with other package versions.

The software developed for this book is available at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-fourth-edition. To access it, you will need to install Git. First, make sure Homebrew is installed (https://brew.sh/), and then run the following:

brew install git

You can go to the GitHub page for the book and get the HTTPS link for downloading the source:

git clone https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Fourth-Edition.git

This will download the source code to your computer.

Before you install the Python stack, you will need to install all of the external non-Python software that you will be interoperating with. The list will vary from chapter to chapter, and all chapter-specific packages will be explained in their respective chapters. Most of the software is available via the conda package manager from the bioconda channel (https://bioconda.github.io/) or is pip installable (https://pypi.org/project/pip/).

Where possible in this book, we will allow you to do everything from your Jupyter notebook, even installing the software. To do this, we will use the ! command, which allows you to run a command that you would normally run from your Terminal from the notebook instead – for example:

! ls

This will run the ls or list directory command as if it had been run from the Terminal.
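The ! prefix is notebook (IPython) syntax, so it is not available in a plain Python script. As a minimal sketch of an equivalent, the standard library's subprocess module can run the same kind of command. The helper name run_command is our own and is used here only for illustration:

```python
import subprocess
import sys


def run_command(args):
    """Run a command (given as a list of arguments) and return its stdout as text."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Equivalent of "! ls" in a notebook cell (macOS/Linux):
# print(run_command(["ls"]))

# A portable example that works wherever Python itself runs.
output = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(output.strip())
```

Passing the command as a list avoids shell quoting issues, and check=True raises an error if the command fails instead of silently continuing.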

In some cases, for more involved installations, you will need to go into the Terminal, but we’ll advise you on how to do those steps as we go through the relevant recipes.

You will need to install some development compilers and libraries, all of which are free. On Ubuntu, consider installing the build-essential package (apt-get install build-essential), and on macOS, consider Xcode (https://developer.apple.com/xcode/).

We will mention many amazing Python libraries in this book, but here is a brief overview of some of the most important ones:

Name	Application	URL	Purpose
Biopython	All chapters	https://biopython.org/	Bioinformatics library
Biotite	Protein Design	https://www.biotite-python.org/latest/index.html	MultiTool and Protein Structure
Cython	Big data	http://cython.org/	High performance
Dask	Big data	http://dask.pydata.org	Parallel processing
DendroPy	Phylogenetics	https://dendropy.org/	Phylogenetics
HTSeq	NGS/Genomes	https://htseq.readthedocs.io	NGS processing
jupytext	Notebook conversion	https://jupytext.readthedocs.io/en/latest/	Convert your notebook to Python text
Keras	Deep Learning	https://keras.io/	Higher-level library for ML
Matplotlib	Visualization	https://matplotlib.org/	Graphing library
NumPy	All chapters	http://www.numpy.org/	Array/matrix processing
Numba	Big data	https://numba.pydata.org/	High performance
Project Jupyter	All chapters	https://jupyter.org/	Interactive computing
PyMOL	Proteomics	https://pymol.org	Molecular visualization
PyVCF	NGS	https://pyvcf.readthedocs.io	VCF processing
Pysam	NGS	https://github.com/pysam-developers/pysam	SAM/BAM processing
SciPy	All chapters	https://www.scipy.org/	Scientific computing
TensorFlow	Machine learning	https://www.tensorflow.org/	Machine learning library
pandas	All chapters	https://pandas.pydata.org/	Data processing
scikit-learn	Machine learning	https://scikit-learn.org	Machine learning library
seaborn	All chapters	https://seaborn.pydata.org/	Statistical chart library

Table 1.1 – Major Python packages that are useful in bioinformatics

We will use pandas to process most table data.

How to do it...

To get started, take a look at the following steps:

Start by downloading the Anaconda distribution from https://www.anaconda.com/products/individual. We will be using version 2024.06, although you will probably be fine with the most recent one. You can accept all the installation’s default settings, but you might want to make sure that the conda binaries are in your path (do not forget to open a new window so that the path can be updated).

As an alternative to downloading from the website, you can use this command:

curl -O https://repo.anaconda.com/archive/Anaconda3-2024.06-1-MacOSX-x86_64.sh

If you have another Python distribution, be careful with PYTHONPATH and existing Python libraries. It’s probably better to unset PYTHONPATH. As much as possible, uninstall all other Python versions and installed Python libraries. These steps will help reduce future confusion about which installation of Python you are pointing to.
Let’s go ahead with the libraries. We will now create a new conda environment called bioinformatics_base with Python 3.12, as shown in the following command (type it in your Terminal):
```
conda create -n bioinformatics_base python=3.12
```
Let’s activate the environment, as follows:
```
conda activate bioinformatics_base
```
Let’s add the bioconda and conda-forge channels to our source list:
```
conda config --add channels bioconda
conda config --add channels conda-forge
```
Note: Conda channels are remote hosting locations that store common packages we may need.
Also, install the basic packages:
```
conda install biopython==1.84 jupyterlab==4.3.0 matplotlib==3.9.2 numpy==2.1.0 pandas==2.2.3 scipy==1.14.1
```
As an alternative to the above, you can also set up your conda environment using a file that specifies the packages needed. It is provided as bioinformatics_base.yml. It is a YAML file; YAML stands for "YAML Ain't Markup Language" (https://yaml.org/). To use the file, run this command:
```
conda env create -f ~/work/CookBook/Ch01/bioinformatics_base.yml
```
This will install the required packages for you.
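Once the packages are installed, you may want to confirm that the environment matches the pinned versions. The following is a minimal sketch using only the standard library; the PINNED dictionary and the check_pins helper are our own names, and we assume the conda package names match the distribution names that importlib.metadata reports:

```python
from importlib import metadata

# Pins copied from the conda install command in this recipe.
PINNED = {
    "biopython": "1.84",
    "jupyterlab": "4.3.0",
    "matplotlib": "3.9.2",
    "numpy": "2.1.0",
    "pandas": "2.2.3",
    "scipy": "1.14.1",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

If nothing is printed, every pinned package is present at the expected version in the environment that runs the code.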

Tip

We often install the latest version of the package by just typing something like conda install biopython, but in this book, we will often do something called “pinning the version.” This means we write an explicit version to help with the reproducibility of the code. We won’t pin the version in every example throughout the book. In most cases, your code should work fine with the latest version. However, we’ll include version pinning where it’s necessary. If any version-specific issues arise in the future, notes will be added to the README.md file for each chapter and in the Updates section of the corresponding notebook.

Now, let’s save our environment so that we can reuse it later to create new environments on other machines, or in case you need to clean up the base environment:
```
conda list -e > bioinformatics_base.txt
```
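The exported file is plain text: comment lines start with #, and each remaining line has the form name=version=build. As a rough sketch of how you might inspect such a file from Python, here is a small parser; the function name parse_conda_export and the sample build strings are hypothetical and only serve as an illustration:

```python
def parse_conda_export(text):
    """Map package name -> version from `conda list -e` style text."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        parts = line.split("=")
        if len(parts) >= 2:
            packages[parts[0]] = parts[1]
    return packages


# Hypothetical sample; the build strings are invented for illustration.
sample = """\
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
biopython=1.84=py312_0
numpy=2.1.0=py312_0
"""
print(parse_conda_export(sample))
```

To parse your real export, read the file contents with open("bioinformatics_base.txt").read() and pass them to the function.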

Tip

On the left side of your Terminal, you will see what Anaconda environment you are in so you can always tell where you are at. For instance, right now, it should say (bioinformatics_base).
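You can also check the active environment from Python. When an environment is activated, conda sets the CONDA_DEFAULT_ENV environment variable, so a minimal sketch (the helper name active_conda_env is ours) looks like this:

```python
import os


def active_conda_env(environ=os.environ):
    """Return the active conda environment name, or None if none is active."""
    return environ.get("CONDA_DEFAULT_ENV")


print(active_conda_env())
```

Inside a notebook, this tells you which environment the kernel was launched from, which is handy when the Terminal prompt is not visible.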

One thing that can be confusing is that running the python -V command in this environment could show an older version. This is because Python 3 is referred to via the python3 command. To fix this, you can alias the python command. Typically, it is easiest to put the alias in your shell startup file, which is run every time you open a Terminal window. On Linux, this is typically .bashrc, but on macOS, you will use the .zshrc file (often pronounced z-shark).

Solution: Open your ~/.zshrc file in a text editor

Add the following line to the end of the file:

alias python=python3

Now save it.

To run it, you can type source ~/.zshrc.

Now, when you run python -V or python --version, you should see that it is 3.12. If you are in a notebook and want to double-check your version, you can run ! python -V in a cell.
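If you are ever unsure which Python installation you are pointing to, you can ask the interpreter directly. The following sketch gathers the version, the implementation (we expect CPython), the path of the executable, and the value of PYTHONPATH; the function name interpreter_report is our own:

```python
import os
import platform
import sys


def interpreter_report():
    """Collect basic facts about the Python interpreter running this code."""
    return {
        "version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "executable": sys.executable,
        "pythonpath": os.environ.get("PYTHONPATH"),  # None if unset
    }


report = interpreter_report()
for key, value in report.items():
    print(f"{key}: {value}")

if sys.version_info[:2] != (3, 12):
    print("Note: the code in this book was developed with Python 3.12.")
```

If the executable path does not sit inside your bioinformatics_base environment, the alias or your PATH is probably pointing at a different installation.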

There’s more...

If you prefer not to use Anaconda, you will be able to install many of the Python libraries via pip using whatever distribution you choose. You can go through the book and keep installing packages in bioinformatics_base if you want. But you may, at times, find that you want to create an environment specific to a particular chapter to help isolate any complexity in package installations. Let’s look at how to do that real quick:

For example, imagine you want to create an environment for machine learning with scikit-learn. You can do the following:

First, we need to deactivate our current environment:
```
conda deactivate
```
Create a clone of the original environment with the following:
```
conda create -n scikit-learn --clone bioinformatics_base
```

Add scikit-learn:

conda activate scikit-learn
conda install scikit-learn

Installing the required software with Docker

Docker is the most widely used framework for implementing operating system-level virtualization. This technology allows you to have an independent container: a layer that is lighter than a virtual machine but still allows you to compartmentalize software. This mostly isolates all processes, making it feel like each container is a virtual machine. Containers will be discussed in more detail in Chapter 14, Cloud Basics.

Docker works quite well at both extremes of the development spectrum: it’s an expedient way to set up the content of this book for learning purposes and could become your platform of choice for deploying your applications in complex environments.

Conda and Docker are key tools to help maintain software compatibility and reproducibility across different systems and libraries. We’ll discuss reproducibility more in Chapter 15, Workflow Systems.

Note

This recipe is an alternative to the previous recipe. Normally, if you have a Mac and are using it for your Jupyter notebooks, you will not need the Docker container. If you have a Windows machine or cannot get certain code to work in your environment, the Docker container can be useful to provide you with an environment that is set up properly already for you.

Getting ready

The first thing you have to do is install Docker. Go to https://www.docker.com/. Install Docker Desktop for your appropriate operating system (remember to check the Apple versus Intel silicon discussion in the Technical requirements section if you are using macOS). You’ll also need to sign up for a Docker account and record your username and password.

How to do it...

Docker Desktop must be running and you need to be signed in before downloading the Docker file. To get started, follow these steps:

Use the following command from your Terminal:

docker build -t bio https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Fourth-Edition.git#main:docker/main

Tip

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository. Doing so will help you avoid any potential errors related to the copying and pasting of code.

You can find the commands for this section in the chapter’s README.md file.

Now you are ready to run the container, as follows:

docker run -ti -p 9875:9875 -v YOUR_DIRECTORY:/data bio

Replace YOUR_DIRECTORY with a directory on your operating system. This will be shared between your host operating system and the Docker container. YOUR_DIRECTORY will be seen in the container in /data and vice versa.
In this case, -p 9875:9875 will expose the container’s TCP port 9875 on the host computer port, 9875.
Especially on Windows (and maybe on macOS), make sure that your directory is actually visible inside the Docker shell environment. If not, check the official Docker documentation on how to expose directories. To access the Docker image while it’s running, hover over the Docker Desktop icon. All the files available in the book’s GitHub repository will be mirrored in the Docker image.
Now you are ready to use the system. Point your browser to http://localhost:9875 and you should get the Jupyter environment.

If this does not work on Windows, check the official Docker documentation (https://docs.docker.com/manuals/) on how to expose ports.
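If the browser cannot reach the page, it helps to know whether anything is listening on the port at all. Here is a minimal sketch that tries to open a TCP connection to port 9875 on the host; the helper name port_is_open is ours, and the check only tells you that something accepts connections, not that it is Jupyter:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if port_is_open("localhost", 9875):
    print("Something is listening on port 9875; try http://localhost:9875")
else:
    print("Nothing is listening on port 9875; is the container running?")
```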

Introduction to Jupyter Notebook

All of our work will be developed inside Jupyter Notebook. Jupyter has become the de facto standard for writing interactive data analysis scripts. Unfortunately, the default format for Jupyter notebooks is based on JSON. JSON is JavaScript Object Notation (https://www.json.org/json-en.html). This format is difficult to read, difficult to compare, and needs exporting to be fed into a normal Python interpreter. To obviate that problem, we will extend Jupyter with jupytext (https://jupytext.readthedocs.io/), which allows us to save Jupyter notebooks as normal Python programs. We will start with an overview of Jupyter Notebook, and then look into jupytext. Recall that we installed Jupyter Notebook in the first recipe of this chapter, when we installed the jupyterlab package using conda.
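To see why the JSON format is awkward to read and compare, consider this small sketch. It builds a tiny hand-written notebook in the nbformat 4 layout, serializes it to JSON, and then pulls the code cells back out. This is only an illustration of the file structure, not how jupytext works internally, and the helper name code_cells is our own:

```python
import json

# A minimal, hand-written notebook in the nbformat 4 layout (illustrative).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Welcome"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["seq = 'ACGT'\n", "len(seq)"],
        },
    ],
}


def code_cells(nb):
    """Return the source of each code cell as a single string."""
    return [
        "".join(cell["source"])
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
    ]


as_json = json.dumps(notebook, indent=1)  # roughly what an .ipynb file contains
print(as_json[:80], "...")
print(code_cells(json.loads(as_json)))
```

Even for two lines of code, the JSON carries metadata, outputs, and execution counts, which is what makes plain diffs of .ipynb files noisy.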

How to do it…

To run Jupyter, on the Terminal, type the following:
```
jupyter notebook
```
This will open the Jupyter browser, and you will see a home page that looks something like this:

Figure 1.1 – The Jupyter browser home page

This home page gives you an overview of your files, so you can open, rename, and download them, and so on.

Let’s click on one of the files and open it. We will see something like this:

Figure 1.2 – An example of a notebook

Here, we see a menu that allows us to save or download files and perform other actions. Each cell can be executed by clicking the play button. You can also run multiple cells. When you run a cell, you will see its output below.

In some cases, you may need to restart your kernel; to do so, use the Kernel | Restart Kernel... menu option.

Jupyter notebook resources

This would be a good time to pause and take some time to learn more about Jupyter notebooks. There are numerous keyboard shortcuts that are worth learning to speed up your development:

Tutorial: https://www.datacamp.com/tutorial/tutorial-jupyter-notebook

Keyboard Shortcuts: https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330

Now that we have set up our Jupyter Notebook environment, let’s take a look at a handy tool called Jupytext.

Jupytext

Sometimes, you will want to convert your notebooks into formats other than .ipynb – for example, you might want to get them into .py format. For this, we can use jupytext (https://github.com/mwouts/jupytext). This handy plugin will allow us to save Jupyter notebooks in formats other than .ipynb. Remember to get out of the Jupyter browser first. To do this on a Mac, close the Jupyter browser window, then go to the Terminal where you started it and press Ctrl + C to stop the server (Ctrl + Z would only suspend the process, not end it).

To install jupytext, we will run the following:

pip install jupytext

Now let’s start up the Jupyter browser again:

jupyter notebook

Now that we’ve launched the Jupyter browser again, open the Welcome notebook. Go to the File | Jupytext menu. Here is what it looks like:

Figure 1.3 – The Jupytext menu within the Jupyter browser

To save your notebook in another format, you pair it and choose a format. For instance, if you choose to pair it with the light format, you will get a regular Python (.py) formatted file in your current working directory.

Here is what our Welcome.py file looks like in our working directory when paired with the light format:

Figure 1.4 – The Welcome.py notebook in the Light format produced by Jupytext

There are several other popular formats supported by Jupytext. You can read more about them here: https://jupytext.readthedocs.io/en/latest/index.html.

Warning

Remember that the recipes in this book are normally meant to be run inside Jupyter notebooks. This means that we will not always use print to output content. In a notebook, if the last line of a cell is simply the name of a variable, running the cell displays its value for you. If you are not using notebooks (e.g., you are writing Python scripts and executing them from the Terminal), you may want to add print statements to your code. Even within a notebook, you may find it useful at times to add your own print statements to inspect variables and debug code.
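As a small illustration of the difference, consider the following sketch. The variable name seq is arbitrary:

```python
seq = "ACGT"

# In a notebook, ending a cell with a bare expression displays its repr:
#     seq
# shows 'ACGT' (with quotes). In a script, that line produces no output.

print(seq)        # prints the text itself: ACGT
print(repr(seq))  # prints what a notebook cell would display: 'ACGT'
```

So when you move a recipe from a notebook into a script, wrap the final expression of each cell in print if you still want to see its value.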

In addition to the Jupyter browser, there is a more integrated environment called JupyterLab: https://jupyterlab.readthedocs.io/en/latest/. It allows you to run Terminals and other widgets inside the same environment as your notebook. To get to it, you can click View | Open JupyterLab. You can check it out if you are interested, but it is not necessary to get through the book.

A welcome notebook called Welcome.ipynb has been placed in the GitHub repository for this book in the Ch01 folder. You can use it to test out your notebook environment. This notebook also contains many handy links to help you learn Python and explore bioinformatics!

To recap everything, here are your main options for performing the recipes in this book:

System	Components	Pros	Cons
MacBook Pro Laptop	Anaconda; pip; brew; Jupyter	Best system for compatibility and ease of use	You may not own one
Mac Cloud Workstation or Mac AWS EC2 Instance	Anaconda; pip; brew; Jupyter	Convenient solution; identical to Mac laptop	May incur some costs
Windows Machine + Docker	Docker	Portable solution	Some increased overhead
Windows + VirtualBox or WSL2	Anaconda; pip; brew; Jupyter	Lets you interact with a Linux OS	Some installation or compatibility issues may arise
Linux Machine	Anaconda; pip; brew; Jupyter	Lets you interact with a Linux OS	Some installation or compatibility issues may arise

Table 1.2 – System and OS options for use with this book
