Next, we will begin setting up your required software libraries, including Python itself. If you are already using a different Python distribution, you are strongly encouraged to consider Anaconda, as it has become the de facto standard for data science and bioinformatics. Also, it is the distribution that will allow you to install software from bioconda (https://bioconda.github.io/).
Getting ready
Python can be run on top of different environments. For instance, you can use Python inside the Java Virtual Machine (JVM) via Jython, or with .NET via IronPython. However, here, we are concerned not only with Python but also with the complete software ecosystem around it. Therefore, we will use the standard (CPython) implementation, since the JVM and .NET versions exist mostly to interact with the native libraries of these platforms.
For our code, we will be using Python 3.12. If you were starting with Python and bioinformatics, any operating system would work. But here, we are mostly concerned with intermediate to advanced usage, and so we will focus on macOS.
If you are on Windows and do not have easy access to macOS or Linux, don’t worry. Modern virtualization software (such as VirtualBox and Docker) will come to your rescue by allowing you to run a virtual OS on top of your operating system. Another option is to use Windows Subsystem for Linux (WSL2), which allows you to run Linux on Windows. For documentation on WSL2, look here:
Another option for you will be to use a cloud workstation (see the Technical requirements section).
Bioinformatics and data science are moving at breakneck speed; this is not just hype, it’s a reality. When installing software libraries, choosing a version might be tricky. Depending on the code that you have, it might not work with some old versions or perhaps not even work with a newer version. Hopefully, any code that you use will indicate the correct dependencies – though this is not guaranteed. In this book, we will fix the precise versions of all software packages (or provide you with a minimal version, or specify one in the associated chapter YAML file, as appropriate), and we will make sure that the code works with them. Check your chapter’s README.md file or the Updates section of each notebook for more information. It is quite natural that the code might need tweaking with other package versions.
The software developed for this book is available at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-fourth-edition. To access it, you will need to install Git. First, make sure Homebrew is installed (https://brew.sh/), and then install Git:
brew install git
You can go to the GitHub page for the book and get the HTTPS link for downloading the source:
git clone https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Fourth-Edition.git
This will download the source code to your computer.
Before you install the Python stack, you will need to install all of the external non-Python software that you will be interoperating with. The list will vary from chapter to chapter, and all chapter-specific packages will be explained in their respective chapters. Most of the software is available through the conda package manager from the Bioconda channel (https://bioconda.github.io/) or is pip installable (https://pypi.org/project/pip/).
Where possible in this book, we will allow you to do everything from your Jupyter notebook, even installing the software. To do this, we will use the ! command, which allows you to run a command that you would normally run from your Terminal from the notebook instead – for example:
! ls
This will run the ls or list directory command as if it had been run from the Terminal.
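Outside a notebook, the ! shortcut is not available. A rough equivalent in plain Python is the standard-library subprocess module. The following is only a minimal sketch of that idea, not something the book's notebooks rely on; run_command is a hypothetical helper name introduced here for illustration.

```python
import subprocess


def run_command(args):
    """Run a command given as a list of arguments and return its text output."""
    # check=True raises CalledProcessError if the command exits with an error.
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout


# Roughly what "! ls" does in a notebook cell: list the current directory.
listing = run_command(["ls"])
print(listing)
```

Passing the command as a list of arguments avoids invoking a shell, so file names with spaces do not need quoting.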
In some cases, for more involved installations, you will need to go into the Terminal, but we’ll advise you on how to do those steps as we go through the relevant recipes.
You will need to install some development compilers and libraries, all of which are free. On Ubuntu, consider installing the build-essential package (apt-get install build-essential), and on macOS, consider Xcode (https://developer.apple.com/xcode/).
We will mention many amazing Python libraries in this book, but here is a brief overview of some of the most important ones:
Table 1.1 – Major Python packages that are useful in bioinformatics
We will use pandas to process most table data.
How to do it...
To get started, take a look at the following steps:
- Start by downloading the Anaconda distribution from https://www.anaconda.com/products/individual. We will be using version 2024.06, although you will probably be fine with the most recent one. You can accept all the installation’s default settings, but you might want to make sure that the conda binaries are in your path (do not forget to open a new window so that the path can be updated).
- As an alternative to downloading from the website, you can use this command:
curl -O https://repo.anaconda.com/archive/Anaconda3-2024.06-1-MacOSX-x86_64.sh
- If you have another Python distribution, be careful with PYTHONPATH and existing Python libraries. It’s probably better to unset PYTHONPATH. As much as possible, uninstall all other Python versions and installed Python libraries. These steps will help reduce future confusion about which installation of Python you are pointing to.
- Let’s go ahead with the libraries. We will now create a new conda environment called bioinformatics_base with Python 3.12, as shown in the following command (type it in your Terminal):
conda create -n bioinformatics_base python=3.12
- Let’s activate the environment, as follows:
conda activate bioinformatics_base
- Let’s add the bioconda and conda-forge channels to our source list:
conda config --add channels bioconda
conda config --add channels conda-forge
Note: Conda channels are remote hosting locations that store common packages we may need.
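- Also, install the basic packages (the leading ! is needed only when you run the command from a notebook cell; omit it in the Terminal):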
! conda install biopython==1.84 jupyterlab==4.3.0 matplotlib==3.9.2 numpy==2.1.0 pandas==2.2.3 scipy==1.14.1
As an alternative to the above, you can also set up your conda environment using a file that specifies the packages needed. It is provided as bioinformatics_base.yml. It is a YAML file; YAML stands for "YAML Ain't Markup Language" (https://yaml.org/). To use the file, run this command:
conda env create -f ~/work/CookBook/Ch01/bioinformatics_base.yml
This will install the required packages for you.
Tip
We often install the latest version of the package by just typing something like conda install biopython, but in this book, we will often do something called “pinning the version.” This means we write an explicit version to help with the reproducibility of the code. We won’t pin the version in every example throughout the book. In most cases, your code should work fine with the latest version. However, we’ll include version pinning where it’s necessary. If any version-specific issues arise in the future, notes will be added to the README.md file for each chapter and in the Updates section of the corresponding notebook.
- Now, let’s save our environment so that we can reuse it later to create new environments on other machines or if you need to clean up the base environment:
conda list -e > bioinformatics_base.txt
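If you want to inspect the saved file programmatically, the sketch below parses it into a dictionary of package names and versions. It assumes the export format that conda list -e usually produces: comment lines starting with # followed by one name=version=build entry per line. The function name parse_conda_export and the sample text (including its build strings) are made up for illustration.

```python
def parse_conda_export(text):
    """Map package names to versions from 'conda list -e' style text."""
    packages = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and the comment header that conda writes.
        if not line or line.startswith("#"):
            continue
        # Assumed entry layout: name=version=build (build may be absent).
        parts = line.split("=")
        packages[parts[0]] = parts[1] if len(parts) > 1 else None
    return packages


# Made-up sample mimicking the assumed format; build strings are placeholders.
sample = """# platform: osx-64
biopython=1.84=build_0
numpy=2.1.0=build_0
"""

print(parse_conda_export(sample))
```

To use it on the real file, read bioinformatics_base.txt with open(...).read() and pass the text to the function.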
Tip
On the left side of your Terminal, you will see what Anaconda environment you are in so you can always tell where you are at. For instance, right now, it should say (bioinformatics_base).
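With the environment active, you may also want to confirm that the pinned packages were installed at the expected versions. The following is a minimal sketch using the standard-library importlib.metadata module; check_pins and PINNED are hypothetical names, and the versions simply mirror the install command shown in the steps above.

```python
from importlib import metadata

# Versions pinned in the install step; adjust if your chapter pins others.
PINNED = {
    "biopython": "1.84",
    "jupyterlab": "4.3.0",
    "matplotlib": "3.9.2",
    "numpy": "2.1.0",
    "pandas": "2.2.3",
    "scipy": "1.14.1",
}


def check_pins(pins):
    """Return {package: (expected, installed)}; installed is None if missing."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


for name, (expected, installed) in check_pins(PINNED).items():
    status = "ok" if installed == expected else "differs"
    print(f"{name}: expected {expected}, found {installed} ({status})")
```

A "differs" status is not necessarily a problem; as noted in the earlier Tip, most code should work with newer versions.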
One thing that can be confusing is that using the python -V command in this environment could show an older version. This is because Python 3 is referred to via the python3 command. To fix this, you want to alias the python command. Typically, it is easiest to put this in your shell startup file, which is a file that is always run when you open a Terminal window. On Linux, this is typically .bashrc, but on macOS, you will use the .zshrc file (often pronounced z-shark).
Solution: Open your ~/.zshrc file in a text editor.
Add the following line to the end of the file:
alias python=python3
Now save it.
To run it, you can type source ~/.zshrc.
Now, when you run python -V or python --version, you should see that it is 3.12. If you are in a notebook and want to double-check your version, you can run ! python -V in a cell.
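The same check can be made from inside Python, which also shows exactly which interpreter is running. This is a minimal sketch; describe_python is a hypothetical helper, and it assumes that conda activate sets the CONDA_DEFAULT_ENV environment variable, as current conda releases do.

```python
import os
import sys


def describe_python(environ=None):
    """Summarise the running interpreter and relevant environment variables."""
    environ = os.environ if environ is None else environ
    return {
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Set by "conda activate"; None when no conda environment is active.
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),
        # Ideally unset, as advised in the steps above.
        "pythonpath": environ.get("PYTHONPATH"),
    }


for key, value in describe_python().items():
    print(f"{key}: {value}")
```

If the executable path does not point inside your Anaconda installation, another Python installation is taking precedence on your path.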
There’s more...
If you prefer not to use Anaconda, you will be able to install many of the Python libraries via pip using whatever distribution you choose. You can go through the book and keep installing packages in bioinformatics_base if you want. But you may, at times, find that you want to create an environment specific to a particular chapter to help isolate any complexity in package installations. Let’s look at how to do that.
For example, imagine you want to create an environment for machine learning with scikit-learn. You can do the following:
- First, we need to deactivate our current environment:
conda deactivate
- Create a clone of the original environment with the following:
conda create -n scikit-learn --clone bioinformatics_base
- Add scikit-learn:
conda activate scikit-learn
conda install scikit-learn
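If you create chapter-specific environments often, the clone-and-install steps can be generated from a small helper. The sketch below only builds the command lines and prints them; it does not run conda. The name chapter_env_commands is hypothetical, and the sketch uses conda's -n option to target the new environment directly (instead of activating it) and -y to skip the confirmation prompt.

```python
def chapter_env_commands(new_env, packages, base_env="bioinformatics_base"):
    """Build the conda commands that clone base_env and add packages to the clone."""
    return [
        ["conda", "create", "-y", "-n", new_env, "--clone", base_env],
        # "-n" targets the new environment directly, so no activation is needed.
        ["conda", "install", "-y", "-n", new_env, *packages],
    ]


for command in chapter_env_commands("scikit-learn", ["scikit-learn"]):
    print(" ".join(command))
```

Each inner list can be passed to subprocess.run if you prefer to execute the commands from Python rather than paste them into the Terminal.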
See Also