Subspecies of the human gut microbiota carry implicit information for in-depth microbiome research
This repository provides all data and code needed to reproduce the results from our publication.
The code is organized into:
Snakemake pipelines used in the study:
- `Quality-filtering` – MAG filtering using GUNC and BUSCO
- `sourmash-compare` – Subspecies delineation
- `use_catalog` – Subspecies quantification from metagenomes
- `assign-subspecies` – Assigning new genomes to HuMSub clusters
- `get_specific_sequences` – Identifying subspecies-specific genes
Jupyter notebooks for:
- Statistical and machine learning analyses
- Figure generation
Notebooks are grouped by topic, matching the publication’s structure. Each includes all necessary data for re-execution.
The catalogue is available in two prebuilt sourmash SBT index formats:
| File | Use Case |
|---|---|
| `HuMSub_51_1000.sbt.zip` | General subspecies quantification (k=51) |
| `HuMSub_21_1000.sbt.zip` | Mastiff database queries (k=21) |
Download from Zenodo:
https://zenodo.org/records/15862096
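As a rough illustration, the download can be scripted. The sketch below assumes Zenodo's usual direct-download layout (`/records/<id>/files/<name>?download=1`) and a `resources/` target directory. The helper names `zenodo_file_url` and `download` are hypothetical and are not part of this repository.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve

ZENODO_RECORD = "15862096"  # record ID taken from the URL above


def zenodo_file_url(filename, record=ZENODO_RECORD):
    """Build a direct-download URL (assumed layout: /records/<id>/files/<name>)."""
    return f"https://zenodo.org/records/{record}/files/{quote(filename)}?download=1"


def download(filename, dest_dir="resources"):
    """Fetch one file into dest_dir, skipping files that are already present."""
    target = Path(dest_dir) / filename
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(zenodo_file_url(filename), target)
    return target
```

For example, `download("HuMSub_51_1000.sbt.zip")` would fetch the k=51 index if the assumed URL layout holds. The index files may be large, so check available disk space first.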
For benchmarking and testing:
| File | Description |
|---|---|
| `humgut_samples.tar.gz` | Simulated paired-end reads from HumGut genomes |
| `new_samples.tar.gz` | Simulated paired-end reads from genomes outside of HumGut |
Corresponding taxonomic distributions are available in Scripts/benchmark/.
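One way to score a benchmark run is to compare the estimated abundances with the known distribution of the simulated sample. The sketch below uses Bray-Curtis dissimilarity on relative abundances. This metric is an illustrative choice and is not necessarily the one used in the study, and the dictionary-based input format is an assumption.

```python
def normalize(counts):
    """Convert raw abundances to relative abundances that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return {taxon: value / total for taxon, value in counts.items()}


def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two abundance profiles (0 = identical, 1 = disjoint)."""
    e, o = normalize(expected), normalize(observed)
    taxa = set(e) | set(o)
    # For profiles that each sum to 1, Bray-Curtis equals 1 minus the shared abundance.
    shared = sum(min(e.get(t, 0.0), o.get(t, 0.0)) for t in taxa)
    return 1.0 - shared
```

Taxa missing from one profile count as zero abundance there, so false positives and false negatives both increase the dissimilarity.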
The HuMSub catalogue includes genomes from some non-gut-associated phyla (e.g., Elusimicrobiota, Eremiobacteriota, Patescibacteria) retained from the original HumGut reference.
Although these phyla were not detected in the CRC datasets, they were retained for completeness on the basis of their genome quality scores. Interpret any assignments to them with caution in downstream analyses.
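To make such hits easy to review, results can be split by phylum before interpretation. The sketch below assumes each result row carries a GTDB-style lineage string (for example `d__Bacteria;p__Patescibacteria;...`) under a `lineage` key. Both the key name and the lineage format are assumptions, so adapt them to the actual output columns.

```python
# Phyla named in the note above; extend the set if needed.
NON_GUT_PHYLA = {"Elusimicrobiota", "Eremiobacteriota", "Patescibacteria"}


def phylum_of(lineage):
    """Return the phylum from a GTDB-style lineage string (assumed format), or None."""
    for rank in lineage.split(";"):
        rank = rank.strip()
        if rank.startswith("p__"):
            return rank[3:]
    return None


def flag_non_gut(rows, lineage_key="lineage"):
    """Split rows into (kept, flagged), where flagged rows belong to NON_GUT_PHYLA."""
    kept, flagged = [], []
    for row in rows:
        if phylum_of(row[lineage_key]) in NON_GUT_PHYLA:
            flagged.append(row)
        else:
            kept.append(row)
    return kept, flagged
```

Flagged rows are returned rather than discarded, so they can still be inspected or reported separately.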
- Install Snakemake (https://snakemake.readthedocs.io) version 7
- Clone this repository
- Modify the appropriate `config.yaml` for each workflow
- Run with: `snakemake -s workflow/Snakefile --configfile config/config.yaml --use-conda --cores 4`

This directory lists essential external files required to reproduce the results of the HuMSub study. These resources are hosted externally due to their size and licensing constraints.
- File: `All_genomes.tsv` in Scripts/benchmark
- Description: Metadata and download links for all genomes in the HumGut catalog
- Zenodo Record: https://zenodo.org/records/15862096
This includes:
- `HuMSub_51_1000.sbt.zip` – k=51, scaled=1000 (subspecies quantification)
- `HuMSub_21_1000.sbt.zip` – k=21, scaled=1000 (Mastiff queries)
- `humgut_samples.tar.gz` – simulated reads from HumGut genomes
- `new_samples.tar.gz` – simulated reads from genomes outside HumGut
After download, place files into the appropriate subdirectories (e.g. resources/, test_data/).
Some files may be automatically downloaded by Snakemake rules if not found in the expected locations. Refer to the main README.md for pipeline instructions.
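Placing the downloaded files can be scripted as well. The sketch below is a minimal example: the mapping of index files to `resources/` and simulated reads to `test_data/` is an assumption based on the directory names mentioned above, and `place_downloads` is a hypothetical helper. Check each workflow's `config.yaml` for the paths it actually expects.

```python
import shutil
from pathlib import Path

# Assumed file-to-directory mapping; verify against each workflow's config.yaml.
DESTINATIONS = {
    "HuMSub_51_1000.sbt.zip": "resources",
    "HuMSub_21_1000.sbt.zip": "resources",
    "humgut_samples.tar.gz": "test_data",
    "new_samples.tar.gz": "test_data",
}


def place_downloads(download_dir, repo_root):
    """Move known files from download_dir into repo_root; return names still missing."""
    missing = []
    for name, subdir in DESTINATIONS.items():
        target = Path(repo_root) / subdir / name
        if target.exists():
            continue  # already in place
        source = Path(download_dir) / name
        if not source.exists():
            missing.append(name)
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(target))
    return missing
```

The returned list names the files that were found neither in the download directory nor at their target location, which gives a quick check before starting a pipeline.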