Lucas De Vrieze edited this page Jan 15, 2026 · 11 revisions

The problem

Genome mining tools benefit greatly from the size and diversity of their target databases. However, current large target databases (such as NCBI RefSeq) often contain a high level of redundancy. Popular BLAST-based genome mining tools such as cblaster and CAGECAT propagate this redundancy into their hit sets, which tends to overcomplicate downstream analyses, such as visualisations using clinker. These redundant hit sets are often filtered manually or by gene cluster-level clustering using BiG-SCAPE. However, manual curation is often too crude and cuts away gene cluster diversity, while curation using BiG-SCAPE ignores horizontal gene transfer (HGT) events and other evolutionary events in which the gene cluster evolves differently from the overall genome or the direct genomic neighbourhood.

The solution

CAGEcleaner is a redundancy removal tool for gene clusters that dereplicates hits at the full genome level or (in development) at the level of the direct genomic neighbourhood of the hit. In addition, it can recover the more diverse hits from non-representative assemblies based on an assessment of gene cluster contents and homology scores.

More specifically, in full genome mode, it retrieves the full genome assemblies of the hits' host genomes, and then performs a fast ANI-based full genome dereplication using skDER. Finally, it couples the dereplicated assemblies back to their corresponding gene cluster hits.
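The final coupling step can be pictured as a simple filter. The following is a minimal sketch, not CAGEcleaner's actual code: the `Hit` record, its field names, the placeholder accessions, and the `keep_representative_hits` helper are all hypothetical, and the set of representative assemblies is assumed to have been produced already by a dereplication tool such as skDER.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical gene cluster hit record (field names are illustrative)."""
    cluster_id: str
    assembly: str  # accession of the host genome assembly (placeholder values below)
    score: float   # homology score reported by the mining tool


def keep_representative_hits(hits, representatives):
    """Keep only the hits whose host assembly survived dereplication."""
    representatives = set(representatives)
    return [hit for hit in hits if hit.assembly in representatives]


hits = [
    Hit("cluster_1", "GCF_A", 12.0),
    Hit("cluster_2", "GCF_B", 11.5),
    Hit("cluster_3", "GCF_C", 9.0),
]

# Assumed dereplication outcome: GCF_B was found redundant with GCF_A.
kept = keep_representative_hits(hits, {"GCF_A", "GCF_C"})
print([hit.cluster_id for hit in kept])
```

In this toy run, only the hits hosted by the two representative assemblies remain, in their original order.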

In region mode, instead of dereplicating the full genome assemblies, it extracts each hit from its assembly together with an additional sequence margin on both sides, and then dereplicates the extracted nucleotide sequences using MMseqs2.
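The extraction with a margin can be illustrated as follows. This is a sketch under stated assumptions rather than CAGEcleaner's implementation: the `extract_with_margin` helper is hypothetical, coordinates are assumed to be 0-based and half-open, and the margin is assumed to be truncated when it would run past a contig edge.

```python
def extract_with_margin(contig, start, end, margin):
    """Return the hit region plus flanking margin and the clamped coordinates.

    Coordinates are 0-based and half-open, i.e. contig[start:end] is the hit.
    """
    if not 0 <= start < end <= len(contig):
        raise ValueError("hit coordinates fall outside the contig")
    if margin < 0:
        raise ValueError("margin must be non-negative")
    # Clamp the extended region to the contig boundaries.
    region_start = max(0, start - margin)
    region_end = min(len(contig), end + margin)
    return contig[region_start:region_end], region_start, region_end


contig = "AAAACCCCGGGGTTTT"  # toy 16-nt contig

# Hit in the middle of the contig: the full margin fits on both sides.
region, region_start, region_end = extract_with_margin(contig, 6, 10, margin=3)
print(region, region_start, region_end)

# Hit near the contig start: the left margin is truncated, not wrapped around.
edge_region, edge_start, edge_end = extract_with_margin(contig, 1, 4, margin=3)
print(edge_region, edge_start, edge_end)
```

The extracted sequences, rather than the whole assemblies, would then be the input to the sequence clustering step.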

Some caveats

CAGEcleaner does not come with its own genome QC module. Low-quality assemblies may be retained unnecessarily, resulting in a less efficient dereplication.

CAGEcleaner has primarily been developed as an auxiliary tool to be used together with cblaster and/or CAGECAT. As such, it expects and produces files that can be processed by these tools. In principle, it should be able to process output from other tools if that output has been converted to the cblaster binary table format.

Where to begin?

Head over to the Installation page to get CAGEcleaner up and running. The How-to page will then show you how to clean up your hit sets.

Citations

If you found CAGEcleaner useful, please cite our manuscript:

De Vrieze, L., Biltjes, M., Lukashevich, S., Tsurumi, K., Masschelein, J. (2025) CAGEcleaner: reducing genomic redundancy in gene cluster mining. Bioinformatics, Volume 41, Issue 7, https://doi.org/10.1093/bioinformatics/btaf373

CAGEcleaner relies heavily on the skDER genome dereplication tool and its main dependency skani, and uses MMseqs2 in region mode, so please give these proper credit as well.

Salamzade, R., & Kalan, L. R. (2025). skDER and CiDDER: two scalable approaches for microbial genome dereplication. Microbial Genomics, 11(7). https://doi.org/10.1099/mgen.0.001438
Shaw, J., & Yu, Y. W. (2023). Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods, 20(11), 1661–1665. https://doi.org/10.1038/s41592-023-02018-3
Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35. https://doi.org/10.1038/nbt.3988
