Soil Metagenome Pipeline
Soil Metagenome Pipeline is a modular, Nextflow DSL2 workflow for assembling, polishing, binning, annotating, and functionally characterizing complex soil metagenomes. It orchestrates state-of-the-art tools for long- and short-read metagenomics, generates high-quality MAGs, assigns taxonomy, and screens for biosynthetic gene clusters (BGCs).
What it does
- Assembles long-read metagenomes (e.g., ONT) with Flye and optionally polishes the assembly with Medaka and/or with NextPolish using short reads.
- Maps short/long reads to assemblies to compute coverage/depth for downstream binning and QC.
- Bins contigs with multiple strategies (SemiBin2, VAMB, MetaCoAG, COMEBin) and can integrate results.
- Evaluates MAG quality with CheckM2 and assigns taxonomy with GTDB-Tk.
- Annotates bins and/or assemblies (Bakta, eggNOG) and detects BGCs (antiSMASH) with network-based clustering (BiG-SCAPE).
- Produces organized outputs suitable for downstream comparative genomics.
Key features
- Modular DSL2 design: swap/extend modules under `modules/` and `submodules/`.
- Reproducible runtime via Conda/containers (profiles in `conf/`).
- Sensible defaults with overridable parameters via `nextflow.config` or CLI.
- Caching and resumability: supports `-resume` for efficient re-runs.
Modules at a glance (non-exhaustive)
- Assembly and polishing: Flye, Medaka, NextPolish
- Coverage mapping: minimap2/samtools, CoverM, strobealign
- Binning: SemiBin2, VAMB, MetaCoAG, COMEBin, plus bin collection utilities
- QC and taxonomy: CheckM2, GTDB-Tk
- Annotation and function: Bakta (assemblies/bins), eggNOG
- BGC discovery: antiSMASH (assemblies/bins), BiG-SCAPE networks
- Taxonomic profiling: MMseqs2/Metabuli helpers
Inputs
- Reads: long reads (ONT/PacBio), optional short reads (Illumina).
- Sample sheet: a tab-separated file like `data/samples.tsv` describing sample IDs and file paths.
- Reference databases: external DBs required by some tools (e.g., GTDB-Tk, antiSMASH, BiG-SCAPE) are not bundled. Configure their locations via params or environment as appropriate.
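Because the sample sheet carries file paths, a quick pre-flight check can catch missing inputs before a long run starts. The sketch below is illustrative only: the column layout (a header row, sample ID in column 1, long-read path in column 2) and the function name `check_samplesheet` are assumptions, not the pipeline's documented format, so adapt them to the actual `data/samples.tsv`.

```shell
# Sketch: verify that every long-read file listed in a sample sheet exists.
# ASSUMPTION: header row, column 1 = sample ID, column 2 = long-read path.
check_samplesheet() {
    sheet="$1"
    missing=0
    tab=$(printf '\t')
    {
        # Discard the header line.
        read -r _header || return 1
        # Read tab-separated fields; also handle a final line without a newline.
        while IFS="$tab" read -r sample longreads rest || [ -n "$sample" ]; do
            [ -z "$sample" ] && continue   # skip blank lines
            if [ ! -f "$longreads" ]; then
                echo "missing long reads for ${sample}: ${longreads}" >&2
                missing=$((missing + 1))
            fi
        done
    } < "$sheet"
    # Succeed only if no listed file was missing.
    [ "$missing" -eq 0 ]
}

# Demo on a throwaway sheet (all names here are hypothetical).
demo_dir=$(mktemp -d)
touch "$demo_dir/s1.fastq.gz"
printf 'sample\tlong_reads\n' > "$demo_dir/samples.tsv"
printf 'S1\t%s\n' "$demo_dir/s1.fastq.gz" >> "$demo_dir/samples.tsv"
printf 'S2\t%s\n' "$demo_dir/absent.fastq.gz" >> "$demo_dir/samples.tsv"
check_samplesheet "$demo_dir/samples.tsv" || echo "sample sheet has missing files"
```

In the demo, the second sample points at a file that does not exist, so the function reports it on stderr and returns a non-zero status.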
Quick start
- Dry run / graph preview: `nextflow run . -dsl2 -preview`
- Example execution (adjust paths and profile to your environment):

    nextflow run . -profile conda -resume \
      --reads '/path/to/_{R1,R2}.fastq.gz' \
      --longreads '/path/to/.fastq.gz' \
      --samples 'data/samples.tsv' \
      --outdir 'results'

See `conf/` for example profiles (conda, docker, singularity, slurm). Tune resources via `nextflow.config` using `withName:` blocks for process-specific CPU, memory, and time.
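As a rough illustration of process-specific tuning, the sketch below writes a small override file containing two `withName:` blocks. The process names `FLYE` and `GTDBTK`, the file name `custom_resources.config`, and the resource values are placeholders chosen for this example; check the pipeline's modules for the real process names before using it.

```shell
# Sketch: write a resource override config for Nextflow.
# ASSUMPTION: 'FLYE' and 'GTDBTK' are hypothetical process names.
cat > custom_resources.config <<'EOF'
process {
    // Hypothetical assembly process: replace with the real process name.
    withName: 'FLYE' {
        cpus   = 16
        memory = '64 GB'
        time   = '24h'
    }
    // Hypothetical taxonomy process: replace with the real process name.
    withName: 'GTDBTK' {
        cpus   = 8
        memory = '128 GB'
        time   = '12h'
    }
}
EOF
echo "wrote custom_resources.config"
```

The file can then be layered on top of the defaults with Nextflow's `-c` option, for example `nextflow run . -profile conda -c custom_resources.config -resume`.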
Citation
If you use Soil Metagenome Pipeline in your research, please cite the corresponding preprint:
- bioRxiv abstract: https://www.biorxiv.org/content/10.1101/2025.05.28.656579v1.abstract
- DOI: https://doi.org/10.1101/2025.05.28.656579
A machine-readable citation file (CITATION.cff) is included in the repository root. GitHub will display a "Cite this repository" button.
License
This project is licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). See the LICENSE file for the full text.
Created: 19th Sep 2025 at 19:34
