# Soil Metagenome Pipeline

Soil Metagenome Pipeline is a modular, Nextflow DSL2 workflow for assembling, polishing, binning, annotating, and functionally characterizing complex soil metagenomes. It orchestrates state-of-the-art tools for long- and short-read metagenomics, generates high-quality MAGs, assigns taxonomy, and screens for biosynthetic gene clusters (BGCs).

## What it does

- Assembles long-read metagenomes (e.g., ONT) with Flye and optionally polishes with Medaka and/or NextPolish using short reads.
- Maps short/long reads to assemblies to compute coverage/depth for downstream binning and QC.
- Bins contigs with multiple strategies (SemiBin2, VAMB, MetaCoAG, ComeBIN) and can integrate results.
- Evaluates MAG quality with CheckM2 and assigns taxonomy with GTDB-Tk.
- Annotates bins and/or assemblies (Bakta, eggNOG) and detects BGCs (antiSMASH) with network-based clustering (BiG-SCAPE).
- Produces organized outputs suitable for downstream comparative genomics.

## Key features

- Modular DSL2 design: swap/extend modules under `modules/` and `submodules/`.
- Reproducible runtime via Conda/containers (profiles in `conf/`).
- Sensible defaults with overridable parameters via `nextflow.config` or CLI.
- Caching and resumability: supports `-resume` for efficient re-runs.

## Modules at a glance (non-exhaustive)

- Assembly and polishing: Flye, Medaka, NextPolish
- Coverage mapping: minimap2/samtools, coverm, strobealign
- Binning: SemiBin2, VAMB, MetaCoAG, ComeBIN, plus bin collection utilities
- QC and taxonomy: CheckM2, GTDB-Tk
- Annotation and function: Bakta (assemblies/bins), eggNOG
- BGC discovery: antiSMASH (assemblies/bins), BiG-SCAPE networks
- Taxonomic profiling: MMseqs2/MetaBuli helpers

## Inputs

- Reads: long reads (ONT/PacBio), optional short reads (Illumina).
- Sample sheet: a tab-separated file like `data/samples.tsv` describing sample IDs and file paths.
- Reference databases: external DBs required by some tools (e.g., GTDB-Tk, antiSMASH, BiG-SCAPE) are not bundled. Configure their locations via params or environment as appropriate.

## Quick start

- Dry run / graph preview:

      nextflow run . -dsl2 -preview

- Example execution (adjust paths and profile to your environment):

      nextflow run . -profile conda -resume \
        --reads '/path/to/*_{R1,R2}.fastq.gz' \
        --longreads '/path/to/*.fastq.gz' \
        --samples 'data/samples.tsv' \
        --outdir 'results'

See `conf/` for example profiles (conda, docker, singularity, slurm). Tune resources via `nextflow.config` using `withName:` blocks for process-specific CPU, memory, and time.

## Citation

If you use Soil Metagenome Pipeline in your research, please cite the corresponding preprint:

- bioRxiv abstract: https://www.biorxiv.org/content/10.1101/2025.05.28.656579v1.abstract
- DOI: https://doi.org/10.1101/2025.05.28.656579

A machine-readable citation file (CITATION.cff) is included in the repository root. GitHub will display a "Cite this repository" button.

## License

This project is licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). See the LICENSE file for the full text.
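
## Example: process-specific resources

The `withName:` tuning mentioned under Quick start could look like the sketch below for `nextflow.config`. The process names (`FLYE`, `GTDBTK`) and all resource values are assumptions for illustration only, not names or settings taken from this repository; check the actual process names under `modules/` before adapting it.

```groovy
// Hypothetical process names and values; replace them with the
// processes actually defined under modules/ and with limits that
// fit your hardware or scheduler.
process {
    // Assumed name for the long-read assembly step
    withName: 'FLYE' {
        cpus   = 16
        memory = '128 GB'
        time   = '48h'
    }

    // Assumed name for the taxonomy assignment step
    withName: 'GTDBTK' {
        cpus   = 8
        memory = '96 GB'
        time   = '24h'
    }
}
```

If you prefer not to edit `nextflow.config` directly, a fragment like this can be kept in a separate file (for example a hypothetical `custom.config`) and layered on at run time with Nextflow's `-c custom.config` option.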