Workflow Type: Common Workflow Language
Stable


Gene Expression Analysis Workflow in Complex Microbiomes

 

Workflow Schema

[Figure: workflow schema diagram]

 

1. Overview of the Workflow

This analysis focuses on transcriptional profiling of complex microbiomes. It requires both metagenomic and metatranscriptomic NGS short-read data, along with annotation references (e.g., ribosomal RNA sequences and reference protein databases, listed below). The metagenomic and metatranscriptomic reads should be derived from the same microbiome samples. The metagenomic reads are assembled into contigs, which then serve as the reference sequences for mapping both types of reads, enabling gene-level quantification.

 

2. Minimum Requirements

  • Docker
  • cwltool

 

3. Workflow Components

This analysis workflow is composed of three sub-workflows: metagenomic contig assembly, read mapping, and annotation.

 

Metagenomic contig assembly

In this process, the following steps are performed:

  1. Assembly using MEGAHIT.
  2. Prediction of protein sequences using Prodigal.
  3. Statistical analysis of contigs using SeqKit.
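
Run by hand, the three steps above might look roughly like the sketch below. It is not taken from the workflow's CWL definitions: the file names, the output directory, the thread count, and the `run` dry-run helper are hypothetical, and the options the workflow actually passes to each tool may differ. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/bin/sh
set -eu

# Hypothetical inputs and outputs; adjust to your own files.
R1="Data/metagenome_R1.fastq.gz"
R2="Data/metagenome_R2.fastq.gz"
ASM_DIR="assembly_out"
THREADS=8

# Illustration-only helper: print the command when DRY_RUN=1, run it otherwise.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

assemble_and_predict() {
  # 1. Assemble paired-end metagenomic reads into contigs.
  run megahit -1 "$R1" -2 "$R2" -o "$ASM_DIR" -t "$THREADS"
  # 2. Predict genes in metagenome mode; write proteins (-a) and nucleotide CDS (-d).
  run prodigal -i "$ASM_DIR/final.contigs.fa" -p meta \
    -a "$ASM_DIR/proteins.faa" -d "$ASM_DIR/genes.fna" \
    -f gff -o "$ASM_DIR/genes.gff"
  # 3. Summarize contig statistics (counts, lengths, N50).
  run seqkit stats -a "$ASM_DIR/final.contigs.fa"
}

assemble_and_predict
```

Setting `DRY_RUN=0` executes the commands instead of printing them, provided the three tools are installed.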

 

Read mapping

In this process, the following steps are performed:

  1. Mapping using BWA-MEM.
  2. Statistical analysis of mapping results using SAMtools.
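
As a rough illustration of these two steps, the sketch below maps one paired-end library to the contigs and writes a summary. The function name `map_reads`, the file names, and the choice of `samtools flagstat` for the statistics are assumptions; the workflow may use other SAMtools subcommands or options.

```shell
#!/bin/sh
set -eu

# Map one paired-end library to the contigs and summarize the alignments.
# Arguments: contigs FASTA, read 1, read 2, output prefix, threads.
map_reads() {
  contigs=$1; r1=$2; r2=$3; prefix=$4; threads=$5
  # Build the BWA index once; both read types reuse it.
  [ -f "$contigs.bwt" ] || bwa index "$contigs"
  # Align with BWA-MEM and sort the alignments by coordinate.
  bwa mem -t "$threads" "$contigs" "$r1" "$r2" |
    samtools sort -@ "$threads" -o "$prefix.sorted.bam" -
  samtools index "$prefix.sorted.bam"
  # Basic mapping statistics (mapped reads, properly paired reads, ...).
  samtools flagstat "$prefix.sorted.bam" > "$prefix.flagstat.txt"
}

# Hypothetical file names.
CONTIGS="assembly_out/final.contigs.fa"
if command -v bwa >/dev/null 2>&1 && command -v samtools >/dev/null 2>&1 \
   && [ -f "$CONTIGS" ]; then
  map_reads "$CONTIGS" Data/metagenome_R1.fastq.gz Data/metagenome_R2.fastq.gz metagenome 8
  map_reads "$CONTIGS" Data/metatranscriptome_R1.fastq.gz Data/metatranscriptome_R2.fastq.gz metatranscriptome 8
else
  echo "bwa, samtools, or $CONTIGS not found; skipping mapping" >&2
fi
```

Both read types are mapped against the same contig index, which is why the assembly only needs to be computed once.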

 

Annotation

In this process, the following steps are performed:

  1. Searching for contaminating ribosomal RNA sequences using BLAST.
  2. Searching against reference proteins using DIAMOND.
  3. Creation of a GTF-formatted file containing the annotation information.
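
The GTF creation step is not described in detail here. To illustrate the general idea only, the sketch below derives GTF records from Prodigal FASTA headers, which encode the start, end, and strand of each predicted gene. The function name `prodigal_headers_to_gtf` and the attribute layout are hypothetical; this is not the workflow's own script (see the repository linked in section 7), and it does not merge the BLAST or DIAMOND hits.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: turn Prodigal FASTA headers such as
#   >k141_12_3 # 5 # 100 # -1 # ID=1_3;partial=00
# into GTF CDS records
# (contig, source, feature, start, end, score, strand, frame, attributes).
prodigal_headers_to_gtf() {
  awk -F ' # ' '
    /^>/ {
      gene = substr($1, 2)          # gene ID without the leading ">"
      contig = gene
      sub(/_[0-9]+$/, "", contig)   # drop the per-contig gene index
      strand = ($4 ~ /^-/) ? "-" : "+"
      printf "%s\tProdigal\tCDS\t%s\t%s\t.\t%s\t0\tgene_id \"%s\"; transcript_id \"%s\";\n",
             contig, $2, $3, strand, gene, gene
    }
  ' "$1"
}

# Demo on a single made-up record.
demo="$(mktemp)"
printf '>k141_12_3 # 5 # 100 # -1 # ID=1_3;partial=00\nMKV*\n' > "$demo"
prodigal_headers_to_gtf "$demo"
rm -f "$demo"
```

On real data the function would be called on the Prodigal protein or nucleotide FASTA, for example `prodigal_headers_to_gtf assembly_out/proteins.faa > annotation.gtf` (paths hypothetical).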

 

4. Test Dataset and Your Own Dataset

  • To test the workflow with the datasets listed below, place the files in the Data directory.
  • Alternatively, obtain metagenomic and metatranscriptomic FASTQ files from public databases or from your own samples, and place them in the Data directory.

Metagenome data

Metatranscriptome data

 

5. Annotation References

These reference files are used in the BLAST and DIAMOND steps. The downloaded files are available in the Data directory (accessed on September 17, 2025). To use the latest versions of the references, download them with the following commands.

# rRNA data from the SILVA website (release 138.1; accessed on September 17, 2025)
curl -O https://ftp.arb-silva.de/release_138.1/Exports/SILVA_138.1_LSUParc_tax_silva.fasta.gz
curl -O https://ftp.arb-silva.de/release_138.1/Exports/SILVA_138.1_SSUParc_tax_silva.fasta.gz

# Swiss-Prot data from UniProt for the diamond makedb process (accessed on September 17, 2025)
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

# Pfam data from InterPro (accessed on September 17, 2025) for the hmmscan process. Integrating HMMER into this workflow is in progress; because that process is time-consuming, the step will be optional.
# curl -O https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
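
The downloads above are gzip-compressed. Assuming the workflow expects plain FASTA inputs (the input names listed for the annotation sub-workflow carry no .gz suffix, but this is not confirmed here), the files could be checked and decompressed as in the sketch below. The helper name `unpack_reference` is hypothetical.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: verify a .gz download and write the decompressed
# file next to it, keeping the original archive.
unpack_reference() {
  gz=$1
  gzip -t "$gz"                    # fails on a truncated or corrupt download
  gzip -dc "$gz" > "${gz%.gz}"
  # Report the number of FASTA records as a quick sanity check.
  printf '%s\t%s records\n' "${gz%.gz}" "$(grep -c '^>' "${gz%.gz}")"
}

for f in SILVA_138.1_LSUParc_tax_silva.fasta.gz \
         SILVA_138.1_SSUParc_tax_silva.fasta.gz \
         uniprot_sprot.fasta.gz; do
  if [ -f "$f" ]; then
    unpack_reference "$f"
  else
    echo "$f not found; download it first" >&2
  fi
done
```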

 

6. Command Execution

We recommend creating a cache directory to store cache and intermediate files. Because both metagenomic and metatranscriptomic reads are mapped to the same contigs, the assembly results can be reused to reduce computational cost. cwltool reuses cached results when the --cachedir option is specified.

# main workflow

cwltool --debug --cachedir <cache_dir> --outdir <output_dir> ./Worlkflow/main_w.cwl ./config/main_w.yml
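
A filled-in invocation might look like the sketch below. The directory names `./cwl_cache` and `./cwl_out` are hypothetical; the workflow and job file paths are the ones given above. The command runs only if cwltool and both files are present.

```shell
#!/bin/sh
set -eu

# Hypothetical directory names; any writable locations will do.
CACHE_DIR="./cwl_cache"
OUT_DIR="./cwl_out"
# Paths as given in the command above.
WORKFLOW="./Worlkflow/main_w.cwl"
JOB="./config/main_w.yml"

mkdir -p "$CACHE_DIR" "$OUT_DIR"

if command -v cwltool >/dev/null 2>&1 && [ -f "$WORKFLOW" ] && [ -f "$JOB" ]; then
  # Re-running with the same --cachedir lets cwltool reuse finished steps
  # (for example the assembly) instead of recomputing them.
  cwltool --debug --cachedir "$CACHE_DIR" --outdir "$OUT_DIR" "$WORKFLOW" "$JOB"
else
  echo "cwltool or the workflow files were not found; nothing was run" >&2
fi
```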

 

7. Underlying Shell and Python Scripts

GitHub: https://github.com/RyoMameda/workflow

This workflow was developed at DBCLS BioHackathon 2025; a preprint describing the project is available at https://doi.org/10.37044/osf.io/qd5sz_v1.


Inputs

ID | Name | Description | Type
SW_THREADS | threads | number of threads to use in this subworkflow | int
SW_EVALUE | evalue | E-value threshold of BLASTP (diamond) and BLASTN alignment | float
SW_BLASTN_rRNA_FASTA_FILE1 | SILVA_138.1_LSUParc_tax_silva | SILVA_138.1_LSUParc_tax_silva | File
SW_BLASTN_rRNA_FASTA_FILE2 | SILVA_138.1_SSUParc_tax_silva | SILVA_138.1_SSUParc_tax_silva | File
SW_PRODIGAL_RESULT_DNA_FASTA_FILE | input fasta file (nucleotide sequence generated by prodigal process) | predicted protein coding sequences produced by Prodigal process | File
SW_PRODIGAL_RESULT_PROTEIN_FASTA_FILE | predicted protein coding sequences produced by Prodigal process | predicted protein coding sequences produced by Prodigal process | File
SW_DIAMOND_INDEX_FILE | Protein fasta file for diamond index | Protein fasta file for diamond index | File
SW_OUTPUT_GTF_FILE_NAME | Output GTF file name | Output GTF file name | string
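
A CWL job file for this annotation sub-workflow supplies one value per input ID listed above. The sketch below writes such a file. The file name `annotation_job.yml` and every value in it (paths, thread count, E-value, output name) are hypothetical placeholders; the job file shipped with the workflow (./config/main_w.yml) remains the reference.

```shell
#!/bin/sh
set -eu

# Write a hypothetical CWL job file; every value below is a placeholder.
cat > annotation_job.yml <<'EOF'
SW_THREADS: 8
SW_EVALUE: 0.00001
SW_BLASTN_rRNA_FASTA_FILE1:
  class: File
  path: Data/SILVA_138.1_LSUParc_tax_silva.fasta
SW_BLASTN_rRNA_FASTA_FILE2:
  class: File
  path: Data/SILVA_138.1_SSUParc_tax_silva.fasta
SW_PRODIGAL_RESULT_DNA_FASTA_FILE:
  class: File
  path: assembly_out/genes.fna
SW_PRODIGAL_RESULT_PROTEIN_FASTA_FILE:
  class: File
  path: assembly_out/proteins.faa
SW_DIAMOND_INDEX_FILE:
  class: File
  path: Data/uniprot_sprot.fasta
SW_OUTPUT_GTF_FILE_NAME: annotation.gtf
EOF

echo "wrote annotation_job.yml"
```

File inputs use the `class: File` form with a `path`, while the int, float, and string inputs are given as plain scalars.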

Steps

ID | Name | Description
PROCESS_BLASTN_rRNA | BLASTN rRNA annotation process | n/a
PROCESS_DIAMOND_PROTEIN | n/a | n/a
PROCESS_GTF_CREATION | n/a | n/a

Outputs

ID | Name | Description | Type
OUTPUT_GTF_FILE | Output GTF file | Output GTF file | File

Version History

v1.0 (latest) Created 2nd Oct 2025 at 14:24 by Ryo Mameda

main


Frozen v1.0 7d911f0

main @ 95a536b (earliest) Created 19th Sep 2025 at 07:27 by Sora Yonezawa

update README


Frozen main 95a536b
Citation
Mameda, R., & Yonezawa, S. (2025). Gene Expression Analysis Workflow in Complex Microbiomes. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.1955.2

Created: 19th Sep 2025 at 07:27

Last updated: 2nd Oct 2025 at 14:25
