Gene Expression Analysis Workflow in Complex Microbiomes
Workflow Schema
1. Overview of the Workflow
This analysis focuses on transcriptional profiling of complex microbiomes. It requires both metagenomic and metatranscriptomic NGS short-read data, along with annotation reference information (e.g., ribosomal RNA sequences and reference protein databases, listed below). The metagenomic and metatranscriptomic reads should be derived from the same microbiome samples. The metagenomic reads are assembled into contigs, which then serve as reference sequences for mapping both types of reads, enabling gene-level quantification.
2. Minimum Requirements
Docker
cwltool
3. Workflow Components
This analysis workflow is composed of three sub-workflows: metagenomic contig assembly, read mapping, and annotation.
Metagenomic contig assembly
In this process, the following steps are performed:
- Assembly using MEGAHIT.
- Protein sequence prediction using Prodigal.
- Statistical analysis of contigs using SeqKit.
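The contig statistics step can be pictured with a small sketch. SeqKit reports summary metrics such as sequence count, total length and, in its extended output, N50. The Python below computes the same kind of numbers from FASTA-formatted lines. The function names `read_fasta_lengths` and `contig_stats` and the example contigs are hypothetical and only illustrate the idea; this is not the workflow's own implementation.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the sequence length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def contig_stats(lengths: List[int]) -> Dict[str, int]:
    """Summarize contig lengths: count, total, min, max, and N50."""
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0, "max_len": 0, "N50": 0}
    total = sum(lengths)
    running = 0
    n50 = 0
    # N50: walking from the longest contig downward, the length of the contig
    # at which the cumulative sum first reaches half of the total length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "num_seqs": len(lengths),
        "sum_len": total,
        "min_len": min(lengths),
        "max_len": max(lengths),
        "N50": n50,
    }


# Hypothetical toy assembly with contigs of 8, 6, 4, and 2 bases.
example = [
    ">contig_1", "ACGTACGT",
    ">contig_2", "ACGTAC",
    ">contig_3", "ACGT",
    ">contig_4", "AC",
]
stats = contig_stats(read_fasta_lengths(example))
print(stats)
```

For real assemblies the FASTA file would be read from disk, for example by passing an open file handle to `read_fasta_lengths`.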
Read mapping
In this process, the following steps are performed:
- Mapping using BWA MEM.
- Statistical analysis of mapping results using SAMtools.
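As an illustration of the mapping statistics step, the sketch below parses text in the format printed by `samtools flagstat` and derives a mapping rate. The README does not state which SAMtools subcommand the workflow runs, so the use of `flagstat` is an assumption. The helper name `parse_flagstat` and the example counts are hypothetical.

```python
import re
from typing import Dict

# A flagstat line looks like "<passed> + <failed> <category> (optional details)".
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+?)(?: \(.*)?$")


def parse_flagstat(text: str) -> Dict[str, int]:
    """Map each flagstat category to its QC-passed read count."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match:
            counts[match.group(3)] = int(match.group(1))
    return counts


# Hypothetical flagstat output (abbreviated); the numbers are made up.
example_flagstat = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
"""

counts = parse_flagstat(example_flagstat)
# Fraction of all records that mapped to the contigs.
mapping_rate = counts["mapped"] / counts["in total"] if counts["in total"] else 0.0
print(counts)
print(f"mapping rate: {mapping_rate:.2%}")
```

Comparing this rate between the metagenomic and metatranscriptomic read sets gives a quick check of how well the assembled contigs represent each library.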
Annotation
In this process, the following steps are performed:
- Searching for contaminating ribosomal RNA sequences using BLAST.
- Searching reference proteins using DIAMOND.
- Creating a GTF-formatted file containing the annotation information.
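The GTF creation step can be sketched as follows. Prodigal writes gene coordinates into its FASTA headers in the form `>ID # start # end # strand # attributes`, where the ID is the contig name plus a gene index. The function below turns one such header into a nine-column GTF line. The function name `prodigal_header_to_gtf`, the `product` attribute, and the example header are assumptions for illustration; the attributes the workflow actually writes may differ.

```python
def prodigal_header_to_gtf(header: str, product: str = "") -> str:
    """Convert one Prodigal FASTA header into a tab-separated GTF line."""
    fields = header.lstrip(">").split(" # ")
    gene_id = fields[0]
    start, end = int(fields[1]), int(fields[2])
    # Prodigal encodes the strand as 1 (forward) or -1 (reverse).
    strand = "+" if fields[3].strip() == "1" else "-"
    # The gene ID is "<contig>_<gene index>", so drop the last "_" suffix.
    contig = gene_id.rsplit("_", 1)[0]
    attributes = f'gene_id "{gene_id}"; transcript_id "{gene_id}";'
    if product:
        # Hypothetical attribute carrying the best DIAMOND hit description.
        attributes += f' product "{product}";'
    columns = [
        contig,      # seqname: contig the gene lies on
        "Prodigal",  # source
        "CDS",       # feature type
        str(start),
        str(end),
        ".",         # score (not used here)
        strand,
        "0",         # frame
        attributes,
    ]
    return "\t".join(columns)


# Hypothetical header in Prodigal's format for a MEGAHIT-style contig name.
example_header = ">k141_1_1 # 2 # 181 # 1 # ID=1_1;partial=10;start_type=Edge"
gtf_line = prodigal_header_to_gtf(example_header, product="example protein")
print(gtf_line)
```

Keeping the contig name in the first column matters because read counting tools match GTF features to alignments by reference sequence name.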
4. Test Dataset and Your Own Dataset
- If you are testing with the following files, please place them in the Data directory.
- You can also obtain metagenomic and metatranscriptomic FASTQ files either by downloading them from public databases or by using your own samples, and then place them in your Data directory.
Metagenome data
Metatranscriptome data
5. Annotation References
These reference files are used in the BLAST and DIAMOND processes. The downloaded files are available in the Data directory (accessed on September 17, 2025). If you wish to use the latest versions of the references, please download them using the following scripts.
# rRNA data from the SILVA website (release 138.1; accessed on September 17, 2025)
curl -O https://ftp.arb-silva.de/release_138.1/Exports/SILVA_138.1_LSUParc_tax_silva.fasta.gz
curl -O https://ftp.arb-silva.de/release_138.1/Exports/SILVA_138.1_SSUParc_tax_silva.fasta.gz
# Swiss-Prot data from UniProt for the diamond makedb process (accessed on September 17, 2025)
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
# Pfam data from InterPro (accessed on September 17, 2025) for the hmmscan process. Integration of HMMER into this workflow is ongoing; because this process is time-consuming, the step will be optional.
# curl -O https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
6. Command Execution
We recommend creating a cache directory to store cache and intermediate files. Since both metagenomic and metatranscriptomic reads are mapped to the same contigs, the assembled results can be reused to reduce analytical costs. cwltool reuses cached results when the --cachedir option is specified.
# main workflow (replace <cache_dir> and <output_dir> with your own directories)
cwltool --debug --cachedir <cache_dir> --outdir <output_dir> ./Worlkflow/main_w.cwl ./config/main_w.yml
7. Underlying Shell and Python Scripts
GitHub: https://github.com/RyoMameda/workflow
This workflow was developed at the DBCLS BioHackathon 2025. A preprint describing the project is available at https://doi.org/10.37044/osf.io/qd5sz_v1.
Inputs
ID | Name | Description | Type |
---|---|---|---|
SW_THREADS | threads | number of threads to use in this subworkflow | |
SW_EVALUE | evalue | E-value threshold of BLASTP (diamond) and BLASTN alignment | |
SW_BLASTN_rRNA_FASTA_FILE1 | SILVA_138.1_LSUParc_tax_silva | SILVA_138.1_LSUParc_tax_silva | |
SW_BLASTN_rRNA_FASTA_FILE2 | SILVA_138.1_SSUParc_tax_silva | SILVA_138.1_SSUParc_tax_silva | |
SW_PRODIGAL_RESULT_DNA_FASTA_FILE | input fasta file (nucleotide sequence generated by prodigal process) | predicted protein coding sequences produced by Prodigal process | |
SW_PRODIGAL_RESULT_PROTEIN_FASTA_FILE | predicted protein coding sequences produced by Prodigal process | predicted protein coding sequences produced by Prodigal process | |
SW_DIAMOND_INDEX_FILE | Protein fasta file for diamond index | Protein fasta file for diamond index | |
SW_OUTPUT_GTF_FILE_NAME | Output GTF file name | Output GTF file name | |
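To make the inputs above concrete, here is a minimal sketch that builds a job file for this sub-workflow as JSON, a format cwltool accepts for job files alongside YAML. The input IDs come from the table. Everything else is an assumption: the table lists no types, so the sketch guesses an integer for threads, a float for the E-value, CWL `File` objects for the FASTA inputs, and a string for the output name. All paths and values are hypothetical placeholders.

```python
import json
from typing import Dict


def cwl_file(path: str) -> Dict[str, str]:
    """Build a CWL File object pointing at a local path."""
    return {"class": "File", "path": path}


# All paths and values below are hypothetical placeholders.
job = {
    "SW_THREADS": 4,       # assumed to be an integer
    "SW_EVALUE": 1e-5,     # assumed to be a float
    "SW_BLASTN_rRNA_FASTA_FILE1": cwl_file("Data/SILVA_138.1_LSUParc_tax_silva.fasta"),
    "SW_BLASTN_rRNA_FASTA_FILE2": cwl_file("Data/SILVA_138.1_SSUParc_tax_silva.fasta"),
    "SW_PRODIGAL_RESULT_DNA_FASTA_FILE": cwl_file("results/prodigal_genes.fna"),
    "SW_PRODIGAL_RESULT_PROTEIN_FASTA_FILE": cwl_file("results/prodigal_proteins.faa"),
    "SW_DIAMOND_INDEX_FILE": cwl_file("Data/uniprot_sprot.fasta"),
    "SW_OUTPUT_GTF_FILE_NAME": "annotation.gtf",  # assumed to be a string
}

job_text = json.dumps(job, indent=2)
print(job_text)
```

The printed JSON could be saved to a file and passed to cwltool in place of a YAML job file, provided the assumed types match the sub-workflow's actual definitions.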
Steps
ID | Name | Description |
---|---|---|
PROCESS_BLASTN_rRNA | BLASTN rRNA annotation process | n/a |
PROCESS_DIAMOND_PROTEIN | n/a | n/a |
PROCESS_GTF_CREATION | n/a | n/a |
Outputs
ID | Name | Description | Type |
---|---|---|---|
OUTPUT_GTF_FILE | Output GTF file | Output GTF file | |
Version History
main @ 95a536b (earliest) Created 19th Sep 2025 at 07:27 by Sora Yonezawa
update README
Created: 19th Sep 2025 at 07:27
Last updated: 2nd Oct 2025 at 14:25
