Gene Expression Analysis Workflow in Complex Microbiomes

Workflow Schema

For more details, see: Optimization of Mapping Tools and Investigation of Ribosomal RNA Influence for Data-Driven Gene Expression Analysis in Complex Microbiomes

1. Overview of the Workflow

This workflow performs transcriptional profiling of complex microbiomes. It requires both metagenomic and metatranscriptomic NGS short-read data, along with annotation references (e.g., ribosomal RNA sequences and reference protein databases, listed below). The metagenomic and metatranscriptomic reads should be derived from the same microbiome samples. The metagenomic reads are assembled into contigs, and both types of reads are then mapped to these contigs, enabling gene-level quantification.

2. Minimum Requirements

Docker
cwltool
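
Before running the workflow, it can help to confirm that both tools are discoverable on the PATH. The following is a minimal Python sketch, assuming the executables are named docker and cwltool; the helper name missing_tools is made up for illustration and is not part of the workflow.

```python
import shutil


def missing_tools(tools=("docker", "cwltool")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing required tools:", ", ".join(missing))
    else:
        print("Docker and cwltool were found on the PATH.")
```

This only checks that the executables exist; it does not verify that the Docker daemon is running or that your user is allowed to use it.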

3. Workflow Components

This analysis workflow is composed of three sub-workflows: metagenomic contig assembly, read mapping, and annotation.

Metagenomic contig assembly

In this process, the following steps are performed:

Assembly using MEGAHIT.
Prediction of protein sequences using Prodigal.
Statistical analysis of contigs using SeqKit.
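
The three steps above can be sketched as a list of command lines. This is an illustrative sketch, not the workflow's actual CWL definition: it assumes paired-end reads, Prodigal's metagenome mode (-p meta), and made-up output names (megahit_out, proteins.faa, genes.fna, genes.gff). The function only builds the commands and does not run them.

```python
def assembly_commands(read1, read2, outdir="megahit_out", threads=4):
    """Build illustrative MEGAHIT, Prodigal and SeqKit command lines."""
    # MEGAHIT writes the assembled contigs to <outdir>/final.contigs.fa.
    contigs = f"{outdir}/final.contigs.fa"
    return [
        # Assemble paired-end metagenomic reads.
        ["megahit", "-1", read1, "-2", read2, "-o", outdir, "-t", str(threads)],
        # Predict genes on the contigs (proteins with -a, nucleotides with -d).
        ["prodigal", "-i", contigs, "-p", "meta",
         "-a", "proteins.faa", "-d", "genes.fna", "-f", "gff", "-o", "genes.gff"],
        # Summarize contig statistics (count, lengths, N50 with -a).
        ["seqkit", "stats", "-a", contigs],
    ]


for command in assembly_commands("metagenome_1.fastq.gz", "metagenome_2.fastq.gz"):
    print(" ".join(command))
```

The file names passed in the last lines are placeholders; the parameters used by the workflow itself are defined in its CWL files.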

Read mapping

In this process, the following steps are performed:

Mapping using BWA-MEM.
Statistical analysis of mapping results using SAMtools.
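
Because both read types are mapped to the same contigs, the contig index only needs to be built once. The sketch below illustrates this as a list of command lines; it assumes paired-end reads and uses made-up sample labels and file names. The exact options used by the workflow may differ, and the function only builds the commands.

```python
def mapping_commands(contigs, samples, threads=4):
    """Build illustrative BWA-MEM and SAMtools command lines.

    `samples` maps a label to a (read1, read2) pair of FASTQ paths.
    """
    # The BWA index of the contigs is built once and shared by all samples.
    commands = [["bwa", "index", contigs]]
    for label, (read1, read2) in samples.items():
        sam = f"{label}.sam"
        bam = f"{label}.sorted.bam"
        commands += [
            ["bwa", "mem", "-t", str(threads), "-o", sam, contigs, read1, read2],
            ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
            ["samtools", "index", bam],
            # flagstat reports counts such as mapped and properly paired reads.
            ["samtools", "flagstat", bam],
        ]
    return commands


example = mapping_commands(
    "final.contigs.fa",
    {
        "metagenome": ("mg_1.fastq.gz", "mg_2.fastq.gz"),
        "metatranscriptome": ("mt_1.fastq.gz", "mt_2.fastq.gz"),
    },
)
for command in example:
    print(" ".join(command))
```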

Annotation

In this process, the following steps are performed:

Searching for contaminating ribosomal RNA sequences using BLAST.
Searching reference proteins using DIAMOND.
Creation of a GTF-formatted file containing the annotation information.
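
To illustrate the GTF creation step, the sketch below converts one Prodigal FASTA header into one GTF line. It relies on Prodigal's header layout (ID # start # end # strand # attributes). How the workflow actually merges BLAST and DIAMOND hits into the GTF is not described here, so the optional product attribute and the function name are assumptions for illustration only.

```python
def prodigal_header_to_gtf(header, product=None):
    """Convert a Prodigal FASTA header into a single GTF line."""
    fields = header.lstrip(">").split(" # ")
    gene_id, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[3]
    # Prodigal names genes "<contig>_<index>", so drop the last "_<index>".
    contig = gene_id.rsplit("_", 1)[0]
    attributes = f'gene_id "{gene_id}"; transcript_id "{gene_id}";'
    if product:
        # Hypothetical attribute carrying the best reference-protein hit.
        attributes += f' product "{product}";'
    columns = [
        contig, "Prodigal", "CDS", str(start), str(end),
        ".", "+" if strand == "1" else "-", "0", attributes,
    ]
    return "\t".join(columns)


print(prodigal_header_to_gtf(">k141_10_2 # 5 # 400 # -1 # ID=1_2;partial=00"))
```

A file of such lines, with gene_id attributes, is the kind of annotation that read-counting tools typically expect for gene-level quantification.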

4. Test Dataset and Your Own Dataset

To test the workflow with the following files, place them in the Data directory.
You can also use other metagenomic and metatranscriptomic FASTQ files, either downloaded from public databases or generated from your own samples; place them in the Data directory as well.

Metagenome data

SRR27548858

Metatranscriptome data

5. Annotation References

These reference files are used in the BLAST and DIAMOND processes. The downloaded files are available in the Data directory (accessed on September 17, 2025). If you wish to use the latest versions of the references, please download them using the following scripts.

# rRNA data from the SILVA website (release 138.1; accessed on September 17, 2025)
curl -O https://ftp.arb-silva.de/release_138.1/Exports/SILVA_138.1_LSUParc_tax_silva.fasta.gz
curl -O https://ftp.arb-silva.de/release_138.1/Exports/SILVA_138.1_SSUParc_tax_silva.fasta.gz

# Swiss-Prot data from UniProt for the diamond makedb process (accessed on September 17, 2025)
curl -O https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

# Pfam data from InterPro (accessed on September 17, 2025) for the hmmscan process. Integration of HMMER into this workflow is ongoing; because this process is time-consuming, the step will be optional.
# curl -O https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
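
The files above are downloaded gzip-compressed. If a tool in your setup needs plain FASTA input, they can be decompressed with a few lines of Python. This is a minimal sketch; whether the workflow expects compressed or uncompressed references is not stated here, so check the configuration file before using it. The function name gunzip is a local helper, not part of the workflow.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(path):
    """Decompress `path` (ending in .gz) next to the original and return the new path."""
    src = Path(path)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got: {src}")
    dst = src.with_suffix("")  # e.g. uniprot_sprot.fasta.gz -> uniprot_sprot.fasta
    # Stream the data so large reference files are not loaded into memory.
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, gunzip("Data/uniprot_sprot.fasta.gz") would write Data/uniprot_sprot.fasta, assuming the file was downloaded into the Data directory.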

6. Command Execution

We recommend creating a cache directory to store cached and intermediate files. Because both metagenomic and metatranscriptomic reads are mapped to the same contigs, the assembly results can be reused to reduce computational cost. cwltool reuses cached results when the --cachedir option is specified.

# main workflow

cwltool --debug --cachedir <cache_dir> --outdir <output_dir> ./Worlkflow/main_w.cwl ./config/main_w.yml
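
The same invocation can be scripted so that the cache and output directories exist before cwltool starts. This is a sketch under assumptions: the directory names cache and results are placeholders chosen for illustration, and the workflow and configuration paths are copied from the command above.

```python
import subprocess
from pathlib import Path


def build_cwltool_command(cache_dir, out_dir,
                          workflow="./Worlkflow/main_w.cwl",
                          job="./config/main_w.yml"):
    """Return the cwltool argument list with explicit cache and output directories."""
    return [
        "cwltool", "--debug",
        "--cachedir", str(cache_dir),
        "--outdir", str(out_dir),
        workflow, job,
    ]


def run_workflow(cache_dir="cache", out_dir="results"):
    """Create both directories, then run cwltool and fail loudly on errors."""
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    return subprocess.run(build_cwltool_command(cache_dir, out_dir), check=True)
```

Calling run_workflow() a second time with the same cache directory lets cwltool reuse finished steps, such as the assembly, instead of recomputing them.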

7. Underlying shell and Python scripts

GitHub: https://github.com/RyoMameda/workflow

This workflow was developed at the DBCLS BioHackathon 2025; the preprint describing the project is available at https://doi.org/10.37044/osf.io/qd5sz_v1.

Inputs

ID	Name	Description	Type
SW_THREADS	threads	number of threads to use in this subworkflow	int
SW_EVALUE	evalue	E-value threshold of BLASTP (diamond) and BLASTN alignment	float
SW_BLASTN_rRNA_FASTA_FILE1	SILVA_138.1_LSUParc_tax_silva	SILVA_138.1_LSUParc_tax_silva	File
SW_BLASTN_rRNA_FASTA_FILE2	SILVA_138.1_SSUParc_tax_silva	SILVA_138.1_SSUParc_tax_silva	File
SW_PRODIGAL_RESULT_DNA_FASTA_FILE	input fasta file (nucleotide sequence generated by prodigal process)	predicted protein coding sequences produced by Prodigal process	File
SW_PRODIGAL_RESULT_PROTEIN_FASTA_FILE	predicted protein coding sequences produced by Prodigal process	predicted protein coding sequences produced by Prodigal process	File
SW_DIAMOND_INDEX_FILE	Protein fasta file for diamond index	Protein fasta file for diamond index	File
SW_OUTPUT_GTF_FILE_NAME	Output GTF file name	Output GTF file name	string
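
As an illustration of how the inputs above fit together, the sketch below builds a CWL job object and prints it as JSON, which cwltool accepts in place of a YAML job file. Only the input IDs come from the table; every value (thread count, E-value, file paths, output name) is a placeholder assumption, not the workflow's default.

```python
import json


def file_entry(path):
    """Describe a file input using the CWL File object convention."""
    return {"class": "File", "path": path}


def annotation_job(data_dir="Data", threads=8, evalue=1e-5):
    """Build an example job object; all values are placeholders."""
    return {
        "SW_THREADS": threads,
        "SW_EVALUE": evalue,
        "SW_BLASTN_rRNA_FASTA_FILE1": file_entry(f"{data_dir}/SILVA_138.1_LSUParc_tax_silva.fasta"),
        "SW_BLASTN_rRNA_FASTA_FILE2": file_entry(f"{data_dir}/SILVA_138.1_SSUParc_tax_silva.fasta"),
        "SW_PRODIGAL_RESULT_DNA_FASTA_FILE": file_entry("genes.fna"),
        "SW_PRODIGAL_RESULT_PROTEIN_FASTA_FILE": file_entry("proteins.faa"),
        "SW_DIAMOND_INDEX_FILE": file_entry(f"{data_dir}/uniprot_sprot.fasta"),
        "SW_OUTPUT_GTF_FILE_NAME": "annotation.gtf",
    }


print(json.dumps(annotation_job(), indent=2))
```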

Steps

ID	Name	Description
PROCESS_BLASTN_rRNA	BLASTN rRNA annotation process	n/a
PROCESS_DIAMOND_PROTEIN	n/a	n/a
PROCESS_GTF_CREATION	n/a	n/a

Outputs

ID	Name	Description	Type
OUTPUT_GTF_FILE	Output GTF file	Output GTF file	File

Version History

v1.0 (latest) Created 2nd Oct 2025 at 14:24 by Ryo Mameda

main

Frozen v1.0 7d911f0

main @ 95a536b (earliest) Created 19th Sep 2025 at 07:27 by Sora Yonezawa

update README

Frozen main 95a536b
