Vignette demonstrating microbial-like sequence discovery workflow¶
Nikolay Oskolkov, SciLifeLab, NBIS Long Term Support, nikolay.oskolkov@scilifelab.se¶
Abstract
In this vignette, we will demonstrate how to prepare and run the workflow detecting microbial-like sequeneces in eukaryotic reference genomes. The workflow accepts a eukaryotic reference in FASTA-format and outputs coordinates of microbial-like regions together with microbial species annotation.
Table of Contents¶

Prepare input files ¶
For demonstration purposes we are going to use the reference genome of Bathycoccus prasinos which is a green algae (picoplankton) eukaryotic organism related to plants. The reference genome GCF_002220235.1 of this algae is small (15 Mb) and therefore computationally easy to handle. The worflow together with the test-files is available at the following github address: https://github.com/NikolayOskolkov/MCWorkflow. Let us first clone th github repository and inspect its content:
cd /home/nikolay
git clone https://github.com/NikolayOskolkov/MCWorkflow
cd MCWorkflow
Cloning into 'MCWorkflow'... remote: Enumerating objects: 86, done. remote: Counting objects: 100% (86/86), done. remote: Compressing objects: 100% (68/68), done. remote: Total 86 (delta 44), reused 38 (delta 13), pack-reused 0 (from 0) Unpacking objects: 100% (86/86), done.
ls -l
total 7640 drwxrwxr-x 2 nikolay nikolay 4096 aug 4 12:01 data -rw-rw-r-- 1 nikolay nikolay 152 aug 4 12:01 environment.yaml -rwxrwxr-x 1 nikolay nikolay 7089 aug 4 12:01 extract_coords_micr_contam.R -rw-rw-r-- 1 nikolay nikolay 3766396 aug 4 12:01 GTDB_fna2name.txt -rw-rw-r-- 1 nikolay nikolay 3675565 aug 4 12:01 GTDB_sliced_seqs_sliding_window.fna.gz drwxrwxr-x 2 nikolay nikolay 4096 aug 4 12:01 images -rw-rw-r-- 1 nikolay nikolay 4 aug 4 12:01 LICENSE.txt -rw-rw-r-- 1 nikolay nikolay 1191 aug 4 12:01 main.nf -rwxrwxr-x 1 nikolay nikolay 5269 aug 4 12:01 micr_cont_detect.sh -rw-rw-r-- 1 nikolay nikolay 860 aug 4 12:01 nextflow.config -rw-rw-r-- 1 nikolay nikolay 2388 aug 4 12:01 README.md -rw-rw-r-- 1 nikolay nikolay 306912 aug 4 12:01 vignette.html -rw-rw-r-- 1 nikolay nikolay 23410 aug 4 12:01 vignette.ipynb
The workflow consists of four main files:
- micr_cont_detect.sh - main shell-script that builds eukaryotic reference genome index, runs alignments and does pre- and post-processing
- extract_coords_micr_contam.R - helping R script that extracts coordinates of covered regions and annotates them with microbial species
- GTDB_sliced_seqs_sliding_window.fna.gz - prepared 60 bp long GTDB microbial pseudo-reads to be aligned to eukaryotic references
- GTDB_fna2name.txt - annotation of pseudo-reads with scientific microbial names according to GTDB taxonomy
Currently it is important that all the four files are located in the same directory. We recommend to perform all the analysis in this cloned github repository in order to avoid complications with changing the paths and hacking the codes. Please note that the GTDB_sliced_seqs_sliding_window.fna.gz file is a small (3.7 MB) demonstration subset of the complete dataset (234 GB) available at the SciLifeLab Figshare https://doi.org/10.17044/scilifelab.28491956 which should be used for real-world applications, i.e. for your research you will have to replace the toy-dataset from the github repository by the large one (with identical file name) from the SciLifeLab Figshare. The toy-dataset includes pseudo-reads from only one bacterium named as UBA796 sp002707085 (with corresponding reference 5oFfr3yp0G.fna) belonging to the phylum Myxococcota in the GTDB databse.
Next, we recommend to place the eukaryotic reference genomes to be profiled for microbial-like sequences into the folder data within the cloned repository as shown above. Right now, the "data"-folder is already present in the cloned repository and the Bathycoccus prasinos reference genome GCF_002220235.fna.gz has been already placed within this folder. Let us double-check that the eukaryotic reference genome is indeed inside the "data"-folder:
cd data && ls -l && cd ..
total 4832
-rwxrwxr-x 1 nikolay nikolay 4941250 aug 4 12:01 GCF_002220235.fna.gz
Finally, we make sure that we are going to start the workflow, i.e. call the micr_cont_detect.sh script from the cloned github repository:
pwd
/home/nikolay/MCWorkflow
Now everything is ready for launching the workflow.
Running workflow ¶
Before you start the workflow, please make sure that R, awk, Bowtie2 and samtools are installed and available in your path. If you run the workflow on an HPC cluster, please make sure that you have loaded the corresponding modules. The workflow has the following format:
./micr_cont_detect.sh REF_GENOME INPUT_DIR TYPE_OF_PSEUDO_READS THREADS PSEUDO_READS N_ALLOWED_MULTIMAPPERS
where:
- REF_GENOME - gzipped eukaryotic reference genome in FASTA-format (no path is needed, just the name of the file)
- INPUT_DIR - directory containing the eukaryotic reference genome (here you need to provide the absolute path)
- TYPE_OF_PSEUDO_READS - whether RefSeq or GTDB or human pseudo-reads are used, can only be "RefSeq" or "GTDB" or "human"
- THERADS - number of threads available
- PSEUDO_READS - GTDB, RefSeq or human pseudo-reads provided together with workflow (no path is needed, just the name of the file)
- N_ALLOWED_MULTIMAPPERS - number of multi-mapping pseudo-reads allowed by Bowtie2
Now we can start the workflow with the following command line:
./micr_cont_detect.sh GCF_002220235.fna.gz /home/nikolay/MCWorkflow/data GTDB 4 GTDB_sliced_seqs_sliding_window.fna.gz 10
PREPARING FILES FOR ANALYSIS OF GCF_002220235.fna.gz REFERENCE GENOME BUILDING BOWTIE2 INDEX FOR GCF_002220235.fna.gz REFERENCE GENOME ALIGNING MICROBIAL READS WITH BOWTIE2 TO GCF_002220235.fna.gz REFERENCE GENOME [bam_sort_core] merging from 0 files and 4 in-memory blocks... RANKING GCF_002220235.fna.gz CONTIGS BY NUMBER OF MAPPED MICROBIAL READS COMPUTING BREADTH OF COVERAGE FOR EACH CONTIG AND COORDINATES OF MICROBIAL CONTAMINATION FOR GCF_002220235.fna.gz REFERENCE GENOME NC_023997.1 CONTIG OF GCF_002220235.fna.gz EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION DELETING BAM AND COMPRESSING BOC FILES NC_024004.1 CONTIG OF GCF_002220235.fna.gz EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION DELETING BAM AND COMPRESSING BOC FILES NC_024008.1 CONTIG OF GCF_002220235.fna.gz EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION DELETING BAM AND COMPRESSING BOC FILES NC_023992.1 CONTIG OF GCF_002220235.fna.gz EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION DELETING BAM AND COMPRESSING BOC FILES NC_024006.1 CONTIG OF GCF_002220235.fna.gz EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION DELETING BAM AND COMPRESSING BOC FILES NC_024001.1 CONTIG OF GCF_002220235.fna.gz EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION DELETING BAM AND COMPRESSING BOC FILES NC_024000.1 CONTIG OF GCF_002220235.fna.gz EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION DELETING BAM AND COMPRESSING BOC FILES NC_024003.1 CONTIG OF GCF_002220235.fna.gz EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION DELETING BAM AND COMPRESSING BOC FILES AGGREGATING RESULTS FOR GCF_002220235.fna.gz REFERENCE GENOME AND CLEANING COMPUTING LIST OF MOST ABUNDANT MICROBES CONTMAMINATING GCF_002220235.fna.gz REFERENCE GENOME ANALYSIS FOR GCF_002220235.fna.gz REFERENCE GENOME FINISHED SUCCESSFULLY
For the toy-dataset and the small eukaryotic reference genome, the workflow takes only a few seconds to finish. Please note that for real-world applications, the alignment step is the most time consuming. Since the full GTDB sliced microbial pseudo-reads data set includes 26 billion reads, to our experience, the alignment to e.g. mammalian reference genomes can take up to 48 hours on an HPC compute node with 20 cores. Multi-threading is crucial here, more available threads may considerable speed up the workflow execution.
Interepreting results ¶
Let us now go through the main outputs-files of the workflow. First of all, we see the bt2l Bowtie2 index-files within the data-folder, and the verbose output of bowtie2-build command was written to bowtie2-build.log file.
cd data && ls -l
total 49432 -rw-rw-r-- 1 nikolay nikolay 33688 aug 4 12:09 bowtie2-build.log -rwxrwxr-x 1 nikolay nikolay 4941250 aug 4 12:01 GCF_002220235.fna.gz -rw-rw-r-- 1 nikolay nikolay 13403961 aug 4 12:08 GCF_002220235.fna.gz.1.bt2l -rw-rw-r-- 1 nikolay nikolay 7519068 aug 4 12:08 GCF_002220235.fna.gz.2.bt2l -rw-rw-r-- 1 nikolay nikolay 709 aug 4 12:08 GCF_002220235.fna.gz.3.bt2l -rw-rw-r-- 1 nikolay nikolay 3759531 aug 4 12:08 GCF_002220235.fna.gz.4.bt2l drwxrwxr-x 2 nikolay nikolay 4096 aug 4 12:09 GCF_002220235.fna.gz_GTDB -rw-rw-r-- 1 nikolay nikolay 13403961 aug 4 12:09 GCF_002220235.fna.gz.rev.1.bt2l -rw-rw-r-- 1 nikolay nikolay 7519068 aug 4 12:09 GCF_002220235.fna.gz.rev.2.bt2l
Further, all the main outputs of the workflow were placed to the GCF_002220235.fna.gz_GTDB folder, let us navigate to the folder and check its content:
cd GCF_002220235.fna.gz_GTDB && ls -l
total 128 -rw-rw-r-- 1 nikolay nikolay 319 aug 4 12:09 contigs_abund_sorted_GTDB_GCF_002220235.fna.gz.txt -rw-rw-r-- 1 nikolay nikolay 477 aug 4 12:09 contigs_boc_sorted_GTDB_GCF_002220235.fna.gz.txt -rw-rw-r-- 1 nikolay nikolay 36549 aug 4 12:09 coords_micr_contam_GCF_002220235.fna.gz.txt -rw-rw-r-- 1 nikolay nikolay 23 aug 4 12:09 microbes_abundant_GTDB_GCF_002220235.fna.gz.txt -rw-rw-r-- 1 nikolay nikolay 1244 aug 4 12:09 NC_023992.1__GCF_002220235.fna.gz.boc.gz -rw-rw-r-- 1 nikolay nikolay 1858 aug 4 12:09 NC_023997.1__GCF_002220235.fna.gz.boc.gz -rw-rw-r-- 1 nikolay nikolay 2016 aug 4 12:09 NC_024000.1__GCF_002220235.fna.gz.boc.gz -rw-rw-r-- 1 nikolay nikolay 2171 aug 4 12:09 NC_024001.1__GCF_002220235.fna.gz.boc.gz -rw-rw-r-- 1 nikolay nikolay 2147 aug 4 12:09 NC_024003.1__GCF_002220235.fna.gz.boc.gz -rw-rw-r-- 1 nikolay nikolay 2404 aug 4 12:09 NC_024004.1__GCF_002220235.fna.gz.boc.gz -rw-rw-r-- 1 nikolay nikolay 2504 aug 4 12:09 NC_024006.1__GCF_002220235.fna.gz.boc.gz -rw-rw-r-- 1 nikolay nikolay 3108 aug 4 12:09 NC_024008.1__GCF_002220235.fna.gz.boc.gz -rw-rw-r-- 1 nikolay nikolay 43463 aug 4 12:09 PseudoReads_aligned_to_GCF_002220235.fna.gz.bam -rw-rw-r-- 1 nikolay nikolay 308 aug 4 12:09 PseudoReads_aligned_to_GCF_002220235.fna.gz.bam.csi
Here you can see a number of files. Probably the main file is coords_micr_contam_GCF_002220235.fna.gz.txt, this is the coordinates of microbial-like regions in BED-format (despite the file does not have *.bed - extension). The columns in this file have the following meaning: 1) name (id) of the eukaryotic reference genome profiles, 2) contig / scaffold / chromosome id within the eukaryotic reference genome containing microvbial-like region, 3) start coordinate of the detected microbial-like region, 4) end coordinate of the detected microbial-like region, 5) genomic length of the microbial-like region, 6) total number of reads aligned to the detected microbial-like region, 7) average number of reads supporting each position within the detected microbial-like region, 8) the next five columns represent top abundant microbial species for each detected microbial-like region (the number of reads is reported for each of the top abundant microbes); if fewer than five unique microbes are dicovered within the microbial-like region, the rest of the columns contain recods "NA_reads_NA".
head coords_micr_contam_GCF_002220235.fna.gz.txt | cut -f1-10
ORGANISM CONTIG START END LENGTH N_READS AVERAGE_DEPTH MICR1 MICR2 MICR3 GCF_002220235.fna.gz NC_023997.1 231657 231716 59 1 1 1_reads_UBA796_sp002707085 NA_reads_NA NA_reads_NA GCF_002220235.fna.gz NC_023997.1 231807 232156 349 26 4 26_reads_UBA796_sp002707085 NA_reads_NA NA_reads_NA GCF_002220235.fna.gz NC_023997.1 232752 232891 139 9 4 9_reads_UBA796_sp002707085 NA_reads_NA NA_reads_NA GCF_002220235.fna.gz NC_023997.1 232905 232964 59 1 1 1_reads_UBA796_sp002707085 NA_reads_NA NA_reads_NA GCF_002220235.fna.gz NC_023997.1 233015 233104 89 4 3 4_reads_UBA796_sp002707085 NA_reads_NA NA_reads_NA GCF_002220235.fna.gz NC_023997.1 233145 233369 224 13 3 13_reads_UBA796_sp002707085 NA_reads_NA NA_reads_NA GCF_002220235.fna.gz NC_023997.1 233373 233377 4 1 1 1_reads_UBA796_sp002707085 NA_reads_NA NA_reads_NA GCF_002220235.fna.gz NC_023997.1 233395 233494 99 3 2 3_reads_UBA796_sp002707085 NA_reads_NA NA_reads_NA GCF_002220235.fna.gz NC_023997.1 233515 233694 179 11 4 11_reads_UBA796_sp002707085 NA_reads_NA NA_reads_NA
Further, congigs within the eukaryotic reference genome sorted by the total number of aligned microbial pseuso-reads and contig-wide breadth of coverage are reported in contigs_abund_sorted_GTDB_GCF_002220235.fna.gz.tx and contigs_boc_sorted_GTDB_GCF_002220235.fna.gz.txt files, respectively.
head contigs_abund_sorted_GTDB_GCF_002220235.fna.gz.txt
ORGANISM CONTIG N_READS GCF_002220235.fna.gz NC_023997.1 347 GCF_002220235.fna.gz NC_024004.1 274 GCF_002220235.fna.gz NC_024008.1 221 GCF_002220235.fna.gz NC_023992.1 193 GCF_002220235.fna.gz NC_024006.1 186 GCF_002220235.fna.gz NC_024001.1 175 GCF_002220235.fna.gz NC_024000.1 123 GCF_002220235.fna.gz NC_024003.1 59
head contigs_boc_sorted_GTDB_GCF_002220235.fna.gz.txt
ORGANISM CONTIG N_READS LENGTH BOC GCF_002220235.fna.gz NC_023997.1 347 712459 0.00871348 GCF_002220235.fna.gz NC_023992.1 193 465570 0.00777327 GCF_002220235.fna.gz NC_024004.1 274 1019276 0.00525471 GCF_002220235.fna.gz NC_024008.1 221 1352724 0.00400451 GCF_002220235.fna.gz NC_024001.1 175 937610 0.00396433 GCF_002220235.fna.gz NC_024006.1 186 1091008 0.00389915 GCF_002220235.fna.gz NC_024000.1 123 895536 0.00294349 GCF_002220235.fna.gz NC_024003.1 59 989707 0.00161462
The N_READS coluumn in both files reports the number of microbial pseudo-reads aligned to each contig / scaffold / chromosome within the eukaryotic reference genome. In addition, the LENGTH and BOC columns in the "boc_sorted" file report the length of each contig / scaffold / chromosome in the eukaryotic reference and the breadth of coverage (BOC - fraction of reference nucleotides covered at least once), respectively. Finally, microbes_abundant_GTDB_GCF_002220235.fna.gz.txt file contains GTDB microbial species (their reference genome ids) sorted by their total abundance within the whole eukaryotic reference genome. Note that this file will be absent when screening a reference genome for human contamination since all aligned pseudo-reads belong to human in this case.
head microbes_abundant_GTDB_GCF_002220235.fna.gz.txt
1578 5oFfr3yp0G.fna
In this particular example, we had only one GTDB microbial species UBA796 sp002707085 (with the reference 5oFfr3yp0G.fna) belonging to the phylum Myxococcota in the GTDB databse. All the 1578 micobial pseudo-reads aligned to the green algae reference genome belonged to that single microbe. The bam-file containng the alignments is PseudoReads_aligned_to_GCF_002220235.fna.gz.bam, let us quickly check that there are 1578 aligned reads in total.
samtools view PseudoReads_aligned_to_GCF_002220235.fna.gz.bam | wc -l
1578
The *.boc.gz files contain the outputs from samtools depth for each contig / scaffold / chromosome within the eukaryotic reference genome. They are not very important and can be deleted for the sake of disk space.