Snakemake workflow for PacBio WGS short and long variant calling, phasing and much more

pb_variants

A Snakemake 9 based pipeline for HiFi SNP, SV, and CNV calling, phasing, and more

Only PacBio HiFi data is supported for now.

!!THIS PIPELINE IS IN DEVELOPMENT AND EXPERIMENTAL, USE AT YOUR OWN RISK!!

what this tool aims to deliver:

- the newest tools best suited for HiFi data (HiFi only for now)
- singleton and trio analysis (trio support is planned)
- human-first (hg38 for now); other references should be possible but are untested

included tools:

  • DeepVariant or bcftools for SNP/indel calling
  • SNPs are used for phasing with WhatsHap and LongPhase
  • Paraphase
  • TRGT
  • HiFiCNV
  • pb-CpG-tools (uses the WhatsHap-phased .bam file)
  • mitorsaw (hg38 only)
  • Sniffles for SV calls, which are phased with LongPhase
  • sawfish for SVs and CNVs (results phased by sawfish)
  • mosdepth, MultiQC
  • pbmm2 for mapping
  • Kraken2 for contamination detection (downsamples massively; needs a Kraken2 database)
  • optional demultiplexing of the input; the files are not split per barcode
  • one .bam per sample for now
  • NanoCaller for phased SNP/indel calls
  • variants are annotated if configured through config.yaml:
    • SVs with sansa and AnnotSV
    • SNPs with VEP and SnpSift
    • both tested only with hg38 for now

how to run

  • make sure you have a conda environment active with Snakemake 9+ (called smk9 in runPipeline_local.sh); see the sketch after this list
    • this can also be achieved by running the included setupPipeline_hpc.sh
      • that script uses conda to create the env smk9 with Snakemake 9 already installed (see the file smk9.yaml)
  • cp/mv/ln your unmapped .bam file(s) into the root folder of this repository (pb_variants/.)
  • edit samplesheet.csv with your filename(s)
    • one sample per line; do not delete the header line
  • edit config.yaml to match your preferences and folder structure
  • make sure you are in an interactive terminal session inside screen / tmux or similar
  • run bash runPipeline_local.sh for local execution on single-server setups
  • run bash runPipeline.sh on the HPC
  • non-HPC users need to edit config.yaml, keep DeepVariant enabled and disable the HPC mode: use_deepvariant_hpc: True should only be set on the HPC (HILBERT); set it to False everywhere else
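A minimal sketch of those steps (the sample filename and paths are placeholders; setupPipeline_hpc.sh performs roughly the same environment setup):

```bash
# Create and activate the Snakemake 9 environment from the shipped env file
# (this is roughly what setupPipeline_hpc.sh does).
conda env create -n smk9 -f smk9.yaml
conda activate smk9

# Stage an unmapped HiFi .bam in the workflow root and list it in samplesheet.csv
cd pb_variants
ln -s /path/to/my_sample.hifi_reads.bam .   # placeholder filename

# Run locally on a single server ...
bash runPipeline_local.sh
# ... or on the HPC
bash runPipeline.sh
```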

DAG

This DAG was generated:

  • with the snakemake --rulegraph option (see the command below)
  • with demultiplexing disabled through a config.yaml setting
  • with DeepVariant enabled through a config.yaml setting
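
The rulegraph image below can be reproduced with something along these lines (assumes Graphviz's dot is available):

```bash
# Render the Snakemake rule graph as an image; dot comes from Graphviz.
snakemake --rulegraph | dot -Tpng > rulegraph.png
```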

(rulegraph image)

output files

  • the first step of the pipeline strips the kinetics data from the input .bam file while keeping the methylation data; this makes all following steps much faster without any real data loss (see the sketch after this list)
  • for each input sample:
    • mosdepth and Kraken2 (optional) reports, summarized with MultiQC
    • mapped .bam file haplotagged with WhatsHap and LongPhase
    • the WhatsHap-phased .bam is used for methylation track generation with pb-CpG-tools
    • .bed/.bw files with methylation tracks for IGV; use them together with the WhatsHap-phased output .bam file
    • .vcf(.gz) files for:
      • SNPs/indels from DeepVariant or bcftools, phased with WhatsHap and LongPhase
      • TRGT
      • Paraphase
      • mitorsaw
      • HiFiCNV
      • sawfish SVs/CNVs
      • SVs from Sniffles, phased with LongPhase
      • SNPs/indels from NanoCaller
    • phased SNPs and SVs are annotated with SnpSift, VEP, sansa, and AnnotSV
  • Snakemake report, rulegraph, and timestamped copies of the samplesheet and config.yaml
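
As an illustration of the kinetics-stripping step mentioned above (the pipeline's own rule may use a different tool; fi/fp/ri/rp are the PacBio kinetics tags and MM/ML carry the methylation calls):

```bash
# Illustration only: drop PacBio kinetics tags while keeping the methylation (MM/ML) tags.
samtools view -@ 8 -b \
  -x fi -x fp -x ri -x rp \
  -o sample.nokinetics.bam sample.hifi_reads.bam
```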

general workflow in short:

prep the files:

    • move all .bam files into the pb_variants folder (up to 100 at once is reasonable; more will certainly overload the HPC)
    • files should carry methylation data; kinetics data is not needed
    • edit config.yaml according to your needs; most settings are True/False switches. If unsure, keep the defaults but CHANGE THE OUTPUT DIRECTORY
    • edit samplesheet.csv: keep the first line, then list all .bam files you want to analyze in this run. Every listed file NEEDS TO BE IN THE FOLDER (see the sketch after this list)
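
A sketch of what the samplesheet might look like; keep the actual header line that ships with the repository (the column name shown here is only a placeholder):

```bash
# Assumed layout only; keep the original header line from the repository.
cat samplesheet.csv
# sample_bam                  <- placeholder header name
# patient01.hifi_reads.bam
# patient02.hifi_reads.bam
```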

now the actual pipeline execution:

    • start a screen session on the HPC
    • start an interactive job inside the screen session with at least 2 days of runtime
    • cd into the pb_variants folder
    • activate the correct conda environment: conda activate smk9
    • run the pipeline: bash runPipeline.sh (see the sketch below)
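
A hedged sketch of those steps; the qstat command mentioned below suggests a PBS-style scheduler, but the exact resource flags depend on your cluster:

```bash
# Keep the session alive across disconnects, then request an interactive job
# of at least 2 days (the qsub flags are cluster-specific and only illustrative).
screen -S pb_variants
qsub -I -l walltime=48:00:00

cd pb_variants
conda activate smk9
bash runPipeline.sh
```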

supervising the run:

  • check the screen regularly if you want; red output is bad (a few example monitoring commands follow this list)
  • qstat -u your_hpc_username if you want
  • check whether files are appearing in the output folder
  • check clusterlogs_your_hpc_username for errors
  • check outputfolder/logs for errors that occurred
  • if a multiqc_report.html is in the output folder, the pipeline is done
  • if the interactive job is done but the pipeline is not done yet, notify me
  • if no output files arrive after some time, notify me
  • if results look odd, notify me
  • if you want to understand more, notify me
  • if any output files (except log files) are 0 bytes, notify me
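
A few example commands for the checks above (the output folder path is a placeholder; use the one from your config.yaml):

```bash
qstat -u your_hpc_username                                   # your jobs on the scheduler
ls -lh /path/to/outputfolder                                 # are result files appearing?
grep -ril error /path/to/outputfolder/logs                   # rule logs mentioning errors
find /path/to/outputfolder -type f -size 0 ! -name '*.log'   # 0-byte outputs (excluding logs)
```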

transferring results:

  • you can transfer the complete output folder if you want; all results are stored in subfolders that should explain themselves (see the sketch below)
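
For example, run from your own machine (the host name and paths are placeholders):

```bash
# Pull the complete output folder from the HPC to the local machine.
rsync -avh --progress your_hpc_username@hpc-login:/path/to/outputfolder/ ./pb_variants_results/
```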


roadmap:

  • trio calling: DeepTrio, GLnexus

why this work is being done:
