CLIP-seq Workflow
A Nextflow workflow for end-to-end processing of CLIP-seq data, supporting multiple CLIP protocols.
Overview
Starting from raw FASTQ files (or un-demultiplexed iCLIP data), the workflow processes reads through quality control, adapter trimming, rRNA removal, genome alignment, and UMI deduplication, then runs shoji to extract crosslink sites and produce per-sample and combined count matrices ready for differential binding analysis (see DEWSeq).
Workflow steps
- Demultiplexing (optional, iCLIP only)
- Quality Control
- UMI pre-processing (optional, R2-CLIP only)
- Adapter and Quality Trimming
- Fastq data sketching and similarity comparison
- rRNA filtering (optional)
- Alignment
- Contamination estimation
- UMI deduplication (optional)
- Downstream processing
- Final statistics report
- Per-sample read counts at each processing stage (raw → trimmed → rRNA-filtered → aligned → deduplicated), plus Kraken2 classification summary
Prerequisites
- Java 11+ (required by Nextflow)
- Nextflow - tested versions:
- One of the following for software environments:
- Apptainer (formerly Singularity) - recommended for HPC, see apptainer configs
- Conda / Mamba - see conda configs
⚠️ If using conda/mamba, ensure no active conda environment is loaded before launching Nextflow, as it can interfere with the JRE.
Reference files
The genome profiles (conf/genome/) contain organism specific reference configs.
ℹ️ See this note about creating genome specific configs
Before running, you will need to prepare and configure paths to:
| File | Used by | Description |
|---|---|---|
| STAR genome index | STAR | Build using STAR --runMode genomeGenerate against your genome FASTA or FASTA + GTF |
| GFF3 annotation | shoji | Gene annotation file (GENCODE). See this section about using annotation files from non GENCODE sources. |
| Genome FAI | tracks | FASTA index (.fa.fai) for the genome, used to set chromosome sizes for track generation (see samtools faidx) |
| rRNA FASTA | bbduk | Reference sequences for rRNA filtering |
| Kraken2 database | Kraken2 | Pre-built Kraken2 database (e.g. from https://benlangmead.github.io/aws-indexes/k2). See kraken2 section |
⚠️ Edit the relevant genome config (e.g. conf/genome/hsa.config or conf/genome/rDNA.config) to point to your local copies.
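The Genome FAI row above says the FASTA index is used to set chromosome sizes. As a minimal sketch of what that means, the snippet below reads the first two tab-separated columns of a `.fa.fai` file (sequence name and length, as written by samtools faidx). The function name and the example index text are hypothetical and only illustrate the file layout; this is not necessarily how the workflow reads the file.

```python
def chrom_sizes_from_fai(fai_text: str) -> dict[str, int]:
    """Return {sequence name: length} from the text of a .fa.fai file.

    A FAI line has five tab-separated columns:
    name, length, offset, line bases, line width.
    Only the first two are needed for chromosome sizes.
    """
    sizes = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        columns = line.split("\t")
        sizes[columns[0]] = int(columns[1])
    return sizes


# Hypothetical index content, for illustration only (not real chromosome lengths).
example_fai = "chrA\t1000\t6\t60\t61\nchrB\t250\t1030\t60\t61\n"
print(chrom_sizes_from_fai(example_fai))
```

Checking that the sequence names here match those in the STAR index and the GFF3 annotation is a cheap way to catch naming mismatches (for example `chr1` versus `1`) before a run.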
GFF3
⚠️ When using GFF3 files from sources other than GENCODE, the shoji parameters corresponding to gene id, name, type and, optionally, feature need to be supplied. See the shoji annotation documentation for a description of these parameters. In the hsa and rDNA configs, edit the variable `annotation_params` to fit the attribute names in the GFF3 file being used.
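To decide which attribute names to supply, it can help to list the keys that actually occur in column 9 of the GFF3 file. The sketch below counts attribute keys per feature type; the function name and the example line are hypothetical, and the snippet only inspects the file. It does not set any shoji parameter.

```python
from collections import Counter


def gff3_attribute_keys(lines):
    """Count attribute keys (column 9) for each feature type (column 3)."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a feature line (e.g. embedded FASTA)
        feature, attributes = columns[2], columns[8]
        keys = counts.setdefault(feature, Counter())
        for field in attributes.split(";"):
            if "=" in field:
                keys[field.split("=", 1)[0].strip()] += 1
    return counts


# Hypothetical feature line, for illustration only.
example = ["chrA\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene:G1;Name=abc;biotype=protein_coding"]
for feature, keys in gff3_attribute_keys(example).items():
    print(feature, dict(keys))
```

For a real file, pass an open file handle (for example `gzip.open(path, "rt")` for a compressed annotation) instead of the example list.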
Kraken2
⚠️ Kraken2 config parameters `db`, `nodes` and `names` are placeholder paths. Edit these to point to actual files before running the workflow.
| Parameter | Required file | Description |
|---|---|---|
| `db` | Kraken2 index file | See Kraken 2 index for a list of downloadable index files |
| `nodes` | NCBI taxonomy db nodes.dmp file | See this readme |
| `names` | NCBI taxonomy db names.dmp file | See this readme |
ℹ️ See this shell script for an example supplying these files using command line parameters
Built-in profiles
Supported protocols
| Profile | Sequencing type | Description |
|---|---|---|
| eCLIP | paired-end | ⚠️ two-step adapter trimming (cutadapt) and UMI deduplication |
| iCLIP | single-end | barcode demultiplexing + UMI extraction via flexbar |
| R2CLIP | paired-end | Read 2 is expected to contain only UMIs, and after UMI extraction Read 1 is processed as single-end |
| soniCLIP | single-end | no demultiplexing or deduplication |
⚠️ The current version of the eCLIP profile is designed to handle UMI-extracted reads available from the ENCODE portal.
Genomes
| Profile | Description |
|---|---|
| hsa | Human GRCh38 / GENCODE v42 primary assembly |
| rDNA | Human hg38 with rDNA-masked genome (for rRNA binding RBPs); rRNA trimming disabled by default. rDNA genomes for human and mouse are available from this reference |
ℹ️ it is also possible to skip creating/using genome configs altogether and supply these reference files using parameters. See this soniCLIP shell script template for an example
Run environments
| Profile | Description |
|---|---|
| apptainer | Runs processes inside Apptainer containers (paths should be configured separately - see conf/containers/README.md) |
| conda | Creates and caches conda environments per process (see conf/conda/ and conda config) |
| slurm | SLURM executor settings (see conf/run/embl_hd.config); adapt queue names and resource limits for your cluster |
Using profiles
Profiles are combined with commas. See nextflow.config for the full list.
nextflow run ... -profile slurm,apptainer,eCLIP,hsa
This runs the workflow on a SLURM cluster using Apptainer containers, the eCLIP protocol, and hg38 genome alignment.
ℹ️ The `slurm` profile is pre-configured for the EMBL Heidelberg HPC. For other SLURM clusters, copy conf/run/embl_hd.config, adjust queue names and resource parameters, and reference your copy in nextflow.config.
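When launching runs from a script, the comma-separated profile string can be assembled and sanity-checked before Nextflow is called. The sketch below uses the profile names from the tables above. The grouping and the "at most one per group" rule are assumptions made for illustration, not something enforced by the workflow, and `build_profile` is a hypothetical helper.

```python
# Profile names come from the tables above; the grouping rule is an assumption.
GROUPS = {
    "executor": {"slurm"},
    "environment": {"apptainer", "conda"},
    "protocol": {"eCLIP", "iCLIP", "R2CLIP", "soniCLIP"},
    "genome": {"hsa", "rDNA"},
}


def build_profile(*names: str) -> str:
    """Join profile names with commas, rejecting unknown or conflicting ones."""
    known = set().union(*GROUPS.values())
    unknown = [n for n in names if n not in known]
    if unknown:
        raise ValueError(f"unknown profile(s): {', '.join(unknown)}")
    for group, members in GROUPS.items():
        chosen = [n for n in names if n in members]
        if len(chosen) > 1:
            raise ValueError(f"more than one {group} profile: {', '.join(chosen)}")
    return ",".join(names)


profile = build_profile("slurm", "apptainer", "eCLIP", "hsa")
# /path/to/workflow is a placeholder for a local clone of this repository.
command = ["nextflow", "run", "/path/to/workflow", "-profile", profile]
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)`; building it as a list avoids shell quoting problems with paths that contain spaces.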
Workflow
Sample sheet format
This workflow uses the nf-schema plugin and its supported sample sheet format.
For the eCLIP, R2-CLIP and soniCLIP protocols, the following columns (in CSV) are expected:
eCLIP
eCLIP: fastq_2 column MUST be provided.
| sample | fastq_1 | fastq_2 |
|---|---|---|
| sample1 | /path/to/sample1_R1.fastq.gz | /path/to/sample1_R2.fastq.gz |
R2-CLIP
R2-CLIP: fastq_1 holds the actual reads, and fastq_2 is expected to contain only UMIs.
umi_tools extract is used to extract UMIs from fastq_2 (based on the parameter bc_pattern in the config file) and add them to the fastq_1 headers; the fastq_1 reads are then processed as regular single-end reads.
| sample | fastq_1 | fastq_2 |
|---|---|---|
| sample1 | /path/to/sample1_R1.fastq.gz | /path/to/sample1_R2.fastq.gz |
soniCLIP
soniCLIP: only uses fastq_1
| sample | fastq_1 |
|---|---|
| sample1 | /path/to/sample1.fastq.gz |
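The column requirements above can be checked before launching a run. The workflow itself validates the sheet through nf-schema, so the sketch below is only an optional pre-flight check; `check_sample_sheet` and the `REQUIRED` mapping are hypothetical names, and the required columns are taken from the three tables above.

```python
import csv
import io

# Required columns per protocol profile, as listed in the tables above.
REQUIRED = {
    "eCLIP": ["sample", "fastq_1", "fastq_2"],
    "R2CLIP": ["sample", "fastq_1", "fastq_2"],
    "soniCLIP": ["sample", "fastq_1"],
}


def check_sample_sheet(csv_text: str, protocol: str) -> list[str]:
    """Return a list of problems found in the sample sheet (empty if none)."""
    required = REQUIRED[protocol]
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in required if c not in header]
    if problems:
        return problems
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column in required:
            if not (row.get(column) or "").strip():
                problems.append(f"line {line_no}: empty {column}")
    return problems


sheet = "sample,fastq_1\nsample1,/path/to/sample1.fastq.gz\n"
print(check_sample_sheet(sheet, "soniCLIP"))
print(check_sample_sheet(sheet, "eCLIP"))
```

The check deliberately stops at column presence and empty cells; whether the FASTQ paths exist depends on where the run is launched, so that test is left out.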
iCLIP
For the iCLIP protocol, the following columns (in CSV) are expected:
| fastq | barcode |
|---|---|
| /path/to/run1.fastq.gz | /path/to/run1_barcode.fa |
- `fastq` column contains the path to the raw, un-demultiplexed fastq files.
- `barcode` column contains the path to the fasta file with barcodes for demultiplexing.
barcode fasta file format example:
>sample_1
NNNNATATATATNN
>sample_2
NNNNCGCGCGCGNN
ℹ️ flexbar is used for demultiplexing iCLIP data based on the provided barcodes with corresponding header as sample name. UMIs (Ns in the sequences) are extracted from the reads during demultiplexing and added to fastq header.
ℹ️ iCLIP fastq files that are already processed (demultiplexed and UMI extracted) can also be provided, using the same sample sheet format as for eCLIP/R2-CLIP/soniCLIP (with sample, fastq_1 columns) (see section eCLIP, R2-CLIP and soniCLIP).
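The barcode FASTA format described above mixes UMI positions (N) with the fixed sample barcode in one pattern. The sketch below separates the two for each entry and shows which read positions would make up the UMI. It illustrates the layout only and is not a description of how flexbar works internally; all function names are hypothetical, and the example read is made up.

```python
def read_barcodes(fasta_text: str) -> dict[str, str]:
    """Return {sample name: barcode pattern} from barcode FASTA text."""
    barcodes = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            barcodes[name] = ""
        elif name is not None:
            barcodes[name] += line.upper()
    return barcodes


def split_pattern(pattern: str) -> tuple[str, list[int]]:
    """Return (fixed barcode bases, 0-based positions of UMI bases)."""
    umi_positions = [i for i, base in enumerate(pattern) if base == "N"]
    barcode = "".join(base for base in pattern if base != "N")
    return barcode, umi_positions


def umi_from_read(read: str, pattern: str) -> str:
    """Collect the read bases that sit under the N positions of the pattern."""
    _, umi_positions = split_pattern(pattern)
    return "".join(read[i] for i in umi_positions)


fasta = ">sample_1\nNNNNATATATATNN\n>sample_2\nNNNNCGCGCGCGNN\n"
patterns = read_barcodes(fasta)
for sample, pattern in patterns.items():
    print(sample, split_pattern(pattern))
# Made-up read that starts with the sample_1 pattern.
print(umi_from_read("ACGTATATATATGGTTTT", patterns["sample_1"]))
```

Running `split_pattern` over a barcode file is also a quick way to confirm that the fixed barcodes are unique across samples before demultiplexing.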
Running the workflow
Pull the latest version of the workflow before running:
nextflow pull
Supply the URL of this repository to the command above (e.g. `https://github.com/your-org/clip-seq-nf`). The examples below use a local clone. To run directly from a remote URL, replace `/path/to/workflow` with the repository URL.
⚠️ Most of the example workflows below assume that there is a genome assembly config with appropriate paths and parameters in the genome folder and that this assembly is included in the nextflow config file.
eCLIP with human genome (hg38) on SLURM using conda
iCLIP with human genome (hg38) on SLURM using apptainer
soniCLIP with human genome (hg38) on SLURM using apptainer with custom shoji parameters
soniCLIP without using a genome config on SLURM and conda
ℹ️ The shell script above shows how to use custom genome files without adding a genome config.
Output
Given below is an example output directory structure from this pipeline.
ℹ️ The output directory is defined by the Nextflow `-output-dir` parameter, and the files in this directory will be symbolic links to the files in the work directory, defined by the Nextflow parameter `-work-dir`.
| Directory | Sub-directory | File | Description |
|---|---|---|---|
| Annotation | | | Shoji annotation files |
| Fastq | | | Fastq files after trimming |
| | rRNA_trim | | after rRNA read removal |
| | trim | | after adapter trimming |
| Genome_align | | | Genome alignments |
| | alignment | | bam files, alignment statistics,... |
| | mapped_fq | | mapped reads in fq format |
| | multimapped_fq | | multimapped reads in fq format |
| | unmapped_fq | | un-mapped reads in fq format |
| Kraken2 | | | Kraken 2 output directory |
| | contamination_check | | Kraken2 classification files and contamination reports |
| QC | | | QC files: fastqc and multiqc files |
| | raw | | raw data QC |
| | rRNA_trim | | QC after rRNA read removal |
| | trim | | QC after adapter trimming |
| Shoji | | | Shoji and related outputs |
| | counts | | count files from `shoji count` |
| | matrix | | Final output matrices for DEWSeq analysis |
| | sites | | bed formatted output files from `shoji extract` |
| | tracks | | .bw files for visualization |
| Sourmash | | | Sourmash files and plots |
| | align | | for aligned reads |
| | kraken2 | | after Kraken2 contamination estimation |
| | raw | | for raw reads |
| | rRNA_trim | | after rRNA read trimming |
| | trim | | after adapter trimming |
| Stats | | | Read count statistics |
| | | all_samples_combined_stats.csv | read count statistics for all samples from raw reads to alignment, deduplication (optional) and contamination estimation |
| | | ``_all_stats.json | per sample read count statistics in json format |
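Because the files in the output directory are symbolic links into the work directory, removing the work directory (for example with `nextflow clean`) would leave those links dangling. A minimal sketch for archiving results as real files is shown below; `materialize_outputs` is a hypothetical helper and not part of the workflow.

```python
import shutil
from pathlib import Path


def materialize_outputs(output_dir, archive_dir) -> Path:
    """Copy output_dir to archive_dir, replacing symlinks with the files they point to."""
    output_dir = Path(output_dir)
    archive_dir = Path(archive_dir)
    # Fail early with a clear message if any link target is already gone.
    dangling = [p for p in output_dir.rglob("*") if p.is_symlink() and not p.exists()]
    if dangling:
        raise FileNotFoundError(f"dangling links: {', '.join(map(str, dangling))}")
    # symlinks=False copies the content of linked files instead of the links.
    shutil.copytree(output_dir, archive_dir, symlinks=False)
    return archive_dir


# Example call (paths are placeholders):
# materialize_outputs("/path/to/output", "/path/to/archive")
```

The archive directory must not exist beforehand, which guards against silently mixing results from two runs.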
Developed at: Hentze Group, EMBL Heidelberg
Version History
main @ 771a6f9 (earliest) Created 24th Jun 2026 at 12:05 by Hentze group
Add MIT License
https://orcid.org/0000-0003-2480-0937