Research Object Crate for Gene Fetch
Original URL:
https://workflowhub.eu/workflows/1341/ro_crate?version=1
# Gene_fetch
This tool fetches gene sequences from NCBI databases based on taxonomy IDs (taxids) or taxonomic information. It can retrieve both protein and nucleotide sequences for various genes, including protein-coding genes (e.g., cox1, cytb, rbcl, matk) and rRNA genes (e.g., 16S, 18S).
## Feature highlight
- Fetch protein and/or nucleotide sequences from NCBI GenBank database.
- Handles both direct nucleotide sequences and protein-linked nucleotide searches (CDS extraction includes fallback mechanisms for atypical annotation formats).
.
- Support for both protein-coding and rDNA genes.
- Single-taxid mode (-s/--single) for retrieving a specified number of target sequences for a particular taxon (default length thresholds are reduced (protein: 50aa, nucleotide: 100bp)).
- Customisable length filtering thresholds for protein and nucleotide sequences.
- Automatic taxonomy traversal: Uses fetched NCBI taxonomic lineage for a given taxid when sequences are not found at the input taxonomic level. I.e., Search at given taxid level (e.g., species), if no sequences are found, escalate species->phylum until a suitable sequence is found.
- Validates fetched sequence using higher taxonomy, avoiding potential taxonomic homonyms.
- Robust error handling, error and progress logging, and NCBI API rate limits (10 requests/second).
- Handles complex sequence features (e.g., complement strands, joined sequences, WGS entries) in addition to 'simple' cds extaction (if --type nucleotide/both). The tool avoids "unverified" sequences and WGS entries not containing sequence data (i.e. master records).
- 'Checkpointing': if a run fails/crashes, the script can be rerun using the same arguments and it will resume from where it stopped.
- When more than 50 matching sequences are found for a sample, the tool fetches summary information for all matches (using NCBI esummary API), orders them by length, and processes the top 10 longest sequences.
## Contents
- [Installation](#installation)
- [Usage](#usage)
- [Examples](#Examples)
- [Input](#input)
- [Output](#output)
- [Cluster](#running-gene_fetch-on-a-cluster)
- [Supported targets](#supported-targets)
- [Notes](#notes)
- [Benchmarking](#benchmarking)
- [Future developments](#future-developments)
- [Contributions and citation](#contributions-and-citations)
## Installation
First, clone the Gene Fetch GitHub repository to your current path, and enter the Gene Fetch installation directory
```bash
git clone https://github.com/bge-barcoding/gene_fetch
cd gene_fetch
```
Run the commands below to install the necessary dependencies and activate the Conda environment. [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) must be installed.
```bash
conda env create -n fetch -f fetch.yaml
conda activate fetch
```
Alternatively, you can install the dependencies below directly or in your own Conda environment
```
conda install python>=3.9 pip
pip install ratelimit>=2.2.1
pip install biopython>=1.80
```
## Usage
```bash
python gene_fetch.py -g/--gene --type -i/--in -o/--out
```
* `--h/--help`: Show help and exit.
### Required arguments
* `-g/--gene`: Name of gene to search for in NCBI GenBank database (e.g., cox1/16s/rbcl).
* `--type`: Sequence type to fetch; 'protein', 'nucleotide', or 'both' ('both' will initially search and fetch a protein sequence, and then fetches the corresponding nucleotide CDS for that protein sequence).
* `-i/--in`: Path to input CSV file containing sample IDs and TaxIDs (see [Input](#input) section below).
* `i2/--in2`: Path to alternative input CSV file containing sample IDs and taxonomic information for each sample (see [Input](#input) section below).
* `o/--out`: Path to output directory. The directory will be created if it does not exist.
* `e/--email` and `-k/--api-key`: Email address and associated API key for NCBI account. An NCBI account is required to run this tool (due to otherwise strict API limitations) - information on how to create an NCBI account and find your API key can be found [here](https://support.nlm.nih.gov/kbArticle/?pn=KA-05317).
####= Optional arguments
* `--protein_size`: Minimum protein sequence length filter. Applicable to mode 'normal' and 'single-taxid' search modes (default: 500).
* `--nucleotide_size`: Minimum nucleotide sequence length filter. Applicable to mode 'normal' and 'single-taxid' search modes (default: 1500).
* `s/--single`: Taxonomic ID for 'single-taxid' sequence search mode (`-i` and `-i2` ignored when run with `-s` mode). 'Single-taxid' mode will fetch all target gene or protein sequences on GenBank for a specific taxonomic ID.
* `--max-sequences`: Maximum number of sequences to fetch for a specific taxonomic ID (only applies when run in 'single-taxid' mode).
## Examples
Fetch both protein and nucleotide sequences for COI with default sequence length thresholds.
```
python gene_fetch.py -e your.email@domain.com -k your_api_key \
-g cox1 -o ./output_dir -i ./samples.csv \
--type both
```
Fetch rbcL nucleotide sequences using sample taxonomic information, applying a minimum nucleotide sequence length of 1000bp
```
python gene_fetch.py -e your.email@domain.com -k your_api_key \
-g rbcl -o ./output_dir -i2 ./taxonomy.csv \
--type nucleotide --nucleotide_size 1000
```
Retrieve 1000 available matK protein sequences >400aa for _Arabidopsis thaliana_ (taxid: 3702).
```
python gene_fetch.py -e your.email@domain.com -k your_api_key \
-g matk -o ./output_dir -s 3702 \
--type protein --protein_size 400 --max-sequences 1000
```
## Input
**Example 'samples.csv' input file (-i/--in)**
| ID | taxid |
| --- | --- |
| sample-1 | 177658 |
| sample-2 | 177627 |
| sample-3 | 3084599 |
**Example 'samples_taxonomy.csv' input file (-i2/--in2)**
| ID | phylum | class | order | family | genus | species |
| --- | --- | --- | --- | --- | --- | --- |
| sample-1 | Arthropoda | Insecta | Diptera | Acroceridae | Astomella | Astomella hispaniae |
| sample-2 | Arthropoda | Insecta | Hemiptera | Cicadellidae | Psammotettix | Psammotettix sabulicola |
| sample-3 | Arthropoda | Insecta | Trichoptera | Limnephilidae | Dicosmoecus | Dicosmoecus palatus |
## Output
### 'Normal' mode
```
output_dir/
├── nucleotide/ # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│ ├── sample-1_dna.fasta
│ ├── sample-2_dna.fasta
│ └── ...
├── sample-1.fasta # Protein sequences.
├── sample-2.fasta
├── sequence_references.csv # Sequence metadata.
├── failed_searches.csv # Failed search attempts (if any).
└── gene_fetch.log # Log.
```
**sequence_references.csv output example**
| ID | taxid | protein_accession | protein_length | nucleotide_accession | nucleotide_length | matched_rank | ncbi_taxonomy | reference_name | protein_reference_path | nucleotide_reference_path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sample-1 | 177658 | AHF21732.1 | 510 | KF756944.1 | 1530 | genus:Apatania | Eukaryota; ...; Apataniinae; Apatania | sample-1 | abs/path/to/protein_references/sample-1.fasta | abs/path/to/protein_references/sample-1_dna.fasta |
| sample-2 | 2719103 | QNE85983.1 | 518 | MT410852.1 | 1557 | species:Isoptena serricornis | Eukaryota; ...; Chloroperlinae; Isoptena | sample-2 | abs/path/to/protein_references/sample-2.fasta | abs/path/to/protein_references/sample-2_dna.fasta |
| sample-3 | 1876143 | YP_009526503.1 | 512 | NC_039659.1 | 1539 | genus:Triaenodes | Eukaryota; ...; Triaenodini; Triaenodes | sample-3 | abs/path/to/protein_references/sample-3.fasta | abs/path/to/protein_references/sample-3_dna.fasta |
### 'Single-taxid' mode
```
output_dir/
├── nucleotide/ # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│ ├── ACCESSION1_dna.fasta
│ ├── ACCESSION2_dna.fasta
│ └── ...
├── ACCESSION1.fasta # Protein sequences.
├── ACCESSION2.fasta
├── fetched_nucleotide_sequences.csv # Only populated if '--type nucleotide/both' utilised. Sequence metadata.
├── fetched_protein_sequences.csv # Only populated if '--type protein/both' utilised. Sequence metadata.
├── failed_searches.csv # Failed search attempts (if any).
└── gene_fetch.log # Log.
```
**fetched_protein|nucleotide_sequences.csv output example**
| ID | taxid | Description |
| --- | --- | --- |
| PQ645072.1 | 1501 | Ochlerotatus nigripes isolate Pool11 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PQ645071.1 | 1537 | Ochlerotatus nigripes isolate Pool10 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PQ645070.1 | 1501 | Ochlerotatus impiger isolate Pool2 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PQ645069.1 | 1518 | Ochlerotatus impiger isolate Pool1 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PP355486.1 | 581 | Aedes scutellaris isolate NC.033 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
## Running gene_fetch on a cluster
- See '1_gene_fetch.sh' for running gene_fetch.py on a HPC cluster (SLURM job schedular).
- Edit 'mem' and/or 'cpus-per-task' to set memory and CPU/threads allocation.
- Change paths and variables as needed.
- Run '1_gene_fetch.sh' with:
```
sbatch 1_gene_fetch.sh
```
## Supported targets
Gene Fetch does function with other targets than those listed below, but it has hard-coded name variations and 'smarter' searching for the below targets. More targets can be added into script (see 'class config').
- cox1/COI/cytochrome c oxidase subunit I
- cox2/COII/cytochrome c oxidase subunit II
- cox3/COIIIcytochrome c oxidase subunit III
- cytb/cob/cytochrome b
- nd1/NAD1/NADH dehydrogenase subunit 1
- nd2/NAD2/NADH dehydrogenase subunit 2
- rbcL/RuBisCO/ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit
- matK/maturase K/maturase type II intron splicing factor
- 16S ribosomal RNA/16s
- SSU/18s
- LSU/28s
- 12S ribosomal RNA/12s
- ITS (ITS1-5.8S-ITS2)
- ITS1/internal transcribed spacer 1
- ITS2/internal transcribed spacer 2
- tRNA-Leucine/trnL
## Benchmarking
| Sample Description | Run Mode | Target | Input File | Data Type | Memory | CPUs | Run Time |
|--------------------|----------|--------|------------|-----------|--------|------|----------|
| 570 Arthropod samples | Normal | COX1 | taxonomy.csv | Both | 10G | 18 | 02:51:06 |
| 570 Arthropod samples | Normal | COX1 | samples.csv | Nucleotide | 5G | 4 | 02:04:01 |
| 570 Arthropod samples | Normal | COX1 | samples.csv | Protein | 5G | 4 | 01:50:31 |
| 570 Arthropod samples | Normal | 18S | samples.csv | Nucleotide | 10G | 8 | 01:38:16 |
| 570 Arthropod samples | Normal | ND1 | samples.csv | Nucleotide | 10G | 4 | 01:58:35 |
| All (159) _A. thaliana_ sequences >300aa | Single-taxid | rbcL | N/A | Protein | 5G | 1 | 00:02:39 |
| 1000 Culicidae sequences >500bp | Single-taxid | COX1 | N/A | nucleotide | 20G | 16 | 00:30:36 |
| 1000 _M. tubercolisis_ sequences | Single-taxid | 16S | N/A | nucleotide | 20G | 16 | 00:10:33 |
## Future Development
- Add optional alignment of retrieved sequences
- Add support for direct GenBank submission format output
- Enhance LRU caching for taxonomy lookups to reduce API calls
- Further improve efficiency of record searching and selecting the longest sequence
- Add support for additional genetic markers beyond the currently supported set
## Contributions and citations
GeneFetch was written by Dan Parsons & Ben Price @ NHMUK (2024).
If you use GeneFetch, please cite our publication: **XYZ()**
If you have any questions or suggested improvements, please do get in touch in the issues!
- Author
- Dan Parsons, Ben Price
- License
- MIT
Contents
Main Workflow:
Gene Fetch
Size: 120945 bytes
Main Workflow Description:
README.md
Size: 12131 bytes