PanGIA
main @ daca690

Workflow Type: Python

PanGIA: A universal framework for identifying association between ncRNAs and diseases

PanGIA is a deep learning model for predicting ncRNA-disease associations.

Model Architecture

[Figure: PanGIA model architecture diagram]

Installation

conda create -n pangia python=3.11
conda activate pangia
pip install -r requirements.txt

Prepare Datasets

The raw data can be downloaded from the following sources:

  • miRNA: The associations between miRNAs and diseases were obtained from the HMDD v4.0 database, while the sequence information of miRNAs was retrieved from the miRBase database.

  • LncRNA/circRNA: The associations of lncRNAs and circRNAs with diseases were obtained from the LncRNADisease v3.0 database. circRNA sequences were retrieved from the circBase database, and lncRNA sequences were collected from two sources: GENCODE and NONCODE.

  • piRNA: The associations between piRNAs and diseases were obtained from the piRDisease v1.0 database, and the sequence information was retrieved from the piRBase and piRNAdb databases.

  • Disease: This study utilizes Disease Ontology Identifiers (DOIDs) to construct the disease similarity matrix, with corresponding information obtained from the Disease Ontology database.

These data are also organized in the ./data folder.

Quick Start

1. Data Preprocessing & Cleaning

Prepare the RNA sequence files for each RNA type (miRNA, piRNA, lncRNA, circRNA) in CSV format:

# Example format (no header):
# RNA_ID,Sequence
miR0001,AGCUUGGA...
miR0002,CGAUUAGC...
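
As a quick sanity check of the input format, the sketch below reads a headerless RNA_ID,Sequence file into a dictionary. The function name read_rna_sequences and the toy sequences are illustrative assumptions, not part of the repository.

```python
import csv
import io


def read_rna_sequences(handle):
    """Read headerless 'RNA_ID,Sequence' rows into a dict.

    Blank lines, comment lines starting with '#', and rows with fewer
    than two columns are skipped. (Hypothetical helper, not from PanGIA.)
    """
    sequences = {}
    for row in csv.reader(handle):
        if len(row) < 2 or row[0].lstrip().startswith("#"):
            continue
        sequences[row[0].strip()] = row[1].strip().upper()
    return sequences


# Toy input; real files contain full-length sequences.
example = io.StringIO("# RNA_ID,Sequence\nmiR0001,AGCUUGGA\nmiR0002,CGAUUAGC\n")
rna_seqs = read_rna_sequences(example)
```

To read a real file, pass an open file handle instead of the io.StringIO object.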

Run the script to perform global alignment of RNA sequences and compute their pairwise similarity:

python compute_RNA_similarity.py
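
To illustrate what global alignment followed by pairwise similarity can look like, here is a minimal Needleman-Wunsch sketch. The scoring values (match=1, mismatch=-1, gap=-1) and the normalization by the geometric mean of the self-alignment scores are assumptions made for this example; the actual script may use different settings.

```python
import math


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_similarity(seqs):
    """Symmetric similarity matrix with values in [0, 1].

    Assumed normalization: the raw score is clipped at 0 and divided by
    the geometric mean of the two self-alignment scores. Sequences must
    be non-empty.
    """
    self_scores = [global_alignment_score(s, s) for s in seqs]
    n = len(seqs)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            raw = max(0, global_alignment_score(seqs[i], seqs[j]))
            sim[i][j] = sim[j][i] = raw / math.sqrt(self_scores[i] * self_scores[j])
    return sim


toy_seqs = ["AGCU", "AGU", "AGCU"]  # toy sequences for illustration only
toy_sim = pairwise_similarity(toy_seqs)
```

The quadratic cost per pair makes this pure-Python version suitable only for small examples; for full datasets an optimized alignment library would be the practical choice.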

Next, merge the RNA sequence similarity matrices across all RNA types (miRNA, piRNA, lncRNA, circRNA) into a unified format for downstream analysis:

python merge_RNA_similarity_matrices.py

This script reads the normalized pairwise similarity matrices generated for each RNA type and combines them into a multi-view or unified similarity representation for further modeling.
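
One possible way to combine the per-type matrices is a block-diagonal layout, in which similarities between different RNA types are set to 0. The sketch below shows that layout only as an assumption; the function name merge_block_diagonal and the input structure are hypothetical, and the repository script may merge the matrices differently.

```python
def merge_block_diagonal(per_type):
    """Merge per-type similarity matrices into one block-diagonal matrix.

    per_type maps an RNA type name to a tuple (ids, square_matrix).
    Cross-type entries are filled with 0.0 (an assumption of this sketch).
    Returns (all_ids, merged_matrix).
    """
    all_ids = [rna_id for ids, _ in per_type.values() for rna_id in ids]
    n = len(all_ids)
    merged = [[0.0] * n for _ in range(n)]
    offset = 0
    for ids, sim in per_type.values():
        for r, row in enumerate(sim):
            for c, value in enumerate(row):
                merged[offset + r][offset + c] = float(value)
        offset += len(ids)
    return all_ids, merged


# Toy matrices for two RNA types.
toy_matrices = {
    "miRNA": (["miR0001", "miR0002"], [[1.0, 0.4], [0.4, 1.0]]),
    "piRNA": (["piR0001"], [[1.0]]),
}
merged_ids, merged_sim = merge_block_diagonal(toy_matrices)
```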

Next, run the script compute_disease_similarity.py to generate the disease ontology-based similarity matrix:

python compute_disease_similarity.py

This script calculates pairwise semantic similarities between diseases based on the Disease Ontology (DO) structure, and saves the resulting matrix to:

./data/d2d_do.csv
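
A common way to derive semantic similarity from an ontology DAG is the Wang-style semantic contribution scheme, in which each ancestor contributes a value that decays with its distance from the term. The sketch below assumes that scheme with a decay factor of 0.5; whether compute_disease_similarity.py uses exactly this method is not stated here, and the term names in the toy DAG are placeholders rather than real DOIDs.

```python
def semantic_contributions(term, parents, decay=0.5):
    """Contribution of `term` and each of its ancestors to the semantics of `term`.

    `parents` maps a term to its direct parent terms. The term itself
    contributes 1.0; an ancestor contributes the largest decayed value
    reachable over any path.
    """
    contrib = {term: 1.0}
    frontier = [term]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = decay * contrib[node]
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def disease_similarity(a, b, parents, decay=0.5):
    """Shared semantic contribution divided by the total semantic value of both terms."""
    ca = semantic_contributions(a, parents, decay)
    cb = semantic_contributions(b, parents, decay)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))


# Toy DAG: two sibling terms under one root (placeholder names).
toy_parents = {"TERM_A": ["ROOT"], "TERM_B": ["ROOT"]}
sim_ab = disease_similarity("TERM_A", "TERM_B", toy_parents)
sim_aa = disease_similarity("TERM_A", "TERM_A", toy_parents)
```

In this toy DAG the two siblings share only the root, so their similarity is 1/3, while a term compared with itself scores 1.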

Run the following script to generate the binary association matrix between ncRNAs and diseases:

python generate_RD_adj.py

This script constructs the ncRNA–disease adjacency matrix based on known associations.
The output is a matrix where each row represents an ncRNA and each column represents a disease,
with entries marked as 1 if an association exists, and 0 otherwise.
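
The construction described above can be sketched as follows. The function name build_adjacency, the input format (a list of (ncRNA, disease) pairs), and the toy identifiers are assumptions for illustration.

```python
def build_adjacency(associations, rna_ids, disease_ids):
    """Binary ncRNA-by-disease matrix: 1 where a known association exists, else 0.

    Pairs that mention an unknown ncRNA or disease are ignored.
    """
    row_of = {rna: i for i, rna in enumerate(rna_ids)}
    col_of = {disease: j for j, disease in enumerate(disease_ids)}
    adj = [[0] * len(disease_ids) for _ in rna_ids]
    for rna, disease in associations:
        if rna in row_of and disease in col_of:
            adj[row_of[rna]][col_of[disease]] = 1
    return adj


# Toy identifiers for illustration only.
toy_rnas = ["miR0001", "miR0002"]
toy_diseases = ["disease_1", "disease_2", "disease_3"]
toy_pairs = [("miR0001", "disease_2"), ("miR0002", "disease_1"), ("miR0002", "disease_3")]
toy_adj = build_adjacency(toy_pairs, toy_rnas, toy_diseases)
```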

Next, pretrain Word2Vec embeddings for RNA k-mer segments using the following script:

python pretrain_RNA_kmer.py

This script tokenizes RNA sequences into k-mers, performs sliding-window segmentation, pads them to a unified length, and trains Word2Vec embeddings for each RNA type (miRNA, circRNA, lncRNA, piRNA).
The output includes:

  • gensim_feat__.npy: A dictionary containing
    • k-mer embedding matrix
    • padded k-mer ID sequences
    • segment-to-sequence mapping
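
The tokenization, segmentation, and padding steps can be sketched as below. The values k=3, window=4, stride=2, and the padding ID 0 are illustrative assumptions, as are all function names; the Word2Vec training step itself is omitted and would take the k-mer token lists as its input sentences.

```python
def kmers(seq, k=3):
    """Overlapping k-mers of a sequence (empty list if the sequence is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sliding_segments(tokens, window, stride):
    """Split a token list into windows; a final window covers any uncovered tail."""
    if len(tokens) <= window:
        return [tokens]
    segments = [tokens[s:s + window] for s in range(0, len(tokens) - window + 1, stride)]
    if (len(tokens) - window) % stride:
        segments.append(tokens[-window:])
    return segments


def encode_sequences(seqs, k=3, window=4, stride=2, pad_id=0):
    """Return (vocab, padded k-mer ID segments, segment-to-sequence mapping)."""
    vocab = {}          # k-mer -> integer ID, starting at 1 (0 is reserved for padding)
    padded = []         # one fixed-length ID list per segment
    seg_to_seq = []     # index of the source sequence for each segment
    for seq_index, seq in enumerate(seqs):
        for segment in sliding_segments(kmers(seq, k), window, stride):
            ids = [vocab.setdefault(token, len(vocab) + 1) for token in segment]
            padded.append(ids + [pad_id] * (window - len(ids)))
            seg_to_seq.append(seq_index)
    return vocab, padded, seg_to_seq


# Toy sequences for illustration only.
toy_vocab, toy_padded, toy_mapping = encode_sequences(["AGCUUGGA", "CGA"])
```

The segment-to-sequence mapping lets per-segment embeddings be pooled back into one representation per RNA.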

Run the programs in the build dataset folder sequentially to generate the cross-validation dataset.
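
For orientation, a generic k-fold split over known association pairs might look like the sketch below. The number of folds (5), the fixed seed, and the function name k_fold_splits are assumptions; the repository scripts may also sample negative pairs or split differently.

```python
import random


def k_fold_splits(pairs, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation over association pairs."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        yield train, test


# Toy (ncRNA index, disease index) pairs for illustration only.
toy_pairs_cv = [(r, d) for r in range(5) for d in range(2)]
toy_splits = list(k_fold_splits(toy_pairs_cv, k=5))
```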

2. Model Training

Use the processed similarity matrices and datasets to train the model:

python main.py

Because the network has high memory and GPU requirements, pass parameters to main.py to scale the network size to your available computational resources.

Version History

main @ daca690 (earliest) Created 19th Aug 2025 at 13:51 by qiankunzizairen Liu

Delete .DS_Store

