EnrichRBP basic usage flow

This example illustrates the basic usage of EnrichRBP, including loading the dataset, generating features, feature selection, training the model, and performance and feature analysis.

This example uses:

  • EnrichRBP.filesOperation

  • EnrichRBP.Features

  • EnrichRBP.evaluateClassifiers

  • EnrichRBP.metricsPlot

  • EnrichRBP.featureSelection

from EnrichRBP.filesOperation import read_fasta_file, read_label
from EnrichRBP.Features import generateDynamicLMFeatures, generateStaticLMFeatures, generateStructureFeatures, generateBPFeatures
from EnrichRBP.evaluateClassifiers import evaluateDLclassifers
from EnrichRBP.metricsPlot import violinplot, shap_interaction_scatter, shap_beeswarm
from EnrichRBP.featureSelection import cife
from sklearn.svm import SVC

Load the dataset

Load the AGO1 dataset as an example.

# Define the paths where the dataset is located.
fasta_path = '/home/wangyansong/wangyubo/EnrichRBP/src/RNA_datasets/circRNAdataset/AGO1/seq'
label_path = '/home/wangyansong/wangyubo/EnrichRBP/src/RNA_datasets/circRNAdataset/AGO1/label'

sequences = read_fasta_file(fasta_path)  # read the sequences from the given path
label = read_label(label_path)  # read the labels from the given path
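For orientation, the following is a minimal sketch of what reading FASTA-formatted sequences involves. It is not the implementation of read_fasta_file; the helper name parse_fasta, the two example records, and the labels are hypothetical, and the actual dataset files may be laid out differently.

```python
def parse_fasta(text):
    """Return the sequences found in FASTA-formatted text (hypothetical helper)."""
    sequences = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record; flush the previous one.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


# Two made-up records; the second is wrapped over two lines.
example = """>seq1
ACGUACGU
>seq2
GGCA
UUAG
"""
sequences = parse_fasta(example)
labels = [1, 0]  # made-up labels, one per sequence

# Sequences and labels must line up one-to-one before feature generation.
assert len(sequences) == len(labels)
print(sequences)
```

Checking that the number of sequences matches the number of labels right after loading is a cheap way to catch a mismatched pair of files early.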

Generate features for sequences

We generate three types of features as examples: biological features, semantic information, and secondary structure information. For the biological features, we generate positional gapped k-m-tuple pairs (PGKM) features. For the semantic information, the dynamic model processes each sequence as 4-mers, while the static model processes each sequence as 3-mers and uses fastText as the embedding model.

biological_features = generateBPFeatures(sequences, PGKM=True)  # generate biological features
bert_features = generateDynamicLMFeatures(sequences, kmer=4, model='/home/wangyansong/wangyubo/EnrichRBP/src/dynamicRNALM/circleRNA/pytorch_model_4mer')  # generate dynamic semantic information
static_features = generateStaticLMFeatures(sequences, kmer=3, model='/home/wangyansong/wangyubo/EnrichRBP/src/staticRNALM/circleRNA/circRNA_3mer_fasttext') # static semantic information
structure_features = generateStructureFeatures(fasta_path, script_path='/home/wangyansong/wangyubo/EnrichRBP/src/RNAplfold', basic_path='/home/wangyansong/wangyubo/EnrichRBP/src/circRNAdatasetAGO1', W=101, L=70, u=1)  # generate secondary structure information
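To make the kmer argument concrete, here is a small sketch of splitting a sequence into k-mers. It assumes overlapping k-mers with a stride of 1, which is a common convention for k-mer language models but is not confirmed here as what EnrichRBP does internally; the helper name to_kmers and the example sequence are hypothetical.

```python
def to_kmers(sequence, k):
    """Split a sequence into overlapping k-mers with stride 1 (assumed convention)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n yields n - k + 1 k-mers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq = "ACGUAC"  # made-up example sequence
three_mers = to_kmers(seq, 3)  # tokens for a 3-mer (static) model
four_mers = to_kmers(seq, 4)   # tokens for a 4-mer (dynamic) model
print(three_mers)
print(four_mers)
```

Under this convention, a larger k gives fewer tokens per sequence but a larger vocabulary (4^k possible k-mers for a four-letter alphabet).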

Perform feature selection to refine the biological features

We take the cife method as an example.

print(biological_features.shape)
refined_biological_features = cife(biological_features, label, num_features=10)  # refine the biological features using the cife feature selection method
print(refined_biological_features.shape)

The output of the above code is as follows:

(34636, 400)
(34636, 10)
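The shape change above (400 columns reduced to 10) is what any filter-style feature selection produces. The sketch below illustrates the idea with a plain mutual-information ranking on made-up discrete data. It is a simplified stand-in and not the cife method itself, which additionally accounts for redundancy between features; the helper names and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def select_top_k(features, labels, k):
    """Return the indices of the k columns with the highest MI with the labels."""
    n_columns = len(features[0])
    scores = []
    for j in range(n_columns):
        column = [row[j] for row in features]
        scores.append((mutual_information(column, labels), j))
    # Sort by score, highest first; ties are broken by the lower column index.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scores[:k]]


# Made-up toy data: 6 samples, 3 discrete features.
features = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
labels = [1, 1, 1, 0, 0, 0]

selected = select_top_k(features, labels, k=1)
refined = [[row[j] for j in selected] for row in features]
print(selected)
print(len(refined), len(refined[0]))
```

As in the EnrichRBP call, the number of rows is unchanged and only the number of columns shrinks to the requested count.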

Evaluate deep/machine learning classifiers

We take the deep learning models as an example.

evaluateDLclassifers(bert_features, folds=10, labels=label, file_path='./', shuffle=True) # 10-fold cross-validation of deep learning models using dynamic semantic information
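As a reminder of what folds=10 with shuffle=True means, the following sketch splits sample indices into shuffled folds. It is an illustration of k-fold cross-validation in general, not the splitting code used by evaluateDLclassifers (for example, whether that function stratifies by label is not stated here); the helper name k_fold_indices, the seed argument, and the sample count are hypothetical.

```python
import random


def k_fold_indices(n_samples, folds, shuffle=True, seed=0):
    """Yield (train_indices, test_indices) for each of `folds` splits."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Distribute the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, folds)
    start = 0
    for fold in range(folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_indices(n_samples=25, folds=10))
for train, test in splits:
    # Within a fold, no sample is used for both training and testing.
    assert not set(train) & set(test)
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

Each sample appears in exactly one test fold, so every sample contributes once to the reported metrics.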

After the function finishes running, a file named DL_evalution_metrics.csv is saved in the path specified by file_path, in the following format.

clf_name,metrics,metric_name
CNN,0.999745345,AUC
CNN,0.995525096,ACC
CNN,0.991053703,MCC
CNN,0.994975596,Recall
CNN,0.9955248,F1_Scores
RNN,0.998232352,AUC
RNN,0.988452081,ACC
RNN,0.976941921,MCC
RNN,0.983881982,Recall
RNN,0.988370228,F1_Scores
ResNet,0.999630465,AUC
ResNet,0.995900484,ACC
ResNet,0.991807785,MCC
ResNet,0.994295177,Recall
ResNet,0.995894144,F1_Scores
MLP,0.988894799,AUC
MLP,0.935463968,ACC
MLP,0.867734521,MCC
MLP,0.978133195,Recall
MLP,0.951769181,F1_Scores
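Because the file stores one metric per row, it can be convenient to regroup it by classifier. The sketch below does this with the standard csv module, using a few rows copied from the table above as an in-memory string; reading the saved file from disk instead is left as an assumption about where file_path points.

```python
import csv
import io

# A few rows copied from the table above; in practice, open the saved CSV file.
csv_text = """clf_name,metrics,metric_name
CNN,0.999745345,AUC
CNN,0.991053703,MCC
RNN,0.998232352,AUC
RNN,0.976941921,MCC
MLP,0.988894799,AUC
MLP,0.867734521,MCC
"""

# Regroup the long format into {classifier: {metric_name: value}}.
table = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    table.setdefault(row["clf_name"], {})[row["metric_name"]] = float(row["metrics"])

# Pick the classifier with the highest MCC among the rows loaded.
best = max(table, key=lambda name: table[name]["MCC"])
for name, metrics in table.items():
    print(name, metrics)
print("Highest MCC:", best)
```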

Visualize performance and feature analysis

As an example, we compute SHAP values for the features using an SVM trained on refined_biological_features.

clf = SVC(probability=True)
shap_beeswarm(refined_biological_features, label, clf, sample_size=(0, 10000), feature_size=(0, 10), image_path='./')  # plot the importance of the biological features in the SVM as a beeswarm plot

A shap_beeswarm.png file will be saved under the path given by image_path.

[Figure: SHAP beeswarm plot (shap_beeswarm.png)]