Performance and data visualization examples

EnrichRBP integrates a variety of visualization functions. They support correlation analysis of feature data and plotting of the performance metrics obtained from model evaluation.

Performance metrics visualization

This section gives examples of the performance metrics visualization functions in EnrichRBP. Note that these functions require yellowbrick, seaborn and scikit-learn to be installed.

Data preparation

Here we prepare the relevant features as well as models to be used in different plot functions.

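import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# EarlyStopping and to_categorical come from Keras; the tensorflow.keras import path is an assumption
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical
# read_fasta_file, read_label, the feature generators, createCNN/createRNN and the plot functions
# are provided by EnrichRBP and must be imported from the corresponding EnrichRBP modules

fasta_path = '/home/wangyansong/wangyubo/EnrichRBP/src/RNA_datasets/circRNAdataset/AGO1/seq'
label_path = '/home/wangyansong/wangyubo/EnrichRBP/src/RNA_datasets/circRNAdataset/AGO1/label'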

sequences = read_fasta_file(fasta_path)  # read sequences and labels from given path
label = read_label(label_path)

# Generate dynamic semantic information for training deep learning models
dynamic_semantic_information = generateDynamicLMFeatures(sequences, kmer=4, model='/home/wangyansong/wangyubo/EnrichRBP/src/dynamicRNALM/circleRNA/pytorch_model_4mer')

# Generate biological features for training machine learning classifiers
biological_features = generateBPFeatures(sequences, PGKM=True)

# create CNN and RNN models for plots.
CNN_model = createCNN(dynamic_semantic_information.shape[1], dynamic_semantic_information.shape[2])
RNN_model = createRNN(dynamic_semantic_information.shape[1], dynamic_semantic_information.shape[2])

# create several machine learning classifiers for plots.
ML_Classifiers = [
    LogisticRegression(max_iter=10000),
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    GaussianNB(),
    SVC(probability=True)
]
# As example data, we load the performance metrics obtained in an earlier evaluation of machine learning classifiers
ml_metric_data = pd.read_csv('/home/wangyansong/wangyubo/EnrichRBP/src/ML_evalution_metrics.csv')

callbacks = [EarlyStopping(monitor='val_loss', patience=5, verbose=2, mode='min', restore_best_weights=True)]
labels_2D = to_categorical(label)
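The one-hot encoding produced by to_categorical, and the reason later examples index column 1 or take an argmax, can be illustrated without Keras. The following is a minimal pure-Python sketch; one_hot and argmax_rows are hypothetical helper names used only for illustration and are not part of EnrichRBP or Keras.

```python
def one_hot(labels, num_classes=2):
    """Return one row per label with a 1.0 in the column of its class."""
    return [[1.0 if c == y else 0.0 for c in range(num_classes)] for y in labels]


def argmax_rows(rows):
    """Return, for each row, the index of its largest entry."""
    return [max(range(len(r)), key=r.__getitem__) for r in rows]


label = [1, 0, 0, 1]                               # toy binary labels
labels_2d = one_hot(label)                         # one row per sample, one column per class
positive_column = [row[1] for row in labels_2d]    # same role as labels_2D[:, 1]

print(labels_2d)
print(positive_column)         # matches the original labels for binary data
print(argmax_rows(labels_2d))  # argmax also recovers the class index
```

For binary labels, column 1 of the one-hot matrix is the positive-class indicator, which is why the deep learning examples pass test_y[:, 1] to the plot functions.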

The format of the metric file is as follows:

                      clf_name    values metric_name
0           LogisticRegression  0.757671         AUC
1           LogisticRegression  0.686194         ACC
2           LogisticRegression  0.372397         MCC
3           LogisticRegression  0.690023      Recall
4           LogisticRegression  0.687357   F1_Scores
5         KNeighborsClassifier  0.708609         AUC
6         KNeighborsClassifier  0.651634         ACC
7         KNeighborsClassifier  0.303814         MCC
8         KNeighborsClassifier  0.622109      Recall
9         KNeighborsClassifier  0.641023   F1_Scores
10      DecisionTreeClassifier  0.583520         AUC
11      DecisionTreeClassifier  0.583583         ACC
12      DecisionTreeClassifier  0.167029         MCC
13      DecisionTreeClassifier  0.583673      Recall
14      DecisionTreeClassifier  0.583596   F1_Scores
15                  GaussianNB  0.724388         AUC
16                  GaussianNB  0.662461         ACC
17                  GaussianNB  0.326168         MCC
18                  GaussianNB  0.703995      Recall
19                  GaussianNB  0.675895   F1_Scores
20           BaggingClassifier  0.699751         AUC
21           BaggingClassifier  0.642049         ACC
22           BaggingClassifier  0.286901         MCC
23           BaggingClassifier  0.573204      Recall
24           BaggingClassifier  0.615563   F1_Scores
25      RandomForestClassifier  0.766152         AUC
26      RandomForestClassifier  0.693585         ACC
27      RandomForestClassifier  0.387366         MCC
28      RandomForestClassifier  0.710193      Recall
29      RandomForestClassifier  0.698591   F1_Scores
30          AdaBoostClassifier  0.742326         AUC
31          AdaBoostClassifier  0.675107         ACC
32          AdaBoostClassifier  0.350416         MCC
33          AdaBoostClassifier  0.690847      Recall
34          AdaBoostClassifier  0.680126   F1_Scores
35  GradientBoostingClassifier  0.764653         AUC
36  GradientBoostingClassifier  0.690264         ACC
37  GradientBoostingClassifier  0.381289         MCC
38  GradientBoostingClassifier  0.716291      Recall
39  GradientBoostingClassifier  0.698100   F1_Scores
40                         SVM  0.804761         AUC
41                         SVM  0.727653         ACC
42                         SVM  0.455588         MCC
43                         SVM  0.745526      Recall
44                         SVM  0.732425   F1_Scores
45  LinearDiscriminantAnalysis  0.758004         AUC
46  LinearDiscriminantAnalysis  0.687464         ACC
47  LinearDiscriminantAnalysis  0.375057         MCC
48  LinearDiscriminantAnalysis  0.691123      Recall
49  LinearDiscriminantAnalysis  0.688563   F1_Scores
50        ExtraTreesClassifier  0.768708         AUC
51        ExtraTreesClassifier  0.695433         ACC
52        ExtraTreesClassifier  0.391130         MCC
53        ExtraTreesClassifier  0.710470      Recall
54        ExtraTreesClassifier  0.699929   F1_Scores
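The table above is in "long" format: one row per (classifier, metric) pair. As a rough sketch of how such a file could be produced from per-classifier results, the following uses only the standard library. The results dictionary and the to_long_rows helper are hypothetical, and whether the real CSV also stores the leading index column is not confirmed by this page.

```python
import csv
import io

# Hypothetical per-classifier results (values copied from the table above).
results = {
    "LogisticRegression": {"AUC": 0.757671, "ACC": 0.686194, "MCC": 0.372397},
    "GaussianNB": {"AUC": 0.724388, "ACC": 0.662461, "MCC": 0.326168},
}


def to_long_rows(results):
    """Flatten {clf: {metric: value}} into one row per (clf, metric) pair."""
    return [
        {"clf_name": clf, "values": value, "metric_name": metric}
        for clf, metrics in results.items()
        for metric, value in metrics.items()
    ]


rows = to_long_rows(results)

# Write the rows as CSV text; a real script would write to a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["clf_name", "values", "metric_name"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping the classifier name and the metric name in separate columns is what lets the plot functions below group by x_id='clf_name' and read the numbers from y_id='values'.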

violin plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a violin plot.

# The x-axis is divided according to clf_name, and the various performance metrics are put together on the y-axis to draw a violin plot
violinplot(ml_metric_data, x_id='clf_name', y_id='values', image_path='./')

After the function finishes running, it will save a violinplot.png file in the path specified by image_path, as follows:

violinplot

box plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a box plot.

# The x-axis is divided according to clf_name, and the various performance metrics are put together on the y-axis to draw a box plot
boxplot(ml_metric_data, x_id='clf_name', y_id='values', image_path='./')

After the function finishes running, it will save a boxplot.png file in the path specified by image_path, as follows:

boxplot
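To make concrete what a box summarizes when all metrics of one classifier share the y-axis, the sketch below computes a five-number summary for the LogisticRegression rows of the metric table. It is an illustration with the standard library only; the exact quartile method used by the plotting backend is an assumption here.

```python
import statistics

# The five metric values of LogisticRegression from the metric table
# (AUC, ACC, MCC, Recall, F1_Scores).
values = [0.757671, 0.686194, 0.372397, 0.690023, 0.687357]

# Quartiles with linear interpolation between data points (assumed method).
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
summary = {
    "min": min(values),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(values),
}
for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```

Because MCC lies on a different scale than the other four metrics, it dominates the lower end of the distribution; this is worth keeping in mind when reading plots that pool several metrics per classifier.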

point plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a point plot.

# The x-axis is divided according to clf_name, and the various performance metrics are put together on the y-axis to draw a point plot
pointplot(ml_metric_data, x_id='clf_name', y_id='values', image_path='./')

After the function finishes running, it will save a pointplot.png file in the path specified by image_path, as follows:

pointplot

bar plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a bar plot.

# The x-axis is divided according to clf_name, and the various performance metrics are put together on the y-axis to draw a bar plot
barplot(ml_metric_data, x_id='clf_name', y_id='values', image_path='./')

After the function finishes running, it will save a barplot.png file in the path specified by image_path, as follows:

barplot

Plot ROC curve

This example shows how to use the EnrichRBP.metricsPlot module to plot ROC (receiver operating characteristic) curves.

Deep learning models

label_list = []
pred_proba_list = []
name_list = ['CNN', 'RNN']

# Divide the features into training and test sets in the ratio of 3:1
X_train, test_X, y_train, test_y = train_test_split(dynamic_semantic_information, labels_2D, test_size=0.25, random_state=6)

# Take 10% from the training set as the validation set
train_X, val_X, train_y, val_y = train_test_split(X_train, y_train, test_size=0.1, random_state=6)

# train CNN and RNN models
CNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
CNN_model.fit(x=train_X, y=train_y, epochs=30, batch_size=64, verbose=0, shuffle=True, callbacks=callbacks,
                      validation_data=(val_X, val_y))
pre_proba_CNN = CNN_model.predict(test_X)[:, 1]
test_y1 = test_y[:, 1]
label_list.append(test_y1)
pred_proba_list.append(pre_proba_CNN)

RNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
RNN_model.fit(x=train_X, y=train_y, epochs=30, batch_size=64, verbose=0, shuffle=True, callbacks=callbacks,
                      validation_data=(val_X, val_y))
pre_proba_RNN = RNN_model.predict(test_X)[:, 1]
test_y2 = test_y[:, 1]
label_list.append(test_y2)
pred_proba_list.append(pre_proba_RNN)

# plot the roc curve
roc_curve_deeplearning(label_list=label_list, pred_proba_list=pred_proba_list, name_list=name_list, image_path='./')

After the function finishes running, it will save a roc_curve.png file in the path specified by image_path, as follows:

roc_curve_deeplearning
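The two train_test_split calls above first hold out 25% of the data for testing and then take 10% of the remaining samples for validation, so the validation set is 7.5% of the full data set, not 10%. The sketch below works through this arithmetic; split_sizes is a hypothetical helper, and the rounding rule is an assumption about how fractional sizes are resolved.

```python
import math


def split_sizes(n_samples, test_size=0.25, val_size=0.1):
    """Return (train, validation, test) counts for two successive splits.

    Assumption: the held-out part of each split is rounded up (math.ceil),
    which mirrors how scikit-learn resolves a fractional test_size as far
    as we know; the remainder is kept for the other part.
    """
    n_test = math.ceil(test_size * n_samples)
    n_remaining = n_samples - n_test
    n_val = math.ceil(val_size * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(1000)
print(n_train, n_val, n_test)
```

With 1000 samples this gives 675 training, 75 validation and 250 test samples.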

Machine learning classifiers

For machine learning classifiers, there is no need to train the classifiers manually; simply pass the feature matrix, the labels and the classifiers to the function.

# Plot ROC curves for the previously created classifiers on the biological feature matrix, holding out 25% of the data as the test set.
roc_curve_machinelearning(biological_features, label, ML_Classifiers, image_path='./', test_size=0.25, random_state=6)

After the function finishes running, it will save a roc_curve.png file in the path specified by image_path, as follows:

roc_curve_machinelearning
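Both ROC functions above take true labels and predicted positive-class probabilities. As a rough illustration of what such a curve contains, the following sketch computes ROC points and the area under the curve in pure Python. roc_points and auc are hypothetical helpers written for this page, not the EnrichRBP implementation; in practice scikit-learn's roc_curve and roc_auc_score cover this.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points, one per distinct score threshold.

    Labels are 0/1 and both classes must be present.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # Sweep thresholds from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        is_last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if is_last_of_tie:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores)
print(points)
print(auc(points))
```

Each entry of label_list and pred_proba_list in the deep learning example corresponds to one such (labels, scores) pair, and name_list supplies the legend entry for its curve.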

Plot confusion matrix

This example shows how to use the EnrichRBP.metricsPlot module to plot the confusion matrix.

Deep learning models

# Divide the features into training and test sets in the ratio of 3:1
X_train, test_X, y_train, test_y = train_test_split(dynamic_semantic_information, labels_2D, test_size=0.25, random_state=6)

# Take 10% from the training set as the validation set
train_X, val_X, train_y, val_y = train_test_split(X_train, y_train, test_size=0.1, random_state=6)

# train CNN model for example
CNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
CNN_model.fit(x=train_X, y=train_y, epochs=30, batch_size=64, verbose=0, shuffle=True, callbacks=callbacks,
                      validation_data=(val_X, val_y))
pre_proba_CNN = CNN_model.predict(test_X)
pred_labels = np.argmax(pre_proba_CNN, axis=1)
test_labels = test_y[:, 1]

# plot the confusion matrix
confusion_matirx_deeplearning(test_labels=test_labels, pred_labels=pred_labels, image_path='./')

After the function finishes running, it will save a confusion_matrix.png file in the path specified by image_path, as follows:

confusion_matrix_deeplearning
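A confusion matrix is built from pairs of true and predicted labels, and the four counts also yield the ACC, MCC, Recall and F1 values listed in the metric table. The sketch below uses the usual binary definitions; confusion_counts and summary_metrics are hypothetical helpers, and whether EnrichRBP computes its metrics exactly this way (for example, with averaging) is not confirmed by this page.

```python
import math


def confusion_counts(true_labels, pred_labels):
    """Return (tn, fp, fn, tp) for binary labels coded as 0 and 1."""
    tn = fp = fn = tp = 0
    for y, p in zip(true_labels, pred_labels):
        if y == 1 and p == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def summary_metrics(tn, fp, fn, tp):
    """Accuracy, recall, F1 and MCC from the four counts (no zero guards)."""
    acc = (tp + tn) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den
    return {"ACC": acc, "Recall": recall, "F1": f1, "MCC": mcc}


test_labels = [1, 1, 1, 0, 0, 0, 0, 1]   # toy ground truth
pred_labels = [1, 0, 1, 0, 1, 0, 0, 1]   # toy predictions
counts = confusion_counts(test_labels, pred_labels)
metrics = summary_metrics(*counts)
print(counts)
print(metrics)
```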

Machine learning classifiers

# select the LogisticRegression for example
clf = ML_Classifiers[0]

# the test set ratio is set to 0.25 for plotting confusion matrix
confusion_matrix_machinelearning(clf, biological_features, label, test_size=0.25, normalize=None, random_state=6, image_path='./')

After the function finishes running, it will save a without_normalization_confusionMatrix.png file in the path specified by image_path, as follows:

confusion_matrix_ML_without_normalization

When normalize is set to ‘true’, ‘pred’ or ‘all’, the resulting image is as follows (file name is normalization_confusionMatrix.png):

confusion_matrix_ML_normalization
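The three normalization modes differ only in the denominator. Following the scikit-learn convention, 'true' normalizes over rows (true classes), 'pred' over columns (predicted classes), and 'all' over the whole matrix. A minimal sketch with a hypothetical normalize_confusion helper and made-up counts:

```python
def normalize_confusion(matrix, mode):
    """Normalize a square count matrix (rows = true class, columns = predicted).

    'true' divides each row by its sum, 'pred' divides each column by its
    sum, and 'all' divides every cell by the grand total.
    """
    n = len(matrix)
    if mode == "true":
        return [[cell / sum(row) for cell in row] for row in matrix]
    if mode == "pred":
        col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
        return [[matrix[r][c] / col_sums[c] for c in range(n)] for r in range(n)]
    if mode == "all":
        total = sum(sum(row) for row in matrix)
        return [[cell / total for cell in row] for row in matrix]
    raise ValueError(f"unknown mode: {mode!r}")


counts = [[30, 10],   # true class 0: 30 predicted as 0, 10 predicted as 1
          [5, 55]]    # true class 1: 5 predicted as 0, 55 predicted as 1
for mode in ("true", "pred", "all"):
    print(mode, normalize_confusion(counts, mode))
```

With 'true', the diagonal shows per-class recall; with 'pred', it shows per-class precision.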

Plot DET curve

This example shows how to use the EnrichRBP.metricsPlot module to plot DET (detection error tradeoff) curves.

Deep learning models

label_list = []
pred_proba_list = []
name_list = ['CNN', 'RNN']

# Divide the features into training and test sets in the ratio of 3:1
X_train, test_X, y_train, test_y = train_test_split(dynamic_semantic_information, labels_2D, test_size=0.25, random_state=6)

# Take 10% from the training set as the validation set
train_X, val_X, train_y, val_y = train_test_split(X_train, y_train, test_size=0.1, random_state=6)

# train CNN and RNN models
CNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
CNN_model.fit(x=train_X, y=train_y, epochs=30, batch_size=64, verbose=0, shuffle=True, callbacks=callbacks,
                      validation_data=(val_X, val_y))
pre_proba_CNN = CNN_model.predict(test_X)[:, 1]
test_y1 = test_y[:, 1]
label_list.append(test_y1)
pred_proba_list.append(pre_proba_CNN)

RNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
RNN_model.fit(x=train_X, y=train_y, epochs=30, batch_size=64, verbose=0, shuffle=True, callbacks=callbacks,
                      validation_data=(val_X, val_y))
pre_proba_RNN = RNN_model.predict(test_X)[:, 1]
test_y2 = test_y[:, 1]
label_list.append(test_y2)
pred_proba_list.append(pre_proba_RNN)

# plot the det curve
det_curve_deeplearning(label_list, pred_proba_list, name_list, image_path='./')

After the function finishes running, it will save a det_curve.png file in the path specified by image_path, as follows:

det_curve_deeplearning
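A DET curve plots the false negative rate against the false positive rate as the decision threshold varies, usually on a normal-deviate scale. The following sketch computes the raw rate pairs in pure Python; det_points is a hypothetical helper and leaves out the axis transformation.

```python
def det_points(labels, scores):
    """Return (threshold, fpr, fnr) for every distinct score used as threshold.

    A sample is predicted positive when its score is >= the threshold.
    Labels are 0/1 and both classes must be present.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = []
    for threshold in sorted(set(scores)):
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
        points.append((threshold, fp / n_neg, fn / n_pos))
    return points


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
det = det_points(labels, scores)
for threshold, fpr, fnr in det:
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```

Raising the threshold trades false positives for false negatives, which is the tradeoff the DET plot makes visible.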

Machine learning classifiers

For machine learning classifiers, there is no need to train the classifiers manually; simply pass the feature matrix, the labels and the classifiers to the function.

det_curve_machinelearning(biological_features, label, ML_Classifiers, image_path='./', test_size=0.25, random_state=6)

After the function finishes running, it will save a det_curve.png file in the path specified by image_path, as follows:

det_curve_machinelearning

Plot precision-recall curve

This example shows how to use the EnrichRBP.metricsPlot module to plot precision-recall curves.

Deep learning models

label_list = []
pred_label_list = []
name_list = ['CNN', 'RNN']

# Divide the features into training and test sets in the ratio of 3:1
X_train, test_X, y_train, test_y = train_test_split(dynamic_semantic_information, labels_2D, test_size=0.25, random_state=6)

# Take 10% from the training set as the validation set
train_X, val_X, train_y, val_y = train_test_split(X_train, y_train, test_size=0.1, random_state=6)

# train CNN and RNN models
CNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
CNN_model.fit(x=train_X, y=train_y, epochs=30, batch_size=64, verbose=0, shuffle=True, callbacks=callbacks,
                      validation_data=(val_X, val_y))
pre_proba_CNN = CNN_model.predict(test_X)
test_y1 = test_y[:, 1]
label_list.append(test_y1)
pred_label_list.append(np.argmax(pre_proba_CNN, axis=1))

RNN_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
RNN_model.fit(x=train_X, y=train_y, epochs=30, batch_size=64, verbose=0, shuffle=True, callbacks=callbacks,
                      validation_data=(val_X, val_y))
pre_proba_RNN = RNN_model.predict(test_X)
test_y2 = test_y[:, 1]
label_list.append(test_y2)
pred_label_list.append(np.argmax(pre_proba_RNN, axis=1))

# plot the precision recall curve
precision_recall_curve_deeplearning(label_list, pred_label_list, name_list, image_path='./')

After the function finishes running, it will save a precision_recall_curve.png file in the path specified by image_path, as follows:

precision_recall_curve_deeplearning

Machine learning classifiers

precision_recall_curve_machinelearning(biological_features, label, ML_Classifiers, image_path='./', test_size=0.25, random_state=6)

After the function finishes running, it will save a precision_recall_curve.png file in the path specified by image_path, as follows:

precision_recall_curve_machinelearning

Plot partial dependence

This example shows how to use the EnrichRBP.metricsPlot module to plot the partial dependence.

clf = ML_Classifiers[0]

# Plot the first six feature dimensions; if the features are stored in a DataFrame, you can pass specific feature names instead
partial_dependence(biological_features, label, clf, feature_names=[0, 1, 2, 3, 4, 5], image_path='./', random_state=6)

After the function finishes running, it will save a partial_dependence.png file in the path specified by image_path, as follows:

partial_dependence
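Partial dependence shows how the average prediction changes when one feature is forced to a given value while all other features keep their observed values. The sketch below illustrates the idea with a toy logistic model; predict_proba, partial_dependence_1d, the weights and the data are all made up for illustration and are unrelated to the EnrichRBP implementation.

```python
import math


def predict_proba(row, weights, bias):
    """Toy logistic model standing in for a fitted classifier."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


def partial_dependence_1d(rows, feature, grid, weights, bias):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = value      # overwrite one feature only
            total += predict_proba(modified, weights, bias)
        curve.append(total / len(rows))
    return curve


rows = [[0.2, 1.0], [0.5, -1.0], [0.9, 0.3]]   # toy feature matrix
weights, bias = [2.0, -0.5], -1.0               # made-up coefficients
grid = [0.0, 0.5, 1.0]
curve = partial_dependence_1d(rows, 0, grid, weights, bias)
print([round(v, 4) for v in curve])
```

Since the toy weight of feature 0 is positive, the curve rises along the grid; each panel of the saved figure shows one such curve for one of the selected features.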

Note

Currently, this function is only available for machine learning classifiers; support for deep learning models is planned for a later version.

Plot prediction error bar

This example shows how to use the EnrichRBP.metricsPlot module to plot a prediction error bar chart.

clf = ML_Classifiers[-1]
prediction_error(biological_features, label, classes=['positive', 'negative'], clf=clf, test_size=0.25, random_state=6, image_path='./')

After the function finishes running, it will save a prediction_error.png file in the path specified by image_path, as follows:

prediction_error

Plot discrimination threshold

This example shows how to use the EnrichRBP.metricsPlot module to plot a discrimination threshold figure. Note that the function name and the output file name use the spelling descrimination_threshold.

clf = ML_Classifiers[-1]
descrimination_threshold(biological_features, label, clf, image_path='./')

After the function finishes running, it will save a descrimination_threshold.png file in the path specified by image_path, as follows:

descrimination_threshold
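A discrimination threshold plot typically shows how precision, recall and F1 change as the probability cutoff for the positive class moves. The following is a small sketch of that sweep in pure Python; threshold_sweep is a hypothetical helper and the labels and scores are toy data.

```python
def threshold_sweep(labels, scores, thresholds):
    """Return one dict of precision, recall and F1 per threshold."""
    results = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
        # Guard against empty denominators at extreme thresholds.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"threshold": t, "precision": precision,
                        "recall": recall, "f1": f1})
    return results


labels = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]
sweep = threshold_sweep(labels, scores, [0.3, 0.5, 0.75])
best = max(sweep, key=lambda r: r["f1"])
print(best)
```

On this toy data, the lowest cutoff gives the best F1 because it recovers every positive at the cost of two false positives.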

Plot learning curve

This example shows how to use the EnrichRBP.metricsPlot module to plot a learning curve.

clf = ML_Classifiers[-1]
folds = 5
learning_curve(biological_features, label, folds, clf, image_path='./')

After the function finishes running, it will save a learning_curve.png file in the path specified by image_path, as follows:

learning_curve

Plot cross validation score

This example shows how to use the EnrichRBP.metricsPlot module to plot cross-validation scores.

clf = ML_Classifiers[-1]
folds = 5
cross_validation_score(folds=folds, clf=clf, features=biological_features, labels=label, image_path='./')

After the function finishes running, it will save a cv_score.png file in the path specified by image_path, as follows:

cv_score
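With folds = 5, the data is partitioned into five parts, and each part serves once as the test set while the other four are used for training. The sketch below shows plain, unshuffled k-fold index generation; k_fold_indices is a hypothetical helper, and the library may well use a stratified or shuffled variant instead.

```python
def k_fold_indices(n_samples, folds):
    """Yield (train_indices, test_indices) for plain, unshuffled k-fold."""
    base, extra = divmod(n_samples, folds)
    start = 0
    for fold in range(folds):
        size = base + (1 if fold < extra else 0)   # spread the remainder
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


splits = list(k_fold_indices(n_samples=12, folds=5))
print([len(test) for _, test in splits])
```

Every sample appears in exactly one test fold, so the figure contains one score per fold.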

Feature analysis plot

The functions in this section currently support machine learning classifiers only. Support for deep learning models is still in progress and is expected in a later version.

Shap bar plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a SHAP bar plot.

clf = ML_Classifiers[0]

# The SHAP bar plot is drawn using logistic regression; the first 100 samples and the first 10 feature dimensions are selected for the SHAP value calculation.
shap_bar(biological_features, label, clf, sample_size=(0, 100), feature_size=(0, 10), image_path='./')

After the function finishes running, it will save a shap_bar.png file in the path specified by image_path, as follows:

shap_bar
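For intuition about what the bars represent: for a linear model with independent features, the SHAP value of feature i on a sample is w_i * (x_i - mean(x_i)), and a SHAP bar plot shows the mean absolute value per feature. The sketch below applies this to toy data. linear_shap_values and mean_abs_importance are hypothetical helpers; the explainer EnrichRBP actually uses is not specified on this page, and for logistic regression such values would live in log-odds space.

```python
def linear_shap_values(rows, weights):
    """Per-sample, per-feature contributions w_i * (x_i - mean_i)."""
    n = len(rows)
    means = [sum(row[j] for row in rows) / n for j in range(len(weights))]
    return [
        [w * (x - m) for w, x, m in zip(weights, row, means)]
        for row in rows
    ]


def mean_abs_importance(shap_rows):
    """Mean absolute contribution per feature (the bar heights)."""
    n = len(shap_rows)
    return [sum(abs(row[j]) for row in shap_rows) / n
            for j in range(len(shap_rows[0]))]


rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 13.0]]   # toy feature matrix
weights = [0.5, -1.0]                             # made-up coefficients
shap_rows = linear_shap_values(rows, weights)
importance = mean_abs_importance(shap_rows)
print(shap_rows)
print(importance)
```

The contributions of one sample sum to the difference between its model output and the average output, which is the additivity property the waterfall plot visualizes.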

shap scatter plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a SHAP scatter plot.

clf = ML_Classifiers[-1]
shap_scatter(biological_features, label, clf, feature_id=3, sample_size=(0, 100), feature_size=(0, 10), image_path='./')

After the function finishes running, it will save a shap_scatter.png file in the path specified by image_path, as follows:

shap_scatter

shap waterfall plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a SHAP waterfall plot.

clf = ML_Classifiers[-1]
shap_waterfall(biological_features, label, clf, feature_id=2, sample_size=(0, 100), feature_size=(0, 10), image_path='./')

After the function finishes running, it will save a shap_waterfall.png file in the path specified by image_path, as follows:

shap_waterfall

shap interaction scatter plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a SHAP interaction scatter plot.

clf = ML_Classifiers[-1]
shap_interaction_scatter(biological_features, label, clf, sample_size=(0, 100), feature_size=(0, 10), image_path='./')

After the function finishes running, it will save a shap_interaction_scatter.png file in the path specified by image_path, as follows:

shap_interaction_scatter

shap beeswarm plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a SHAP beeswarm plot.

clf = ML_Classifiers[-1]
shap_beeswarm(biological_features, label, clf, sample_size=(0, 100), feature_size=(0, 10), image_path='./')

After the function finishes running, it will save a shap_beeswarm.png file in the path specified by image_path, as follows:

shap_beeswarm

shap heatmap plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a SHAP heatmap plot.

clf = ML_Classifiers[-1]
shap_heatmap(biological_features, label, clf, sample_size=(0, 100), feature_size=(0, 10), image_path='./')

After the function finishes running, it will save a shap_heatmap.png file in the path specified by image_path, as follows:

shap_heatmap

Note

Plotting these figures is very time-consuming because a SHAP explainer has to be trained before the SHAP feature analysis can be drawn, so please be patient.

feature heatmap plot

This example shows how to use the EnrichRBP.metricsPlot module to draw a feature heatmap.

# Draw a heatmap of the first 15 samples and the first 15 feature dimensions
sns_heatmap(biological_features, sample_size=(0, 15), feature_size=(0, 15), image_path='./')

After the function finishes running, it will save a sns_heatmap.png file in the path specified by image_path, as follows:

feature_heatmap
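The sample_size and feature_size tuples used throughout this section select a block of the feature matrix. Assuming they are half-open (start, stop) ranges, which is consistent with (0, 100) being described above as "the first 100 samples", the selection can be sketched as follows; submatrix is a hypothetical helper, not an EnrichRBP function.

```python
def submatrix(matrix, sample_size, feature_size):
    """Select rows sample_size[0]:sample_size[1] and the matching columns.

    Assumption: both tuples are half-open (start, stop) ranges.
    """
    s0, s1 = sample_size
    f0, f1 = feature_size
    return [row[f0:f1] for row in matrix[s0:s1]]


# Toy 4 x 5 feature matrix whose entries encode their own position.
features = [[10 * r + c for c in range(5)] for r in range(4)]
block = submatrix(features, sample_size=(0, 2), feature_size=(0, 3))
print(block)
```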