Mini tutorial for the randomGLM R package
Lin Song, Steve Horvath
In this mini tutorial, we briefly show how to fit an RGLM predictor. We illustrate it using the brain cancer gene expression data used in [1].
1. Binary outcome prediction
library(randomGLM)
# load data
data(mini)
# training data set whose columns contain the features (here genes)
x = mini$x
# Since some of the column names are not unique, we change them
colnames(x) = make.names(colnames(x), unique = TRUE)
# outcome of the training data set
y = mini$yB
table(y)
Output:
y
0 1
28 27
# test data set whose columns must equal those of the training data set
xtest = mini$xtest
# make sure that the column names agree...
colnames(xtest) = make.names(colnames(xtest), unique = TRUE)
# outcome of the test data set
ytest = mini$yBtest
table(ytest)
Output:
ytest
 0  1
33 32
# Fit the RGLM predictor assuming that your computer has only 1 core (nThreads=1). Here we use the default values of all RGLM parameters. To learn how to choose different parameter values, please consult the tutorial RGLMparameterTuningTutorial.docx, which is posted on our webpage:
# http://labs.genetics.ucla.edu/horvath/RGLM/
RGLM = randomGLM(x, y, xtest, classify=TRUE, keepModels=TRUE, nThreads=1)
# out-of-bag prediction on the training data at the level of the response
predictedOOB = RGLM$predictedOOB
table(y, predictedOOB)
Output:
predictedOOB
y 0 1
0 24 4
1 6 21
Message: The OOB prediction is wrong for 4+6=10 observations.
Thus, the OOB estimate of the error rate is (4+6)/length(y) = 10/55 = 0.1818,
and the OOB estimate of the accuracy is 1 - 0.1818 = 0.8182.
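As a sanity check, the error rate and accuracy can be recomputed directly from the confusion table. The snippet below uses only base R and hard-codes the entries of the OOB confusion table shown above:

```r
# OOB confusion table from the output above (rows = true class y, columns = predicted class)
confusion = matrix(c(24, 6, 4, 21), nrow = 2,
                   dimnames = list(y = c("0", "1"), predictedOOB = c("0", "1")))
# Misclassified observations are the off-diagonal entries
nWrong = sum(confusion) - sum(diag(confusion))
errorRate = nWrong / sum(confusion)
accuracy = 1 - errorRate
round(c(errorRate = errorRate, accuracy = accuracy), 4)
```

The same arithmetic applies to the test set confusion table further below.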
# This is the test set prediction
predictedTest = RGLM$predictedTest
table(ytest, predictedTest)
Output:
predictedTest
ytest 0 1
0 28 5
1 3 29
Message: The test set prediction is wrong for 3+5=8 observations.
Thus, the test set error rate is (3+5)/length(ytest) = 8/65 = 0.1230769,
and the test set estimate of the accuracy is 1 - 0.1230769 = 0.8769231.
# Class probabilities in the test set.
predictedTest.response = RGLM$predictedTest.response
predictedTest.response
Output:
0 1
dat65_21484_21474 4.206930e-01 0.57930700
dat65_21486_21475 2.898402e-01 0.71015982
dat65_21488_21476 3.814986e-01 0.61850144
dat65_21490_21477 6.999995e-02 0.93000005
... (output truncated)
Message:
For each test set observation (rows) the output reports the probability of being class 0 or class 1. By thresholding these values, one obtains the predicted class outcome reported in RGLM$predictedTest. To choose a different threshold in the randomGLM function, consider the RGLM parameter thresholdClassProb (default value 0.5).
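To make the thresholding step concrete, here is a minimal base-R sketch. The class-1 probabilities below are made-up values (not the actual RGLM output); the sketch shows how a threshold such as thresholdClassProb turns probabilities into predicted class labels:

```r
# Hypothetical class-1 probabilities for four test observations (made-up values)
prob1 = c(0.579, 0.710, 0.618, 0.070)
# Default threshold (thresholdClassProb = 0.5)
predictedClass = ifelse(prob1 > 0.5, 1, 0)
predictedClass
# -> 1 1 1 0
# A stricter threshold changes the class calls
ifelse(prob1 > 0.65, 1, 0)
# -> 0 1 0 0
```

Raising the threshold trades sensitivity for specificity: fewer observations are called class 1, but with higher confidence.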
# variable importance measures
varImp = RGLM$timesSelectedByForwardRegression
# Create a data frame that reports the variable importance measure of each feature.
datvarImp = data.frame(
  Feature = as.character(dimnames(RGLM$timesSelectedByForwardRegression)[[2]]),
  timesSelectedByForwardRegression = as.numeric(RGLM$timesSelectedByForwardRegression))
# Report the 20 most significant features
datvarImp[rank(-datvarImp[,2],ties.method="first")<=20,]
Output:
Feature timesSelectedByForwardRegression
214 200839_s_at 10
299 200986_at 3
452 201280_s_at 4
973 202291_s_at 9
1141 202625_at 7
1224 202803_s_at 5
1285 202953_at 3
1711 204174_at 3
1860 204829_s_at 4
1903 205105_at 3
2210 208306_x_at 6
2631 209619_at 5
3000 212203_x_at 4
3622 215049_x_at 4
3781 217362_x_at 9
4134 218217_at 5
4145 218232_at 9
4607 220856_x_at 5
4626 220980_s_at 5
4914 38487_at 5
# Barplot of the importance measures for the 20 most important features
datVarImpSelected=datvarImp[rank(-datvarImp[,2],ties.method="first")<=20, ]
par(mfrow=c(1,1), mar=c(4,8,3,1))
barplot(datVarImpSelected[,2], horiz=TRUE, names.arg=datVarImpSelected[,1],
  xlab="Feature Importance", las=2, main="Most significant features for the RGLM",
  cex.axis=1, cex.main=1.5, cex.lab=1.5)
# indices of features selected into the model in bag 1
RGLM$featuresInForwardRegression[[1]]
Output:
X200660_at X202291_s_at X212145_at
Feature.1 119 973 2974
# Model coefficients in bag 1.
coef(RGLM$models[[1]])
Output:
(Intercept) X200660_at X202291_s_at X212145_at
7979.738216 2.002009 5.220940 -36.515803
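For a binary outcome, each bag's model is a logistic regression, so its coefficients translate into a class-1 probability through the logistic link. The sketch below applies the bag-1 coefficients shown above to a hypothetical expression profile (the feature values are made up for illustration):

```r
# Bag-1 coefficients copied from the output above
coefs = c(Intercept = 7979.738216, X200660_at = 2.002009,
          X202291_s_at = 5.220940, X212145_at = -36.515803)
# Hypothetical expression values for the three selected features (made-up)
xnew = c(X200660_at = 100, X202291_s_at = 150, X212145_at = 250)
# Linear predictor, then the inverse logit (plogis) gives the class-1 probability
eta = coefs["Intercept"] + sum(coefs[names(xnew)] * xnew)
prob1 = plogis(eta)
```

The RGLM prediction averages such per-bag probabilities across all bags before thresholding.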
2. Quantitative, continuous outcome prediction
library(randomGLM)
# load the data (they are part of the randomGLM package).
data(mini)
# prepare the training data
x = mini$x
# Since some of the column names are not unique, we change them
colnames(x) = make.names(colnames(x), unique = TRUE)
y = mini$yC
# prepare the test data
xtest = mini$xtest
colnames(xtest) = make.names(colnames(xtest), unique = TRUE)
ytest = mini$yCtest
# Fit the RGLM predictor
RGLM = randomGLM(x, y, xtest, classify=FALSE, keepModels=TRUE, nThreads=1)
# out-of-bag prediction at the level of response
predictedOOB.response = RGLM$predictedOOB.response
# test set prediction
predictedTest.response = RGLM$predictedTest.response
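For a continuous outcome there is no confusion table; prediction accuracy is typically summarized by the correlation between observed and predicted values, or by the root mean squared error. A minimal base-R sketch with made-up vectors (not the actual RGLM output):

```r
# Hypothetical observed and predicted test-set values (made-up for illustration)
observed  = c(2.1, 3.4, 1.8, 4.0, 2.9)
predicted = c(2.3, 3.1, 2.0, 3.7, 3.2)
# Pearson correlation between observed and predicted values
accuracyCor = cor(observed, predicted)
# Root mean squared error
rmse = sqrt(mean((observed - predicted)^2))
```

With the RGLM fitted above, the corresponding summaries would be cor(ytest, predictedTest.response) and sqrt(mean((ytest - predictedTest.response)^2)).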
# variable importance measures
varImp = RGLM$timesSelectedByForwardRegression
# Barplot of the importance measures for the 20 most important features
datvarImp = data.frame(
  Feature = as.character(dimnames(RGLM$timesSelectedByForwardRegression)[[2]]),
  timesSelectedByForwardRegression = as.numeric(RGLM$timesSelectedByForwardRegression))
datVarImpSelected=datvarImp[rank(-datvarImp[,2],ties.method="first")<=20, ]
par(mfrow=c(1,1), mar=c(4,8,3,1))
barplot(datVarImpSelected[,2], horiz=TRUE, names.arg=datVarImpSelected[,1],
  xlab="Feature Importance", las=2, main="Most significant features for the RGLM",
  cex.axis=1, cex.main=1.5, cex.lab=1.5)
# indices of features selected into the model of bag 1
RGLM$featuresInForwardRegression[[1]]
Output:
X218232_at X203175_at X200986_at X216913_s_at X32128_at X208451_s_at X216231_s_at X208885_at
Feature.1 4145 1378 299 3757 4865 2227 3713 2353
X220856_x_at X212588_at X202625_at X218831_s_at X208961_s_at X201041_s_at X201850_at X212119_at
Feature.1 4607 3155
... (output truncated)
# Model coefficients of bag 1
coef(RGLM$models[[1]])
Output:
(Intercept) X218232_at X203175_at X200986_at X216913_s_at X32128_at X208451_s_at
-4.038310e+02 1.350274e-01 8.108334e-02 -3.471803e-02 2.447514e-01 8.666125e-01 7.484849e-02
X216231_s_at X208885_at X220856_x_at X212588_at X202625_at X218831_s_at X208961_s_at
-4.420324e-02 -3.588560e-02 1.852245e-02 -4.221644e-01 2.107064e-01 -2.589818e-01 1.694786e-01
X201041_s_at X201850_at X212119_at X201954_at X204682_at X204053_x_at X201887_at
-3.752358e-02 6.106237e-02 1.290596e-01 -2.070730e-01 -2.556034e-01 3.274343e-01 -3.154036e-01
X200660_at X214836_x_at X217947_at X211990_at X200620_at X202253_s_at X202237_at
1.046733e-01 -3.279971e-02 -8.911381e-02 9.909101e-02 2.114678e-01 4.784094e-01 1.220026e-01
X211963_s_at X213975_s_at X202833_s_at X201438_at X218473_s_at X208894_at X219059_s_at
-9.914986e-02 3.551280e-02 -1.968306e-01 3.784942e-02 -7.550249e-02 2.204820e-02 -1.191214e-01
X219505_at X202353_s_at X203882_at X217748_at X215121_x_at
9.010647e-02 -6.963570e-02 -8.129762e-03 3.327119e-02 2.239568e-03
References
1. Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 14:5. PMID: 23323760, DOI: 10.1186/1471-2105-14-5.