prioritylasso was developed for situations in which different types
of high dimensional omics-data in combination with clinical data are
available. We want to include these so called blocks of data together in
a prediction model. prioritylasso thereby tries to find a good
compromise between prediction accuracy and practical aspects. This is
done by letting the practitioner include prior knowledge in the form of
priorities, which define the importance of every block. The priorities
can be chosen in several different ways.
The model is successively calculated as LASSO-based, while the Linear
Predictor after each fit is taken as an Offset in the LASSO Regression
of the block with the next lowest priority. The Penalized Regression
models are computed via the R package glmnet
. Moreover, the
R package survival
is used when the outcome consists of
survival data. In addition to the main function, the package
prioritylasso
contains an extension of the standard
function named cvm_prioritylasso. It is also explained under practical
aspects in the following examples.
The data we will use for the examples was simulated and shall represent AML data with 4 blocks of variables. We first have to load the package which contains this data set.
To take a deeper look at the data, open the description through
?pl_data
. The different types of data are all stored
together. To get and define the block structure which implies the
priorities, we look at the names of the variables and the matrix
dimension.
## [1] 400 1029
## [1] "b1.1" "b1.2" "b1.3" "b1.4" "b2.1" "b2.2" "b2.3" "b2.4" "b2.5"
## [10] "b3.1" "b3.2" "b3.3" "b3.4" "b3.5" "b3.6" "b3.7" "b3.8" "b3.9"
## [19] "b3.10" "b3.11" "b3.12" "b3.13" "b3.14" "b3.15" "b3.16" "b3.17" "b3.18"
## [28] "b3.19" "b4.1" "b4.2"
## [1] "b4.997" "b4.998" "b4.999" "b4.1000" "pl_out"
We see that the last column contains the outcome. We will re-save the
outcome and the matrix of predictors separately and define the argument
blocks
afterwards. We will need that for the application of
prioritylasso where the indices have to correspond with the variables in
the predictor matrix.
pl_out <- pl_data[,1029]
pl_pred <- pl_data[,1:1028]
blocks <- list(bp1=1:4, bp2=5:9, bp3=10:28, bp4=29:1028)
For an easier interpretation, we can consider the first block as
clinical variables of most importance, the second block as other
clinical data, and the third block as mutations. These blocks are all
low-dimensional. That will later be an advantage regarding computation
time. Another type of data is high dimensional gene expression data.
Moreover, pl_out
consists of a binary outcome for 400
subjects. Before we go further we want to split the data into training
and validation set. The training set should be used to build the model
with prioritylasso while the validation data should later be used to
assess the prediction accuracy in an independent data set. Here we just
do a random split such that 2/3 of the data belongs to the training set
and 1/3 to the validation set.
set.seed(1234)
label <- sample(dim(pl_pred)[1],round(dim(pl_pred)[1]*(2/3)))
pl_train <- pl_pred[label,]
pl_val <- pl_pred[-label,]
pl_out_train <- pl_out[label]
pl_out_val <- pl_out[-label]
We can now run prioritylasso for the first time.
set.seed(1234)
pl1 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", type.measure = "auc",
blocks = blocks, standardize = FALSE)
Because we have a binary outcome, we specified
family = "binomial"
and type.measure = "auc"
.
The priority structure was given through the block definition in the
data set. An extended use of the block definition and the general
functionality of prioritylasso will be given in the following
sections.
If we want to specify another block order, we have to redefine the
argument blocks
. Let us assume that we do not want to
include the variables b1 first in the model, but b3 first, b1 second and
b2 third. That would be done by
blocks = list(bp1 = 10:28, bp2 = 1:4, bp3 = 5:9, bp4 = 29:1028)
.
So the position on the list indicates when the variables are considered
in the model or in other words, the priority. It is not necessary to
name the entries of the list, but it may help in avoiding confusion. We
chose the names such that “bp1”” means “block of priority 1”.
set.seed(1234)
pl2 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", type.measure = "auc",
block1.penalization = TRUE, blocks = list(10:28, 1:4, 5:9, 29:1028),
standardize = FALSE)
pl2$nzero
## [[1]]
## [1] 17
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] 23
The output lists are to be interpreted according to the definition in
blocks
. Therefore, the first entry corresponds to the
variables of columns 10:28 in the data set.
As we can see in pl2, we have a lot of nonzero coefficients of bp4,
although it is the block with lowest priority. In general, there should
be a reason why the glmnet procedure chooses these coefficients. On the
other hand, it is sometimes not appropriate from a practical point of
view that the number exceeds a particular value. Therefore we made it
possible to set a maximal number of nonzero coefficients. Because we
just want to set a limit for bp4, we set the other entries to
Inf
.
set.seed(1234)
pl3 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", type.measure = "auc",
block1.penalization = TRUE, blocks = list(10:28, 1:4, 5:9, 29:1028),
max.coef = c(Inf, Inf, Inf, 10), standardize = FALSE)
pl3$nzero
## [[1]]
## [1] 17
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] 10
We can further diversify the option of whether or not bp1 is included
without a penalty in the model. In a not penalized version, the block
has to be low-dimensional and the model includes all of its variables.
In this case, the model fit is stored in block1unpen
.
Otherwise, the block is treated like the other blocks and a LASSO model
is fitted.
set.seed(1234)
pl4 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial", type.measure = "auc",
block1.penalization = FALSE, blocks = list(1:4, 5:9, 10:28, 29:1028),
max.coef = c(Inf, Inf, Inf, 10), standardize = FALSE)
pl4$block1unpen
##
## Call: glm(formula = Y[current_observations] ~ X[current_observations,
## blocks[[1]]], family = family, weights = weights[current_observations])
##
## Coefficients:
## b1.1 b1.2 b1.3 b1.4
## 0.1683 1.0135 0.9081 1.0682 0.3760
##
## Degrees of Freedom: 266 Total (i.e. Null); 262 Residual
## Null Deviance: 351
## Residual Deviance: 336.8 AIC: 346.8
## [[1]]
## NULL
##
## [[2]]
## [1] 5
##
## [[3]]
## [1] 10
##
## [[4]]
## [1] 14
The first entries of the values which correspond to the
blocks
list remain empty as you can see for the example of
lambda.ind
.
nfolds
and lambda.type
are options which
refer to the cross-validation procedure in cv.glmnet
.
nfolds
specifies the number of folds in the procedure while
lambda.type
handles the amount of cross-validation error.
It can be set to either lambda.min
, which is also the
default, or lambda.1se
. lambda.min
gives the
result with minimum mean cross-validation error, whereas
lambda.1se
gives the result such that the cross-validation
error is within 1 standard error of the minimum, and thus leads to more
sparse results. Note that the latter can only be chosen in combination
with no restrictions in max.coef
.
set.seed(1234)
pl5_min <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial",
type.measure = "auc", block1.penalization = TRUE,
blocks = list(1:4, 5:9, 10:28, 29:1028),
lambda.type = "lambda.min", standardize = FALSE)
set.seed(1234)
pl5_1se <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial",
type.measure = "auc", block1.penalization = TRUE,
blocks = list(1:4, 5:9, 10:28, 29:1028),
lambda.type = "lambda.1se", standardize = FALSE)
pl5_min$nzero
## [[1]]
## [1] 4
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 5
##
## [[4]]
## [1] 55
## [[1]]
## [1] 3
##
## [[2]]
## [1] 0
##
## [[3]]
## [1] 0
##
## [[4]]
## [1] 16
In addition, other options for prioritylasso can be specified. For
example, we can change the type of cross validation measurement for
binary outcome to the misclassification error. Examples for other
outcome variables are shown in the function documentation. Further
arguments can be passed to prioritylasso which are used in
cv.glmnet
, e.g. the elasticnet mixing parameter. Here, we
are only using the default value 1 which corresponds to the standard
LASSO method, but a parameter alpha
between 0 and 1 can be
chosen as well.
The function used for every LASSO fit is glmnet which creates a
sequence of lambda values. The lambda which is chosen according to
lambda.type
is the lambda on position
lambda.ind
of the sequence and its value is stored in
lambda.min
. In general, the lower the value of
lambda.ind
, the higher the lambda.min
and thus
the penalization. This leads to more sparse models. The number of
lambda
values can be chosen with an optional argument. We
give the output of pl1 as an example.
## [[1]]
## [1] 26
##
## [[2]]
## [1] 5
##
## [[3]]
## [1] 8
##
## [[4]]
## [1] 49
## [[1]]
## [1] 0.002421659
##
## [[2]]
## [1] 0.05201088
##
## [[3]]
## [1] 0.0129989
##
## [[4]]
## [1] 0.0213665
The values of lambda are generated in every call of glmnet and hence
cannot be compared. The corresponding mean cross validated error is
stored in the list min.cvm
. The interpretation of its
values depend on the argument type.measure
. In our examples
with a binary outcome it is usually the area under the ROC curve
(AUC).
## [[1]]
## [1] 0.6197587
##
## [[2]]
## [1] 0.638546
##
## [[3]]
## [1] 0.6735664
##
## [[4]]
## [1] 0.8342153
We see that for the example of pl1 it grows with the number of considered modalities. That is what we expected because the more coefficients we have in our model, the better the prediction should be.
prioritylasso was generally implemented with the idea, that
practitioners have prior knowledge about the data that allows them to
specify the priorities. However, it might be, especially in the presence
of several blocks, that it is not clear which order of blocks is the
best. That is why we implemented cvm_prioritylasso that allows us to
define more than one possible list of blocks. The function chooses the
best block order according to the mean cross-validated error.
Analogously, several vectors with maximal coefficients can be specified
in max.coef.list
. In the next example we do not know if it
is better to include variables b4 before those of b3.
set.seed(1234)
cvm_pl1 <- cvm_prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial",
type.measure = "auc", standardize = FALSE,
block1.penalization = FALSE,
blocks.list = list(list(1:4, 5:9, 10:28, 29:1028),
list(1:4, 5:9, 29:1028, 10:28)),
max.coef.list = list(c(Inf, Inf, Inf, 10), c(Inf, Inf, 10, Inf)))
cvm_pl1$best.blocks
## [1] "bp1 = 1:4" "bp2 = 5:9" "bp3 = 29:1028" "bp4 = 10:28"
We see that the order of blocks which leads to the best result is
specified first in blocks.list
. It might be inconvenient to
define a lot of lists in the block argument. Note that it is not the
general idea of prioritylasso to be applied to any order of blocks and
fish for the best.
Now we give a detailed example with a validation.
set.seed(1234)
pl_fit1 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial",
type.measure = "auc", blocks = list(1:4, 5:9, 10:28, 29:1028),
block1.penalization = FALSE, max.coef = c(Inf, Inf, Inf, 10),
standardize = FALSE)
The idea before simulating the data was to include it in the order as it was defined in the data matrix. The first block, a block with only 4 variables should not be penalized. Because of prior knowledge (we have it because we generated the data, normally a practitioner could have it, too) we know that all of its variables are relevant. The maximal number of coefficients for the last block is restricted to 10. We can easily extract the coefficients and print those which are nonzero.
## b1.1 b1.2 b1.3 b1.4 b2.3 b3.2 b3.3 b3.4 b3.7
## 0.1683 1.0135 0.9081 1.0682 0.3760 0.1123 0.3115 -0.4062 0.8406 -0.2676
## b3.11 b4.1 b4.8 b4.106 b4.269 b4.549 b4.623 b4.748 b4.831 b4.977
## 0.1103 0.0056 0.0615 0.0407 0.0516 0.0056 0.0863 0.0211 0.0012 0.2553
We see that every variable from block 1 is nonzero, but we also have variables from every other block in our model.
The R package pROC
gives a lot possibilities to evaluate
and visualize the results. First we can calculate the score and the ROC
curve in the training set and get an AUC value.
pl1_score <- pl_train[ , names(coeff1)[-1], drop=F] %*% coeff1[-1]
pl1_roc <- roc(factor(pl_out_train), pl1_score[,1])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8318
Of course we can’t use this as a performance measure because it is estimated based on the data that was already used for training the model and is thus biased. A more appropriate value can be obtained with the test set.
val1_score <- pl_val[ , names(coeff1)[-1], drop=F] %*% coeff1[-1]
val1_roc <- roc(factor(pl_out_val), c(val1_score))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6471
To investigate the influence of the priorities on the prediction accuracy, we can calculate the model again with another order of blocks and get the ROC curve in the same way.
set.seed(1234)
pl_fit2 <- prioritylasso(X = pl_train, Y = pl_out_train, family = "binomial",
type.measure = "auc", blocks = list(1:4, 10:28, 5:9, 29:1028),
block1.penalization = FALSE, max.coef = c(Inf, Inf, Inf, 10),
standardize = FALSE)
coeff2 <- pl_fit2$coefficients
coeff2 <- coeff2[coeff2 != 0]
val2_score <- pl_val[ , names(coeff2)[-1], drop=F] %*% coeff2[-1]
val2_roc <- roc(factor(pl_out_val), c(val2_score))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6394
We can plot the ROC curves and perform a paired test if the two AUC values are equal.
##
## DeLong's test for two correlated ROC curves
##
## data: val1_roc and val2_roc
## Z = 0.90554, p-value = 0.3652
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## -0.008909984 0.024213666
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.6470588 0.6394070
The test result shows that we cannot reject the hypotheses that the AUC values are equal.