Title: | Analyzing Multiple Omics Data with an Offset Approach |
---|---|
Description: | Priority-LASSO (Klau et al., 2018) fits successive Lasso models for several blocks of (omics) data with different priorities and takes the predicted values as an offset for the next block. Also offers options to deal with block-wise missingness in multi-omics data. Reference: Klau, S., Jurinovic, V., Hornung, R., Herold, T., Boulesteix, A.-L. (2018) Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinformatics 19:322, <doi:10.1186/s12859-018-2344-6>. |
Authors: | Simon Klau, Roman Hornung, Alina Bauer, Jonas Hagenberg |
Maintainer: | Roman Hornung <[email protected]> |
License: | GPL-2 |
Version: | 0.3.2 |
Built: | 2024-11-03 04:20:54 UTC |
Source: | https://github.com/romanhornung/prioritylasso |
Calculates the offsets for the current block
calculate_offsets( current_missings, current_observations, mcontrol, current_block, pred, liste, X, blocks, current_intercept )
calculate_offsets( current_missings, current_observations, mcontrol, current_block, pred, liste, X, blocks, current_intercept )
current_missings |
index vector (indices) of current missing observations |
current_observations |
index vector (indices) of current used observations |
mcontrol |
control for missing data handling |
current_block |
index of current block |
pred |
predictions of current block |
liste |
list with offsets |
X |
original data |
blocks |
information which variable belongs to which block |
current_intercept |
the intercept estimated for the current block |
list with offsets, used imputation model and the blocks used for the imputation model (if applicable)
Extract coefficients from a prioritylasso object
## S3 method for class 'prioritylasso' coef(object, ...)
## S3 method for class 'prioritylasso' coef(object, ...)
object |
model of type prioritylasso |
... |
additional arguments, currently not used |
List with the coefficients and the intercepts
Compare the rows of a matrix with a pattern
compare_boolean(object, pattern)
compare_boolean(object, pattern)
object |
matrix |
pattern |
pattern which is compared against the rows of the matrix |
logical vector if the pattern matches the rows
Runs prioritylasso for a list of block specifications and gives the best results in terms of cv error.
cvm_prioritylasso( X, Y, weights, family, type.measure, blocks.list, max.coef.list = NULL, block1.penalization = TRUE, lambda.type = "lambda.min", standardize = TRUE, nfolds = 10, foldid, cvoffset = FALSE, cvoffsetnfolds = 10, ... )
cvm_prioritylasso( X, Y, weights, family, type.measure, blocks.list, max.coef.list = NULL, block1.penalization = TRUE, lambda.type = "lambda.min", standardize = TRUE, nfolds = 10, foldid, cvoffset = FALSE, cvoffsetnfolds = 10, ... )
X |
a (nxp) matrix of predictors with observations in rows and predictors in columns. |
Y |
n-vector giving the value of the response (either continuous, numeric-binary 0/1, or |
weights |
observation weights. Default is 1 for each observation. |
family |
should be "gaussian" for continuous |
type.measure |
accuracy/error measure computed in cross-validation. It should be "class" (classification error) or "auc" (area under the ROC curve) if |
blocks.list |
list of the format |
max.coef.list |
list of |
block1.penalization |
whether the first block should be penalized. Default is TRUE. |
lambda.type |
specifies the value of lambda used for the predictions. |
standardize |
logical, whether the predictors should be standardized or not. Default is TRUE. |
nfolds |
the number of CV procedure folds. |
foldid |
an optional vector of values between 1 and nfold identifying what fold each observation is in. |
cvoffset |
logical, whether CV should be used to estimate the offsets. Default is FALSE. |
cvoffsetnfolds |
the number of folds in the CV procedure that is performed to estimate the offsets. Default is 10. Only relevant if |
... |
other arguments that can be passed to the function |
object of class cvm_prioritylasso
with the following elements. If these elements are lists, they contain the results for each penalized block of the best result.
lambda.ind
list with indices of lambda for lambda.type
.
lambda.type
type of lambda which is used for the predictions.
lambda.min
list with values of lambda for lambda.type
.
min.cvm
list with the mean cross-validated errors for lambda.type
.
nzero
list with numbers of non-zero coefficients for lambda.type
.
glmnet.fit
list of fitted glmnet
objects.
name
a text string indicating type of measure.
block1unpen
if block1.penalization = FALSE
, the results of either the fitted glm
or coxph
object.
best.blocks
character vector with the indices of the best block specification.
best.blocks.indices
list with the indices of the best block specification ordered by best to worst.
best.max.coef
vector with the number of maximal coefficients corresponding to best.blocks
.
best.model
complete prioritylasso
model of the best solution.
coefficients
coefficients according to the results obtained with best.blocks
.
call
the function call.
The function description and the first example are based on the R package ipflasso
.
Simon Klau
Maintainer: Roman Hornung ([email protected])
Klau, S., Jurinovic, V., Hornung, R., Herold, T., Boulesteix, A.-L. (2018). Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinformatics 19, 322
pl_data
, prioritylasso
, cvr2.ipflasso
cvm_prioritylasso(X = matrix(rnorm(50*500),50,500), Y = rnorm(50), family = "gaussian", type.measure = "mse", lambda.type = "lambda.min", nfolds = 5, blocks.list = list(list(bp1=1:75, bp2=76:200, bp3=201:500), list(bp1=1:75, bp2=201:500, bp3=76:200))) ## Not run: cvm_prioritylasso(X = pl_data[,1:1028], Y = pl_data[,1029], family = "binomial", type.measure = "auc", standardize = FALSE, block1.penalization = FALSE, blocks.list = list(list(1:4, 5:9, 10:28, 29:1028), list(1:4, 5:9, 29:1028, 10:28)), max.coef.list = list(c(Inf, Inf, Inf, 10), c(Inf, Inf, 10, Inf))) ## End(Not run)
cvm_prioritylasso(X = matrix(rnorm(50*500),50,500), Y = rnorm(50), family = "gaussian", type.measure = "mse", lambda.type = "lambda.min", nfolds = 5, blocks.list = list(list(bp1=1:75, bp2=76:200, bp3=201:500), list(bp1=1:75, bp2=201:500, bp3=76:200))) ## Not run: cvm_prioritylasso(X = pl_data[,1:1028], Y = pl_data[,1029], family = "binomial", type.measure = "auc", standardize = FALSE, block1.penalization = FALSE, blocks.list = list(list(1:4, 5:9, 10:28, 29:1028), list(1:4, 5:9, 29:1028, 10:28)), max.coef.list = list(c(Inf, Inf, Inf, 10), c(Inf, Inf, 10, Inf))) ## End(Not run)
prioritylasso
Construct control structures for handling of missing data for prioritylasso
missing.control( handle.missingdata = c("none", "ignore", "impute.offset"), offset.firstblock = c("zero", "intercept"), impute.offset.cases = c("complete.cases", "available.cases"), nfolds.imputation = 10, lambda.imputation = c("lambda.min", "lambda.1se"), perc.comp.cases.warning = 0.3, threshold.available.cases = 30, select.available.cases = c("maximise.blocks", "max") )
missing.control( handle.missingdata = c("none", "ignore", "impute.offset"), offset.firstblock = c("zero", "intercept"), impute.offset.cases = c("complete.cases", "available.cases"), nfolds.imputation = 10, lambda.imputation = c("lambda.min", "lambda.1se"), perc.comp.cases.warning = 0.3, threshold.available.cases = 30, select.available.cases = c("maximise.blocks", "max") )
handle.missingdata |
how blockwise missing data should be treated. Default is |
offset.firstblock |
determines if the offset of the first block for missing observations is zero or the intercept of the observed values for |
impute.offset.cases |
which cases/observations should be used for the imputation model to impute missing offsets. Supported are complete cases (additional constraint is that every observation can only contain one missing block) and all available observations which have an overlap with the current block. |
nfolds.imputation |
nfolds for the glmnet of the imputation model |
lambda.imputation |
which lambda-value should be used for predicting the imputed offsets in cv.glmnet |
perc.comp.cases.warning |
percentage of complete cases when a warning is issued of too few cases for the imputation model |
threshold.available.cases |
if the number of available cases for |
select.available.cases |
determines how the blocks which are used for the imputation model are selected when |
list with control parameters
A data set containing the binary outcome and 1028 predictor variables of 400 artificial AML patients.
pl_data
pl_data
A data frame with 400 rows and 1029 variables:
pl_data[,1029]
)binary outcome representing refractory status.
pl_data[,1:4]
)4 binary variables representing variables with a known influence on the outcome.
pl_data[,5:9]
)5 continuous variables representing clinical variables.
pl_data[,10:28]
)19 binary variables representing mutations.
pl_data[,29:1028]
)1000 continuous variables representing gene expression data.
We generated the data in the following way: We took the empirical correlation of 1028 variables related to
315 AML patients. This correlation served as a correlation matrix when generating 1028 multivariate
normally distributed variables with the R function rmvnorm
. Because we didn't have a positive
definite matrix, we took the nearest positive definite matrix according to the function nearPD
.
The variables that should be binary were dichotomized, so that their marginal probabilities corresponded to
the marginal probabilities they were based on.
The coefficients were defined by
beta_b1 <- c(0.8, 0.8, 0.6, 0.6)
beta_b2 <- c(rep(0.5,3), rep(0,2))
beta_b3 <- c(rep(0.4, 4), rep(0,15))
beta_b4 <- c(rep(0.5, 5), rep(0.3, 5), rep(0,990))
.
We included them in the vector beta <- c(beta_b1, beta_b2, beta_b3, beta_b4)
and calculated
the probability through
where x denotes our data matrix
with 1028 predictor variables. Finally we got the outcome through
pl_out <- rbinom(400, size = 1, p = pi)
.
Makes predictions for a prioritylasso
object. It can be chosen between linear predictors or fitted values.
## S3 method for class 'prioritylasso' predict( object, newdata = NULL, type = c("link", "response"), handle.missingtestdata = c("none", "omit.prediction", "set.zero", "impute.block"), include.allintercepts = FALSE, use.blocks = "all", ... )
## S3 method for class 'prioritylasso' predict( object, newdata = NULL, type = c("link", "response"), handle.missingtestdata = c("none", "omit.prediction", "set.zero", "impute.block"), include.allintercepts = FALSE, use.blocks = "all", ... )
object |
An object of class |
newdata |
(nnew |
type |
Specifies the type of predictions. |
handle.missingtestdata |
Specifies how to deal with missing data in the test data; possibilities are |
include.allintercepts |
should the intercepts from all blocks included in the prediction? If |
use.blocks |
determines which blocks are used for the prediction, the default is all. Otherwise one can specify the number of blocks which are used in a vector |
... |
Further arguments passed to or from other methods. |
handle.missingtestdata
specifies how to deal with missing data.
The default none
cannot handle missing data, omit.prediction
does not make a prediction for observations with missing values and return NA
. set.zero
ignores
the missing data for the calculation of the prediction (the missing value is set to zero).
impute.block
uses an imputation model to impute the offset of a missing block. This only works if the prioritylasso object was fitted with handle.missingdata = "impute.offset"
.
If impute.offset.cases = "complete.cases"
was used, then every observation can have only one missing block. For observations with more than one missing block, NA
is returned.
If impute.offset.cases = "available.cases"
was used, the missingness pattern in the test data has to be the same as in the train data. For observations with an unknown missingness pattern, NA
is returned.
Predictions that depend on type
.
Simon Klau
pl_bin <- prioritylasso(X = matrix(rnorm(50*190),50,190), Y = rbinom(50,1,0.5), family = "binomial", type.measure = "auc", blocks = list(block1=1:13,block2=14:80, block3=81:190), block1.penalization = TRUE, lambda.type = "lambda.min", standardize = FALSE, nfolds = 3) newdata_bin <- matrix(rnorm(10*190),10,190) predict(object = pl_bin, newdata = newdata_bin, type = "response")
pl_bin <- prioritylasso(X = matrix(rnorm(50*190),50,190), Y = rbinom(50,1,0.5), family = "binomial", type.measure = "auc", blocks = list(block1=1:13,block2=14:80, block3=81:190), block1.penalization = TRUE, lambda.type = "lambda.min", standardize = FALSE, nfolds = 3) newdata_bin <- matrix(rnorm(10*190),10,190) predict(object = pl_bin, newdata = newdata_bin, type = "response")
Fits successive Lasso models for several ordered blocks of (omics) data and takes the predicted values as an offset for the next block.
prioritylasso( X, Y, weights, family = c("gaussian", "binomial", "cox"), type.measure, blocks, max.coef = NULL, block1.penalization = TRUE, lambda.type = "lambda.min", standardize = TRUE, nfolds = 10, foldid, cvoffset = FALSE, cvoffsetnfolds = 10, mcontrol = missing.control(), scale.y = FALSE, return.x = TRUE, ... )
prioritylasso( X, Y, weights, family = c("gaussian", "binomial", "cox"), type.measure, blocks, max.coef = NULL, block1.penalization = TRUE, lambda.type = "lambda.min", standardize = TRUE, nfolds = 10, foldid, cvoffset = FALSE, cvoffsetnfolds = 10, mcontrol = missing.control(), scale.y = FALSE, return.x = TRUE, ... )
X |
a (nxp) matrix of predictors with observations in rows and predictors in columns. |
Y |
n-vector giving the value of the response (either continuous, numeric-binary 0/1, or |
weights |
observation weights. Default is 1 for each observation. |
family |
should be "gaussian" for continuous |
type.measure |
accuracy/error measure computed in cross-validation. It should be "class" (classification error) or "auc" (area under the ROC curve) if |
blocks |
list of the format |
max.coef |
vector with integer values which specify the number of maximal coefficients for each block. The first entry is omitted if |
block1.penalization |
whether the first block should be penalized. Default is TRUE. |
lambda.type |
specifies the value of lambda used for the predictions. |
standardize |
logical, whether the predictors should be standardized or not. Default is TRUE. |
nfolds |
the number of CV procedure folds. |
foldid |
an optional vector of values between 1 and nfold identifying what fold each observation is in. |
cvoffset |
logical, whether CV should be used to estimate the offsets. Default is FALSE. |
cvoffsetnfolds |
the number of folds in the CV procedure that is performed to estimate the offsets. Default is 10. Only relevant if |
mcontrol |
controls how to deal with blockwise missing data. For details see below or |
scale.y |
determines if y gets scaled before passed to glmnet. Can only be used for |
return.x |
logical, determines if the input data should be returned by |
... |
other arguments that can be passed to the function |
For block1.penalization = TRUE
, the function fits a Lasso model for each block. First, a standard Lasso for the first entry of blocks
(block of priority 1) is fitted.
The predictions are then taken as an offset in the Lasso fit of the block of priority 2, etc.
For block1.penalization = FALSE
, the function fits a model without penalty to the block of priority 1 (recommended as a block with clinical predictors where p < n
).
This is either a generalized linear model for family "gaussian" or "binomial", or a Cox model. The predicted values are then taken as an offset in the following Lasso fit of the block with priority 2, etc.
The first entry of blocks
contains the indices of variables of the block with priority 1 (first block included in the model).
Assume that blocks = list(1:100, 101:200, 201:300)
then the block with priority 1 consists of the first 100 variables of the data matrix.
Analogously, the block with priority 2 consists of the variables 101 to 200 and the block with priority 3 of the variables 201 to 300.
standardize = TRUE
leads to a standardisation of the covariables (X
) in glmnet
which is recommend by glmnet
.
In case of an unpenalized first block, the covariables for the first block are not standardized.
Please note that the returned coefficients are rescaled to the original scale of the covariates as provided in X
.
Therefore, new data in predict.prioritylasso
should be on the same scale as X
.
To use the method with blockwise missing data, one can set handle.missingdata = ignore
.
Then, to calculate the coefficients for a given block only the observations with values for this blocks are used.
For the observations with missing values, the result from the previous block is used as the offset for the next block.
Crossvalidated offsets are not supported with handle.missingdata = ignore
.
Please note that dealing with single missing values is not supported.
Normally, every observation gets a unique foldid which stays the same across all blocks for the call to cv.glmnet
.
However when handle.missingdata != none
, the foldid is set new for every block.
object of class prioritylasso
with the following elements. If these elements are lists, they contain the results for each penalized block.
lambda.ind
list with indices of lambda for lambda.type
.
lambda.type
type of lambda which is used for the predictions.
lambda.min
list with values of lambda for lambda.type
.
min.cvm
list with the mean cross-validated errors for lambda.type
.
nzero
list with numbers of non-zero coefficients for lambda.type
.
glmnet.fit
list of fitted glmnet
objects.
name
a text string indicating type of measure.
block1unpen
if block1.penalization = FALSE
, the results of either the fitted glm
or coxph
object corresponding to best.blocks
.
coefficients
vector of estimated coefficients. If block1.penalization = FALSE
and family = gaussian
or binomial
, the first entry contains an intercept.
call
the function call.
X
the original data used for the calculation or NA
if return.x = FALSE
missing.data
list with logical entries for every block which observation is missing (TRUE
means missing)
imputation.models
if handle.missingdata = "impute.offsets"
, it contains the used imputation models
blocks.used.for.imputation
if handle.missingdata = "impute.offsets"
, it contains the blocks which were used for the imputation model for every block
y.scale.param
if scale.y = TRUE
, then it contains the mean and sd used for scaling.
blocks
list with the description which variables belong to which block
mcontrol
the missing control settings used
family
the family of the fitted data
dim.x
the dimension of the used training data
The function description and the first example are based on the R package ipflasso
. The second example is inspired by the example of cv.glmnet
from the glmnet
package.
Simon Klau, Roman Hornung, Alina Bauer
Maintainer: Roman Hornung ([email protected])
Klau, S., Jurinovic, V., Hornung, R., Herold, T., Boulesteix, A.-L. (2018). Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinformatics 19, 322
pl_data
, cvm_prioritylasso
, cvr.ipflasso
, cvr2.ipflasso
, missing.control
# gaussian prioritylasso(X = matrix(rnorm(50*500),50,500), Y = rnorm(50), family = "gaussian", type.measure = "mse", blocks = list(bp1=1:75, bp2=76:200, bp3=201:500), max.coef = c(Inf,8,5), block1.penalization = TRUE, lambda.type = "lambda.min", standardize = TRUE, nfolds = 5, cvoffset = FALSE) ## Not run: # cox # simulation of survival data: n <- 50;p <- 300 nzc <- trunc(p/10) x <- matrix(rnorm(n*p), n, p) beta <- rnorm(nzc) fx <- x[, seq(nzc)]%*%beta/3 hx <- exp(fx) # survival times: ty <- rexp(n,hx) # censoring indicator: tcens <- rbinom(n = n,prob = .3,size = 1) library(survival) y <- Surv(ty, 1-tcens) blocks <- list(bp1=1:20, bp2=21:200, bp3=201:300) # run prioritylasso: prioritylasso(x, y, family = "cox", type.measure = "deviance", blocks = blocks, block1.penalization = TRUE, lambda.type = "lambda.min", standardize = TRUE, nfolds = 5) # binomial # using pl_data: prioritylasso(X = pl_data[,1:1028], Y = pl_data[,1029], family = "binomial", type.measure = "auc", blocks = list(bp1=1:4, bp2=5:9, bp3=10:28, bp4=29:1028), standardize = FALSE) ## End(Not run)
# gaussian prioritylasso(X = matrix(rnorm(50*500),50,500), Y = rnorm(50), family = "gaussian", type.measure = "mse", blocks = list(bp1=1:75, bp2=76:200, bp3=201:500), max.coef = c(Inf,8,5), block1.penalization = TRUE, lambda.type = "lambda.min", standardize = TRUE, nfolds = 5, cvoffset = FALSE) ## Not run: # cox # simulation of survival data: n <- 50;p <- 300 nzc <- trunc(p/10) x <- matrix(rnorm(n*p), n, p) beta <- rnorm(nzc) fx <- x[, seq(nzc)]%*%beta/3 hx <- exp(fx) # survival times: ty <- rexp(n,hx) # censoring indicator: tcens <- rbinom(n = n,prob = .3,size = 1) library(survival) y <- Surv(ty, 1-tcens) blocks <- list(bp1=1:20, bp2=21:200, bp3=201:300) # run prioritylasso: prioritylasso(x, y, family = "cox", type.measure = "deviance", blocks = blocks, block1.penalization = TRUE, lambda.type = "lambda.min", standardize = TRUE, nfolds = 5) # binomial # using pl_data: prioritylasso(X = pl_data[,1:1028], Y = pl_data[,1029], family = "binomial", type.measure = "auc", blocks = list(bp1=1:4, bp2=5:9, bp3=10:28, bp4=29:1028), standardize = FALSE) ## End(Not run)