Title: | Innovative Complex Split Procedures in Random Forests Through Candidate Split Sampling |
---|---|
Description: | Implements interaction forests [1], which are specific diversity forests and the basic form of diversity forests that uses univariable, binary splitting [2]. Interaction forests (IFs) are ensembles of decision trees that model quantitative and qualitative interaction effects using bivariable splitting. IFs come with the Effect Importance Measure (EIM), which can be used to identify variable pairs that feature quantitative and qualitative interaction effects with high predictive relevance. IFs and EIM focus on well interpretable forms of interactions. The package also offers plot functions for visualising the estimated forms of interaction effects. Categorical, metric, and survival outcomes are supported. This is a fork of the R package 'ranger' (main author: Marvin N. Wright) that implements random forests using an efficient C++ implementation. References: [1] Hornung, R. & Boulesteix, A.-L. (2022) Interaction Forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>. [2] Hornung, R. (2022) Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>. |
Authors: | Roman Hornung [aut, cre], Marvin N. Wright [ctb, cph] |
Maintainer: | Roman Hornung <[email protected]> |
License: | GPL-3 |
Version: | 0.4.0 |
Built: | 2024-11-06 03:45:04 UTC |
Source: | https://github.com/romanhornung/diversityforest |
The diversity forest algorithm is not a specific algorithm, but an alternative candidate split sampling scheme that makes complex split procedures in random forests possible computationally by drastically reducing the numbers of candidate splits that need to be evaluated for each split. It also avoids the well-known variable selection bias in conventional random forests that has the effect that variables with many possible splits are selected too frequently for splitting (Strobl et al., 2007). For details, see Hornung (2022).
This package currently features two types of diversity forests:
the basic form of diversity forests that uses univariable, binary splitting, which is also used in conventional random forests
interaction forests (IFs) (Hornung & Boulesteix, 2022), which use bivariable splitting to model quantitative and qualitative interaction effects. IFs feature the Effect Importance Measure (EIM), which ranks the variable pairs with respect to the predictive importance of their quantitative and qualitative interaction effects. The individual variables can be ranked as well using EIM. For details, see Hornung & Boulesteix (2022).
Diversity forests with univariable splitting can be constructed using the function divfor
and
interaction forests using the function interactionfor
. Both functions support categorical,
metric, and survival outcomes.
This package is a fork of the R package 'ranger' that implements random forests using an efficient C++ implementation. The documentation is in large parts taken from 'ranger', where some parts of the documentation may not apply to (the current version of) the 'diversityForest' package.
Details on further functionalities of the code that are not presented in the help pages of 'diversityForest' are found in the help pages of 'ranger', version 0.11.0, because 'diversityForest' is based on the latter version of 'ranger'. The code in the example sections can be used as a template for all basic application scenarios with respect to classification, regression and survival prediction.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Strobl, C., Boulesteix, A.-L., Zeileis, A., Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8:25, <doi:10.1186/1471-2105-8-25>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Implements the most basic form of diversity forests that uses univariable, binary splitting. Currently, categorical, metric, and survival outcomes are supported.
divfor( formula = NULL, data = NULL, num.trees = 500, mtry = NULL, importance = "none", write.forest = TRUE, probability = FALSE, min.node.size = NULL, max.depth = NULL, replace = TRUE, sample.fraction = ifelse(replace, 1, 0.632), case.weights = NULL, class.weights = NULL, splitrule = NULL, num.random.splits = 1, alpha = 0.5, minprop = 0.1, split.select.weights = NULL, always.split.variables = NULL, respect.unordered.factors = NULL, scale.permutation.importance = FALSE, keep.inbag = FALSE, inbag = NULL, holdout = FALSE, quantreg = FALSE, oob.error = TRUE, num.threads = NULL, save.memory = FALSE, verbose = TRUE, seed = NULL, dependent.variable.name = NULL, status.variable.name = NULL, classification = NULL, nsplits = 30, proptry = 1 )
divfor( formula = NULL, data = NULL, num.trees = 500, mtry = NULL, importance = "none", write.forest = TRUE, probability = FALSE, min.node.size = NULL, max.depth = NULL, replace = TRUE, sample.fraction = ifelse(replace, 1, 0.632), case.weights = NULL, class.weights = NULL, splitrule = NULL, num.random.splits = 1, alpha = 0.5, minprop = 0.1, split.select.weights = NULL, always.split.variables = NULL, respect.unordered.factors = NULL, scale.permutation.importance = FALSE, keep.inbag = FALSE, inbag = NULL, holdout = FALSE, quantreg = FALSE, oob.error = TRUE, num.threads = NULL, save.memory = FALSE, verbose = TRUE, seed = NULL, dependent.variable.name = NULL, status.variable.name = NULL, classification = NULL, nsplits = 30, proptry = 1 )
formula |
Object of class |
data |
Training data of class |
num.trees |
Number of trees. Default is 500. |
mtry |
Artefact from 'ranger'. NOT needed for diversity forests. |
importance |
Variable importance mode, one of 'none', 'impurity', 'impurity_corrected', 'permutation'. The 'impurity' measure is the Gini index for classification, the variance of the responses for regression and the sum of test statistics (see |
write.forest |
Save |
probability |
Grow a probability forest as in Malley et al. (2012). NOTE: Not yet implemented for diversity forests! |
min.node.size |
Minimal node size. Default 1 for classification, 5 for regression, 3 for survival, and 5 for probability. |
max.depth |
Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree). |
replace |
Sample with replacement. |
sample.fraction |
Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement. For classification, this can be a vector of class-specific values. |
case.weights |
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees. |
class.weights |
Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes. |
splitrule |
Splitting rule. For classification and probability estimation "gini" or "extratrees" with default "gini". For regression "variance", "extratrees" or "maxstat" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank". NOTE: For diversity forests currently only the default splitting rules are supported. |
num.random.splits |
Artefact from 'ranger'. NOT needed for diversity forests. |
alpha |
For "maxstat" splitrule: Significance threshold to allow splitting. NOT needed for diversity forests. |
minprop |
For "maxstat" splitrule: Lower quantile of covariate distribution to be considered for splitting. NOT needed for diversity forests. |
split.select.weights |
Numeric vector with weights between 0 and 1, representing the probability to select variables for splitting. Alternatively, a list of size num.trees, containing split select weight vectors for each tree can be used. |
always.split.variables |
Currently not useable. Character vector with variable names to be always selected. |
respect.unordered.factors |
Handling of unordered factor covariates. One of 'ignore' and 'order' (the option 'partition' possible in 'ranger' is not (yet) possible with diversity forests). Default is 'ignore'. Alternatively TRUE (='order') or FALSE (='ignore') can be used. |
scale.permutation.importance |
Scale permutation importance by standard error as in (Breiman 2001). Only applicable if permutation variable importance mode selected. |
keep.inbag |
Save how often observations are in-bag in each tree. |
inbag |
Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling. |
holdout |
Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error. |
quantreg |
Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set |
oob.error |
Compute OOB prediction error. Set to |
num.threads |
Number of threads. Default is number of CPUs available. |
save.memory |
Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing, use only if you encounter memory problems. NOT needed for diversity forests. |
verbose |
Show computation status and estimated runtime. |
seed |
Random seed. Default is |
dependent.variable.name |
Name of outcome variable, needed if no formula given. For survival forests this is the time variable. |
status.variable.name |
Name of status variable, only applicable to survival data and needed if no formula given. Use 1 for event and 0 for censoring. |
classification |
Only needed if data is a matrix. Set to |
nsplits |
Number of candidate splits to sample for each split. Default is 30. |
proptry |
Parameter that restricts the number of candidate splits considered for small nodes. If |
Object of class divfor
with elements
forest |
Saved forest (If write.forest set to TRUE). Note that the variable IDs in the |
predictions |
Predicted classes/values, based on out-of-bag samples (classification and regression only). |
variable.importance |
Variable importance for each independent variable. |
prediction.error |
Overall out-of-bag prediction error. For classification this is the fraction of missclassified samples, for probability estimation the Brier score, for regression the mean squared error and for survival one minus Harrell's C-index. |
r.squared |
R squared. Also called explained variance or coefficient of determination (regression only). Computed on out-of-bag data. |
confusion.matrix |
Contingency table for classes and predictions based on out-of-bag samples (classification only). |
unique.death.times |
Unique death times (survival only). |
chf |
Estimated cumulative hazard function for each sample (survival only). |
survival |
Estimated survival function for each sample (survival only). |
call |
Function call. |
num.trees |
Number of trees. |
num.independent.variables |
Number of independent variables. |
min.node.size |
Value of minimal node size used. |
treetype |
Type of forest/tree. classification, regression or survival. |
importance.mode |
Importance mode used. |
num.samples |
Number of samples. |
splitrule |
Splitting rule. |
replace |
Sample with replacement. |
nsplits |
Value of |
proptry |
Value of |
Roman Hornung, Marvin N. Wright
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Breiman, L. (2001). Random forests. Machine Learning 45:5-32, <doi:10.1023/A:1010933404324>.
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, <doi:10.3414/ME00-01-0052>.
Meinshausen (2006). Quantile Regression Forests. Journal of Machine Learning Research 7:983-999.
## Not run: ## Load package: library("diversityForest") ## Set seed to obtain reproducible results: set.seed(1234) ## Diversity forest with default settings (NOT recommended) # Classification: divfor(Species ~ ., data = iris, num.trees = 20) # Regression: iris2 <- iris; iris2$Species <- NULL; iris2$Y <- rnorm(nrow(iris2)) divfor(Y ~ ., data = iris2, num.trees = 20) # Survival: library("survival") divfor(Surv(time, status) ~ ., data = veteran, num.trees = 20, respect.unordered.factors = "order") # NOTE: num.trees = 20 is specified too small for practical # purposes - the prediction performance of the resulting # forest will be suboptimal!! # In practice, num.trees = 500 (default value) or a # larger number should be used. ## Diversity forest with specified values for nsplits and proptry (NOT recommended) divfor(Species ~ ., data = iris, nsplits = 10, proptry = 0.4, num.trees = 20) # NOTE again: num.trees = 20 is specified too small for practical purposes. ## Applying diversity forest after optimizing the values of nsplits and proptry (recommended) tuneres <- tunedivfor(formula = Species ~ ., data = iris, num.trees.pre = 20) # NOTE: num.trees.pre = 20 is specified too small for practical # purposes - the out-of-bag error estimates of the forests # constructed during optimization will be much too variable!! # In practice, num.trees.pre = 500 (default value) or a # larger number should be used. divfor(Species ~ ., data = iris, nsplits = tuneres$nsplitsopt, proptry = tuneres$proptryopt, num.trees = 20) # NOTE again: num.trees = 20 is specified too small for practical purposes. ## Prediction train.idx <- sample(nrow(iris), 2/3 * nrow(iris)) iris.train <- iris[train.idx, ] iris.test <- iris[-train.idx, ] tuneres <- tunedivfor(formula = Species ~ ., data = iris.train, num.trees.pre = 20) # NOTE again: num.trees.pre = 20 is specified too small for practical purposes. rg.iris <- divfor(Species ~ ., data = iris.train, nsplits = tuneres$nsplitsopt, proptry = tuneres$proptryopt, num.trees = 20) # NOTE again: num.trees = 20 is specified too small for practical purposes. pred.iris <- predict(rg.iris, data = iris.test) table(iris.test$Species, pred.iris$predictions) ## Variable importance rg.iris <- divfor(Species ~ ., data = iris, importance = "permutation", num.trees = 20) # NOTE again: num.trees = 20 is specified too small for practical purposes. rg.iris$variable.importance ## End(Not run)
## Not run: ## Load package: library("diversityForest") ## Set seed to obtain reproducible results: set.seed(1234) ## Diversity forest with default settings (NOT recommended) # Classification: divfor(Species ~ ., data = iris, num.trees = 20) # Regression: iris2 <- iris; iris2$Species <- NULL; iris2$Y <- rnorm(nrow(iris2)) divfor(Y ~ ., data = iris2, num.trees = 20) # Survival: library("survival") divfor(Surv(time, status) ~ ., data = veteran, num.trees = 20, respect.unordered.factors = "order") # NOTE: num.trees = 20 is specified too small for practical # purposes - the prediction performance of the resulting # forest will be suboptimal!! # In practice, num.trees = 500 (default value) or a # larger number should be used. ## Diversity forest with specified values for nsplits and proptry (NOT recommended) divfor(Species ~ ., data = iris, nsplits = 10, proptry = 0.4, num.trees = 20) # NOTE again: num.trees = 20 is specified too small for practical purposes. ## Applying diversity forest after optimizing the values of nsplits and proptry (recommended) tuneres <- tunedivfor(formula = Species ~ ., data = iris, num.trees.pre = 20) # NOTE: num.trees.pre = 20 is specified too small for practical # purposes - the out-of-bag error estimates of the forests # constructed during optimization will be much too variable!! # In practice, num.trees.pre = 500 (default value) or a # larger number should be used. divfor(Species ~ ., data = iris, nsplits = tuneres$nsplitsopt, proptry = tuneres$proptryopt, num.trees = 20) # NOTE again: num.trees = 20 is specified too small for practical purposes. ## Prediction train.idx <- sample(nrow(iris), 2/3 * nrow(iris)) iris.train <- iris[train.idx, ] iris.test <- iris[-train.idx, ] tuneres <- tunedivfor(formula = Species ~ ., data = iris.train, num.trees.pre = 20) # NOTE again: num.trees.pre = 20 is specified too small for practical purposes. rg.iris <- divfor(Species ~ ., data = iris.train, nsplits = tuneres$nsplitsopt, proptry = tuneres$proptryopt, num.trees = 20) # NOTE again: num.trees = 20 is specified too small for practical purposes. pred.iris <- predict(rg.iris, data = iris.test) table(iris.test$Species, pred.iris$predictions) ## Variable importance rg.iris <- divfor(Species ~ ., data = iris, importance = "permutation", num.trees = 20) # NOTE again: num.trees = 20 is specified too small for practical purposes. rg.iris$variable.importance ## End(Not run)
Extract variable importance of divfor
object.
## S3 method for class 'divfor' importance(x, ...)
## S3 method for class 'divfor' importance(x, ...)
x |
|
... |
Further arguments passed to or from other methods. |
Variable importance measures.
Marvin N. Wright
Implements interaction forests as described in Hornung & Boulesteix (2022). Currently, categorical, metric, and survival outcomes are supported. Interaction forests feature the effect importance measure (EIM), which can be used to rank the covariate variable pairs with respect to the impact of their interaction effects on prediction. This allows to identify relevant interaction effects. Interaction forests focus on well interpretable interaction effects. See the 'Details' section below for more details. In addition, we strongly recommend to consult Section C of Supplementary Material 1 of Hornung & Boulesteix (2022), which uses detailed examples of interaction forest analyses with code to illustrate how interaction forests can be used in applications: Link.
interactionfor( formula = NULL, data = NULL, importance = "both", num.trees = NULL, simplify.large.n = TRUE, num.trees.eim.large.n = NULL, write.forest = TRUE, probability = FALSE, min.node.size = NULL, max.depth = NULL, replace = FALSE, sample.fraction = ifelse(replace, 1, 0.7), case.weights = NULL, class.weights = NULL, splitrule = NULL, always.split.variables = NULL, keep.inbag = FALSE, inbag = NULL, holdout = FALSE, quantreg = FALSE, oob.error = TRUE, num.threads = NULL, verbose = TRUE, seed = NULL, dependent.variable.name = NULL, status.variable.name = NULL, npairs = NULL, classification = NULL )
interactionfor( formula = NULL, data = NULL, importance = "both", num.trees = NULL, simplify.large.n = TRUE, num.trees.eim.large.n = NULL, write.forest = TRUE, probability = FALSE, min.node.size = NULL, max.depth = NULL, replace = FALSE, sample.fraction = ifelse(replace, 1, 0.7), case.weights = NULL, class.weights = NULL, splitrule = NULL, always.split.variables = NULL, keep.inbag = FALSE, inbag = NULL, holdout = FALSE, quantreg = FALSE, oob.error = TRUE, num.threads = NULL, verbose = TRUE, seed = NULL, dependent.variable.name = NULL, status.variable.name = NULL, npairs = NULL, classification = NULL )
formula |
Object of class |
data |
Training data of class |
importance |
Effect importance mode. One of the following: "both" (the default), "qualitative", "quantitative", "mainonly", "none". See the 'Details' section below for explanation. |
num.trees |
Number of trees. The default number is 20000, if EIM values should be computed
and 2000 otherwise. Note that if |
simplify.large.n |
Should restricted tree depths be used, when calculating EIM values for large data sets? See the 'Details' section below for more information. Default is |
num.trees.eim.large.n |
Number of trees in the forest used for calculating the EIM values for large data sets.
If |
write.forest |
Save |
probability |
Grow a probability forest as in Malley et al. (2012). |
min.node.size |
Minimal node size. Default 1 for classification, 5 for regression, 3 for survival, and 5 for probability. |
max.depth |
Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree). |
replace |
Sample with replacement. Default is |
sample.fraction |
Fraction of observations to sample. Default is 1 for sampling with replacement and 0.7 for sampling without replacement. For classification, this can be a vector of class-specific values. |
case.weights |
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees. |
class.weights |
Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes. |
splitrule |
Splitting rule. For classification and probability estimation "gini" or "extratrees" with default "gini". For regression "variance", "extratrees" or "maxstat" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank". NOTE: For interaction forests currently only the default splitting rules are supported. |
always.split.variables |
Currently not useable. Character vector with variable names to be always selected. |
keep.inbag |
Save how often observations are in-bag in each tree. |
inbag |
Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling. |
holdout |
Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error. NOTE: Currently, not useable for interaction forests. |
quantreg |
Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set |
oob.error |
Compute OOB prediction error. Set to |
num.threads |
Number of threads. Default is number of CPUs available. |
verbose |
Show computation status and estimated runtime. |
seed |
Random seed. Default is |
dependent.variable.name |
Name of outcome variable, needed if no formula given. For survival forests this is the time variable. |
status.variable.name |
Name of status variable, only applicable to survival data and needed if no formula given. Use 1 for event and 0 for censoring. |
npairs |
Number of variable pairs to sample for each split. Default is the square root of the number of independent variables divided by 2 (this number is rounded up). |
classification |
Only needed if data is a matrix. Set to |
The effect importance measure (EIM) of interaction forests distinguishes quantitative and qualitative interaction effects (Peto, 1982).
This is a common distinction as these two types of interaction effects are interpreted in different ways (see below).
For both of these types, EIM values for each variable pair are obtained: the quantitative and qualitative EIM values.
Interaction forests target easily interpretable types of interaction effects. These can be communicated clearly using statements
of the following kind: "The strength of the positive (negative) effect of variable A on the outcome depends on the level of variable B"
for quantitative interactions, and "for observations with small values of variable B, the effect of variable A is positive (negative),
but for observations with large values of B, the effect of A is negative (positive)" for qualitative interactions.
In addition to calculating EIM values for variable pairs, importance values for the individual variables are calculated as well, the univariable
EIM values. These measure the variable importance as in the case of classical variable importance measures of random forests.
The effect importance mode can be set via the importance
argument: "qualitative"
: Calculate only qualitative EIM values;
"quantitative"
: Calculate only quantitative EIM values; "both"
(the default): Calculate qualitative and quantitative EIM
values; "mainonly"
: Calculate only univariable EIM values.
The top variable pairs with largest quantitative and qualitative EIM values likely have quantitative and qualitative interactions,
respectively, which have a considerable impact on prediction. The top variables with largest univariable EIM values likely have a considerable
impact on prediction. Note that it is currently not possible to test the EIM values for
statistical significance using the interaction forests algorithm itself. However, the p-values
shown in the plots obtained with plotEffects
(which are obtained using bivariable
models) can be adjusted for multiple testing using the Bonferroni procedure to obtain
practical p-values. See the end of the 'Details' section of plotEffects
for explanation and guidance.
If the number of variables is larger than 100, not all possible variable pairs are considered, but, using a screening procedure, the
5000 variable pairs with the strongest indications of interaction effects are pre-selected.
NOTE: To make interpretations, it is crucial to investigate (visually) the forms the interaction effects of variable pairs
with large quantitative and qualitative EIM values take. This can be done using the plot function plot.interactionfor
(first overview) and plotEffects
.
NOTE ALSO: As described in Hornung & Boulesteix (2022), in the case of data with larger numbers of variables (larger than 100,
but more seriously for high-dimensional data), the univariable EIM values can be biased. Therefore, it is strongly recommended
to interpret the univariable EIM values with caution, if the data are high-dimensional. If it is of interest to measure the univariable
importance of the variables for high-dimensional data, an additional conventional random forest (e.g., using the ranger
package)
should be constructed and the variable importance measure values of this random forest be used for ranking the univariable effects.
For large data sets with many observations the calculation of the EIM values can become very costly - when using fully grown trees.
Therefore, when calculating EIM values for data sets with more than 1000 observations we use the following
maximum tree depths by default (argument: simplify.large.n = TRUE
):
if : Use fully grown trees.
if : Use tree depth 10.
if : Use tree depth 7.
if : Use tree depth 5.
Extensive analyses in Hornung & Boulesteix (2022) suggest that by restricting the tree depth in this way,
the EIM values that would result when using fully grown trees are approximated well. However, the prediction
performance suffers, when using restricted trees. Therefore, we restrict the tree depth only when calculating
the EIM values (if ), but construct a second interaction forest with unrestricted tree depth,
which is then used for prediction purposes.
Object of class interactionfor
with elements
predictions |
Predicted classes/values, based on out-of-bag samples (classification and regression only). |
num.trees |
Number of trees. |
num.independent.variables |
Number of independent variables. |
unique.death.times |
Unique death times (survival only). |
min.node.size |
Value of minimal node size used. |
npairs |
Number of variable pairs sampled for each split. |
eim.univ.sorted |
Univariable EIM values sorted in decreasing order. |
eim.univ |
Univariable EIM values. |
eim.qual.sorted |
Qualitative EIM values sorted in decreasing order. |
eim.qual |
Qualitative EIM values. |
eim.quant.sorted |
Quantitative EIM values sorted in decreasing order. |
eim.quant |
Quantitative EIM values. These values are labeled analoguously as those in |
prediction.error |
Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples, for probability estimation the Brier score, for regression the mean squared error and for survival one minus Harrell's C-index. This is 'NA' for data sets with more than 100 covariate variables, because for such data sets we pre-select the 5000 variable pairs with strongest indications of interaction effects. This pre-selection cannot be taken into account in the out-of-bag error estimation, which is why the out-of-bag error estimates would be (much) too optimistic for data sets with more than 100 covariate variables. |
forest |
Saved forest (If write.forest set to TRUE). Note that the variable IDs in the |
confusion.matrix |
Contingency table for classes and predictions based on out-of-bag samples (classification only). |
chf |
Estimated cumulative hazard function for each sample (survival only). |
survival |
Estimated survival function for each sample (survival only). |
splitrule |
Splitting rule. |
treetype |
Type of forest/tree. classification, regression or survival. |
r.squared |
R squared. Also called explained variance or coefficient of determination (regression only). Computed on out-of-bag data. |
call |
Function call. |
importance.mode |
Importance mode used. |
num.samples |
Number of samples. |
replace |
Sample with replacement. |
eim.quant.rawlists |
List containing the four vectors of un-adjusted 'raw' quantitative EIM values
and the four vectors of adjusted EIM values. These are usually not required by the user. |
promispairs |
List giving the indices of the variables in the pre-selected variable pairs. If the number of variables is at most 100, all variable pairs are considered. |
plotres |
List ob objects needed by the plot functions: |
Roman Hornung, Marvin N. Wright
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Peto, R., (1982) Statistical aspects of cancer trials. In: K.E. Halnam (Ed.), Treatment of Cancer. Chapman & Hall: London.
Wright, M. N., Ziegler, A. (2017). ranger: A fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Breiman, L. (2001). Random forests. Machine Learning 45:5-32, <doi:10.1023/A:1010933404324>.
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines: consistent probability estimation using nonparametric learning machines. Methods of Information in Medicine 51:74-81, <doi:10.3414/ME00-01-0052>.
Meinshausen (2006). Quantile Regression Forests. Journal of Machine Learning Research 7:983-999.
predict.divfor
, plot.interactionfor
, plotEffects
## Not run: ## Load package: library("diversityForest") ## Set seed to make results reproducible: set.seed(1234) ## Construct interaction forests and calculate EIM values: # Binary outcome: data(zoo) modelcat <- interactionfor(dependent.variable.name = "type", data = zoo, num.trees = 20) # Metric outcome: data(stock) modelcont <- interactionfor(dependent.variable.name = "company10", data = stock, num.trees = 20) # Survival outcome: library("survival") mgus2$id <- NULL # 'mgus2' data set is contained in the 'survival' package # categorical variables need to be of factor format - important!! mgus2$sex <- factor(mgus2$sex) mgus2$pstat <- factor(mgus2$pstat) # Remove the second time variable 'ptime': mgus2$ptime <- NULL # Remove missing values: mgus2 <- mgus2[complete.cases(mgus2),] # Take subset to make the calculations less computationally # expensive for the example (in actual applications, we would of course # use the whole data set): mgus2sub <- mgus2[sample(1:nrow(mgus2), size=500),] # Apply 'interactionfor': modelsurv <- interactionfor(formula = Surv(futime, death) ~ ., data=mgus2sub, num.trees=20) # NOTE: num.trees = 20 (in the above) would be much too small for practical # purposes. This small number of trees was simply used to keep the # runtime of the example short. # The default number of trees is num.trees = 20000 if EIM values are calculated # and num.trees = 2000 otherwise. ## Inspect the rankings of the variables and variable pairs with respect to ## the univariable, quantitative, and qualitative EIM values: # Univariable EIM values: modelcat$eim.univ.sorted # Pairs with top quantitative EIM values: modelcat$eim.quant.sorted[1:5] # Pairs with top qualitative EIM values: modelcat$eim.qual.sorted[1:5] ## Investigate visually the forms of the interaction effects of the variable pairs with ## largest quantitative and qualitative EIM values: plot(modelcat) plotEffects(modelcat, type="quant") # type="quant" is default. plotEffects(modelcat, type="qual") ## Prediction: # Separate 'zoo' data set randomly in training # and test data: data(zoo) train.idx <- sample(nrow(zoo), 2/3 * nrow(zoo)) zoo.train <- zoo[train.idx, ] zoo.test <- zoo[-train.idx, ] # Construct interaction forest on training data: # NOTE again: num.trees = 20 is specified too small for practical purposes. modelcattrain <- interactionfor(dependent.variable.name = "type", data = zoo, importance = "none", num.trees = 20) # NOTE: Because we are only interested in prediction here, we do not # calculate EIM values (by setting importance = "none"), because this # speeds up calculations. # Predict class values of the test data: pred.zoo <- predict(modelcattrain, data = zoo.test) # Compare predicted and true class values of the test data: table(zoo.test$type, pred.zoo$predictions) ## End(Not run)
## Not run: ## Load package: library("diversityForest") ## Set seed to make results reproducible: set.seed(1234) ## Construct interaction forests and calculate EIM values: # Binary outcome: data(zoo) modelcat <- interactionfor(dependent.variable.name = "type", data = zoo, num.trees = 20) # Metric outcome: data(stock) modelcont <- interactionfor(dependent.variable.name = "company10", data = stock, num.trees = 20) # Survival outcome: library("survival") mgus2$id <- NULL # 'mgus2' data set is contained in the 'survival' package # categorical variables need to be of factor format - important!! mgus2$sex <- factor(mgus2$sex) mgus2$pstat <- factor(mgus2$pstat) # Remove the second time variable 'ptime': mgus2$ptime <- NULL # Remove missing values: mgus2 <- mgus2[complete.cases(mgus2),] # Take subset to make the calculations less computationally # expensive for the example (in actual applications, we would of course # use the whole data set): mgus2sub <- mgus2[sample(1:nrow(mgus2), size=500),] # Apply 'interactionfor': modelsurv <- interactionfor(formula = Surv(futime, death) ~ ., data=mgus2sub, num.trees=20) # NOTE: num.trees = 20 (in the above) would be much too small for practical # purposes. This small number of trees was simply used to keep the # runtime of the example short. # The default number of trees is num.trees = 20000 if EIM values are calculated # and num.trees = 2000 otherwise. ## Inspect the rankings of the variables and variable pairs with respect to ## the univariable, quantitative, and qualitative EIM values: # Univariable EIM values: modelcat$eim.univ.sorted # Pairs with top quantitative EIM values: modelcat$eim.quant.sorted[1:5] # Pairs with top qualitative EIM values: modelcat$eim.qual.sorted[1:5] ## Investigate visually the forms of the interaction effects of the variable pairs with ## largest quantitative and qualitative EIM values: plot(modelcat) plotEffects(modelcat, type="quant") # type="quant" is default. plotEffects(modelcat, type="qual") ## Prediction: # Separate 'zoo' data set randomly in training # and test data: data(zoo) train.idx <- sample(nrow(zoo), 2/3 * nrow(zoo)) zoo.train <- zoo[train.idx, ] zoo.test <- zoo[-train.idx, ] # Construct interaction forest on training data: # NOTE again: num.trees = 20 is specified too small for practical purposes. modelcattrain <- interactionfor(dependent.variable.name = "type", data = zoo, importance = "none", num.trees = 20) # NOTE: Because we are only interested in prediction here, we do not # calculate EIM values (by setting importance = "none"), because this # speeds up calculations. # Predict class values of the test data: pred.zoo <- predict(modelcattrain, data = zoo.test) # Compare predicted and true class values of the test data: table(zoo.test$type, pred.zoo$predictions) ## End(Not run)
interactionfor
objectsPlot function for interactionfor
objects that allows to obtain a first overview of the result of the
interaction forest analysis. This function visualises the distributions of the EIM values and
the estimated forms of the bivariable influences of the variable pairs with largest quantitative and
qualitative EIM values. Further visual exploration of the result of the interaction
forest analysis should be conducted using plotEffects
.
## S3 method for class 'interactionfor' plot(x, numpairsquant = 2, numpairsqual = 2, ...)
## S3 method for class 'interactionfor' plot(x, numpairsquant = 2, numpairsqual = 2, ...)
x |
Object of class |
numpairsquant |
The number of pairs with largest quantitative EIM values to plot. Default is 2. |
numpairsqual |
The number of pairs with largest qualitative EIM values to plot. Default is 2. |
... |
Further arguments passed to or from other methods. |
For details on the plots of the estimated forms of the bivariable influences of the variable pairs see plotEffects
.
NOTE: The p-values shown in the plots are generally much too optimistic and MUST NOT be reported
as the result of a statistical test for significance of interaction. To obtain adjusted p-values that would correspond to
valid tests, it would be possible to multiply these p-values by the number of possible variable pairs,
which would correspond to Bonferroni-adjusted p-values. See the 'Details' section of plotEffects
for further
explanation and guidance. Note, however, that these Bonferroni-adjusted p-values should be interpreted
with caution because, stemming from bivariable models, these p-values do not take the multivariable nature
of the data into account.
NOTE ALSO: As described in Hornung & Boulesteix (2022), in the case of data with larger numbers of variables (larger than 100, but more seriously
for high-dimensional data), the univariable EIM values can be biased. Therefore, it is strongly recommended to interpret the univariable EIM values
with caution, if the data are high-dimensional. If it is of interest to measure the univariable importance of the variables for high-dimensional data,
an additional conventional random forest (e.g., using the ranger
package) should be constructed and the variable importance measure values
of this random forest be used for ranking the univariable effects.
A ggplot2 plot.
Roman Hornung
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
## Not run: ## Load package: library("diversityForest") ## Set seed to make results reproducible: set.seed(1234) ## Construct interaction forest and calculate EIM values: data(stock) model <- interactionfor(dependent.variable.name = "company10", data = stock, num.trees = 20) # NOTE: num.trees = 20 (in the above) would be much too small for practical # purposes. This small number of trees was simply used to keep the # runtime of the example short. # The default number of trees is num.trees = 20000 if EIM values are calculated # and num.trees = 2000 otherwise. ## When using the plot() function without further specifications, ## by default the estimated bivariable influences of the two pairs with largest quantitative ## and qualitative EIM values are shown: plot(model) # It is, however, also possible to change the numbers of # pairs with largest quantitative and qualitative EIM values # to be shown: plot(model, numpairsquant = 4, numpairsqual = 3) ## End(Not run)
## Not run: ## Load package: library("diversityForest") ## Set seed to make results reproducible: set.seed(1234) ## Construct interaction forest and calculate EIM values: data(stock) model <- interactionfor(dependent.variable.name = "company10", data = stock, num.trees = 20) # NOTE: num.trees = 20 (in the above) would be much too small for practical # purposes. This small number of trees was simply used to keep the # runtime of the example short. # The default number of trees is num.trees = 20000 if EIM values are calculated # and num.trees = 2000 otherwise. ## When using the plot() function without further specifications, ## by default the estimated bivariable influences of the two pairs with largest quantitative ## and qualitative EIM values are shown: plot(model) # It is, however, also possible to change the numbers of # pairs with largest quantitative and qualitative EIM values # to be shown: plot(model, numpairsquant = 4, numpairsqual = 3) ## End(Not run)
This function allows to visualise the (estimated) bivariable influences of pairs of variables (with large quantitative and qualitative EIM values) on the outcome. This step is crucial, because to interpret interaction effects between variable pairs with large quantitative and qualitative EIM values, it is necessary to learn about the forms these interaction effects take.
plotEffects( intobj, type = "quant", numpairs = 5, indpairs = NULL, pairs = NULL, allwith = NULL, pvalues = TRUE, twoplots = TRUE, addtitles = TRUE, plotit = TRUE )
plotEffects( intobj, type = "quant", numpairs = 5, indpairs = NULL, pairs = NULL, allwith = NULL, pvalues = TRUE, twoplots = TRUE, addtitles = TRUE, plotit = TRUE )
intobj |
Object of class |
type |
This can be either "quant" or "qual" and determines whether the plotted pairs are sorted according to either the quantitative or qualitative EIM values in decreasing order. Default is "quant". |
numpairs |
The number of pairs to plot (default: 5). This is overwritten by |
indpairs |
Optional. The indices of the pairs in the sorted lists of quantitative ( |
pairs |
This can be used to specify the pairs to plot. It is an optional list of character string vectors, where each of these vectors has length two. Each list element corresponds to one
pair, where the first character string gives the name of the first member of the respective pair to plot and the second character string gives the name of the second member.
This argument overwrites |
allwith |
This is an optional character string that can be set to the name of one of the variables. If provided, only variable pairs will
be considered that feature the variable specified by this argument |
pvalues |
Set to |
twoplots |
Set to |
addtitles |
Set to |
plotit |
This states whether the plots are actually plotted or merely returned as |
For each considered pair the bivariable influence of both pair members on the outcome estimated
using a two-dimensional flexible function is shown. Such visualisations make it possible to learn
about the forms of the interaction effects between variable pairs with large EIM values. Moreover,
these visualisations reveal (pathological) cases in which variable pairs do not show
indications of interaction effects despite featuring large EIM values.
For binary outcomes the probabilities for the second class are estimated, for categorical outcomes with more than two classes
the probabilities for the largest class (i.e., the class with the most observations) are estimated (using the function plotPair
, a different
class can be selected instead), for metric outcomes the means of the outcome are estimated, and for survival outcomes
the log hazards ratio values compared to the median effect are estimated.
The kinds of estimates shown differ also according to whether both pair members are metric or only one of the two
members is metric and the other one categorical or both pair members are categorical:
If both pair members are metric and the outcome is categorical or metric we use two-dimensional LOESS regression, where in the case of categorical outcomes, to obtain probability estimates for the first class (or largest class for multi-class outcomes), we use the value '1' for the first class (largest class for multi-class outcomes) and the value '0' for the second class (all other classes for multi-class outcomes).
If both pair members are metric and the outcome is
survival we use a Cox proportional hazard additive model with a two-dimensional LOESS smooth (gamcox
function
from the 'MapGAM' package (version 1.2-5)) and in the rare cases for which the latter fails, we use classical Cox regression
with an interaction term between the two covariates.
If one pair member is metric and the other one categorical and the outcome is categorical or metric, we use LOESS regression between the outcome (coded as '0' and '1' in the case of categorical outcomes as described above) and the values of the metric variable separately for each category of the categorical variable. In the rare cases in which the LOESS regression fails we use classical linear regression.
If one pair member is metric and the other one categorical and the outcome is survival, we use Cox regression with a linear tail-restricted cubic spline with four knots (univariable LOESS regression for survival outcomes does not seem to be available yet in R) separately for each category of the categorical variable. In cases in which the fitting of this spline regression fails we use classical Cox regression.
If both pair members are categorical and the outcome is categorical or metric, we simply calculate the mean of the outcome (coded as '0' and '1' in the case of categorical outcomes as described above) for each possible combination of the categories of the two variables.
If both pair members are categorical and the outcome is survival, we use classical Cox regression with an interaction term between the two variables (there is no need for any flexible modelling in this setting, because the Cox model with two categorical variables plus interaction term is saturated).
As described above (function argument: pvalues
), there is an option to add p-values from tests for interaction effect
to the plots. If at least one of the variables in the considered variable pair is categorical and features more
than two categories, there are more than one interaction terms in the regression approaches used for testing,
because the categorical variables are dummy-coded. Therefore, in these cases we obtain a p-value for each interaction term.
to obtain a single p-value for the test for interaction we adjust these multiple p-values using the Holm-Bonferroni
procedure and take the minimum of the adjusted p-values.
NOTE: These p-values are generally much too optimistic, in particular for small data sets and
large numbers of variables. The reason for this overoptimism is that these p-values are not adjusted
for the fact that we already used the data to find the variable pairs with strongest indications of
interaction effects. This is similar to a multiple testing problem.
Therefore, these p-values should only be seen as a rough guide to be interpreted very cautiously
and MUST NOT be reported as the results of a statistical test for significance
of interaction. To obtain adjusted p-values that would correspond to
valid tests, it would be possible to multiply these p-values by the number of possible pairs,
which would correspond to Bonferroni-adjusted p-values. For example, assume that we have 30
covariate variables. In that case the number of possible pairs would be 'choose(30, 2) = 435',
which is why we would need to multiply each p-value by 435
to obtain an adjusted p-value (or keep the original p-values and divide the significance
level 0.05 by 435). Note, however, that Bonferroni-adjusted p-values deliver quite conservative results,
that is, weaker effects might not be detected using these p-values, while, however, effects for
which these p-values are small () are most likely relevant.
Note further that these Bonferroni-adjusted p-values should be interpreted
with caution because, stemming from bivariable models, these p-values do not take the
multivariable nature of the data into account.
A list of ggplot2 plots returned invisibly.
Roman Hornung
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
## Not run: ## Load package: library("diversityForest") ## Set seed to make results reproducible: set.seed(1234) ## Construct interaction forest and calculate EIM values: data(stock) model <- interactionfor(dependent.variable.name = "company10", data = stock, num.trees = 20) # NOTE: num.trees = 20 (in the above) would be much too small for practical # purposes. This small number of trees was simply used to keep the # runtime of the example short. # The default number of trees is num.trees = 20000 if EIM values are calculated # and num.trees = 2000 otherwise. ## Obtain a first overview by applying the plot() function for ## interactionfor obects: plot(model) ## Several possible application cases of the plotEffects() function: # Visualise the estimated bivariable influences of the five variable pairs with the # largest quantitative EIM values: plotEffects(model) # type="quant" is default. # Visualise the estimated bivariable influences of the five pairs with the # largest qualitative EIM values: plotEffects(model, type="qual") # Visualise the estimated bivariable influences of all (eight) pairs that involve # the variable "company7" sorted in decreasing order according to the # qualitative EIM values: plotEffects(model, allwith="company7", type="qual", numpairs=8) # Visualise the estimated bivariable influences of the pairs with third and fifth # largest qualitative EIM values: plotEffects(model, type="qual", indpairs=c(3,5)) # Visualise the estimated bivariable influences of the pairs ("company3", "company5") and # ("company1", "company9"): plotEffects(model, pairs=list(c("company3", "company5"), c("company1", "company9"))) ## Saving of plots generated with the plotEffects() function (e.g., for use in publications): # Apply plotEffects() to obtain plots for the five variable pairs # with the largest qualitative EIM values and store these plots in # an object 'ps': ps <- plotEffects(model, type="qual", pvalues=FALSE, twoplots=FALSE, addtitles=FALSE, plotit=FALSE) # pvalues = FALSE states that no p-values should be shown in the plots, # because these might not be desired in plots meant for publication. # twoplots = FALSE ensures that we get one plot for each page instead of two plots per page. # addtitles = FALSE removes the automatically generated titles, because these are likely # not desired in publications. # plotit = FALSE ensures that the plots are not displayed, but only returned (invisibly) # by plotEffects(). # Save the plot with second largest qualitative EIM value: p1 <- ps[[2]] # Add title: library("ggpubr") p1 <- annotate_figure(p1, top = text_grob("My descriptive plot title 1", face = "bold", size = 14)) p1 # Save as PDF: # library("ggplot2") # ggsave(file="mypathtofolder/FigureXY1.pdf", width=14, height=6) # Save the plot with fifth largest qualitative EIM value: p2 <- ps[[5]] # Add title: p2 <- annotate_figure(p2, top = text_grob("My descriptive plot title 2", face = "bold", size = 14)) p2 # Save as PDF: # ggsave(file="mypathtofolder/FigureXY1.pdf", width=14, height=6) # Combine both of the above plots: p <- ggarrange(p1, p2, nrow = 2) p # Save the combined plot: # ggsave(file="mypathtofolder/FigureXYcombined.pdf", width=14, height=11) # NOTE: Using plotEffects() it is not possible to change the plots # themselves (by e.g., increasing the label sizes or changing the # axes ranges). However, the function plotPair() can be used to change # the plots themselves. ## End(Not run)
## Not run: ## Load package: library("diversityForest") ## Set seed to make results reproducible: set.seed(1234) ## Construct interaction forest and calculate EIM values: data(stock) model <- interactionfor(dependent.variable.name = "company10", data = stock, num.trees = 20) # NOTE: num.trees = 20 (in the above) would be much too small for practical # purposes. This small number of trees was simply used to keep the # runtime of the example short. # The default number of trees is num.trees = 20000 if EIM values are calculated # and num.trees = 2000 otherwise. ## Obtain a first overview by applying the plot() function for ## interactionfor obects: plot(model) ## Several possible application cases of the plotEffects() function: # Visualise the estimated bivariable influences of the five variable pairs with the # largest quantitative EIM values: plotEffects(model) # type="quant" is default. # Visualise the estimated bivariable influences of the five pairs with the # largest qualitative EIM values: plotEffects(model, type="qual") # Visualise the estimated bivariable influences of all (eight) pairs that involve # the variable "company7" sorted in decreasing order according to the # qualitative EIM values: plotEffects(model, allwith="company7", type="qual", numpairs=8) # Visualise the estimated bivariable influences of the pairs with third and fifth # largest qualitative EIM values: plotEffects(model, type="qual", indpairs=c(3,5)) # Visualise the estimated bivariable influences of the pairs ("company3", "company5") and # ("company1", "company9"): plotEffects(model, pairs=list(c("company3", "company5"), c("company1", "company9"))) ## Saving of plots generated with the plotEffects() function (e.g., for use in publications): # Apply plotEffects() to obtain plots for the five variable pairs # with the largest qualitative EIM values and store these plots in # an object 'ps': ps <- plotEffects(model, type="qual", pvalues=FALSE, twoplots=FALSE, addtitles=FALSE, plotit=FALSE) # pvalues = FALSE states that no p-values should be shown in the plots, # because these might not be desired in plots meant for publication. # twoplots = FALSE ensures that we get one plot for each page instead of two plots per page. # addtitles = FALSE removes the automatically generated titles, because these are likely # not desired in publications. # plotit = FALSE ensures that the plots are not displayed, but only returned (invisibly) # by plotEffects(). # Save the plot with second largest qualitative EIM value: p1 <- ps[[2]] # Add title: library("ggpubr") p1 <- annotate_figure(p1, top = text_grob("My descriptive plot title 1", face = "bold", size = 14)) p1 # Save as PDF: # library("ggplot2") # ggsave(file="mypathtofolder/FigureXY1.pdf", width=14, height=6) # Save the plot with fifth largest qualitative EIM value: p2 <- ps[[5]] # Add title: p2 <- annotate_figure(p2, top = text_grob("My descriptive plot title 2", face = "bold", size = 14)) p2 # Save as PDF: # ggsave(file="mypathtofolder/FigureXY1.pdf", width=14, height=6) # Combine both of the above plots: p <- ggarrange(p1, p2, nrow = 2) p # Save the combined plot: # ggsave(file="mypathtofolder/FigureXYcombined.pdf", width=14, height=11) # NOTE: Using plotEffects() it is not possible to change the plots # themselves (by e.g., increasing the label sizes or changing the # axes ranges). However, the function plotPair() can be used to change # the plots themselves. ## End(Not run)
This function allows to visualise the (estimated) bivariable influence of a single specific pair of variables on the outcome. The estimation
and plotting is performed in the same way as in plotEffects
. However, plotPair
does not require an interactionfor
object
and can thus be used also without a constructed interaction forest.
plotPair( pair, yvarname, statusvarname = NULL, data, levelsorder1 = NULL, levelsorder2 = NULL, categprob = NULL, pvalue = TRUE, returnseparate = FALSE, intobj = NULL )
plotPair( pair, yvarname, statusvarname = NULL, data, levelsorder1 = NULL, levelsorder2 = NULL, categprob = NULL, pvalue = TRUE, returnseparate = FALSE, intobj = NULL )
pair |
Character string vector of length two, where the first character string gives the name of the first member of the respective pair to plot and the second character string gives the name of the second member.
Note that the order of the two pair members in |
yvarname |
Name of outcome variable. |
statusvarname |
Name of status variable, only applicable to survival data. |
data |
Data frame containing the variables. |
levelsorder1 |
Optional. Order the categories of the first variable should have in the plot (if it is categorical). Character string vector, where the i-th entry contains the name of the category that should take the i-th place in the ordering of the categories of the first variable. |
levelsorder2 |
Optional. Order the categories of the second variable should have in the plot (if it is categorical). Character string vector specified in an analogous
way as |
categprob |
Optional. Only relevant for categorical outcomes with more than two classes.
Name of the class for which probabilities should be estimated. As described in |
pvalue |
Set to |
returnseparate |
Set to |
intobj |
Optional. Object of class |
See the 'Details' section of plotEffects
.
A ggplot2 plot.
Roman Hornung
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
plotEffects
, plot.interactionfor
## Not run: ## Load package: library("diversityForest") ## Visualise the estimated bivariable influence of 'toothed' and 'feathers' on ## the probability of type="mammal": data(zoo) plotPair(pair = c("toothed", "feathers"), yvarname="type", data = zoo) ## Visualise the estimated bivariable influence of 'creat' and 'hgb' on ## survival (more precisely, on the log hazards ratio compared to the ## median effect): library("survival") mgus2compl <- mgus2[complete.cases(mgus2),] plotPair(pair=c("creat", "hgb"), yvarname="futime", statusvarname = "death", data=mgus2compl) # Problem: The outliers in the left plot make it difficult to see what is going # on in the region with creat values smaller than about two even though the # majority of values lie there. # --> Solution: We re-run the above line setting returnseparate = TRUE, because # this allows to get the two ggplot plots separately, which can then be manipulated # to change the x-axis range in order to remove the outliers: ps <- plotPair(pair=c("creat", "hgb"), yvarname="futime", statusvarname = "death", data=mgus2compl, returnseparate = TRUE) # Change the x-axis range: library("ggplot2") ps[[1]] + xlim(c(0.5,2)) # Save the plot: # ggsave(file="mypathtofolder/FigureXY1.pdf", width=7, height=6) # We can, for example, also change the label sizes of the second plot: # With original label sizes: ps[[2]] # With larger label sizes: ps[[2]] + theme(axis.title=element_text(size=15)) # Save the plot: # library("ggplot2") # ggsave(file="mypathtofolder/FigureXY2.pdf", width=7, height=6) ## End(Not run)
## Not run: ## Load package: library("diversityForest") ## Visualise the estimated bivariable influence of 'toothed' and 'feathers' on ## the probability of type="mammal": data(zoo) plotPair(pair = c("toothed", "feathers"), yvarname="type", data = zoo) ## Visualise the estimated bivariable influence of 'creat' and 'hgb' on ## survival (more precisely, on the log hazards ratio compared to the ## median effect): library("survival") mgus2compl <- mgus2[complete.cases(mgus2),] plotPair(pair=c("creat", "hgb"), yvarname="futime", statusvarname = "death", data=mgus2compl) # Problem: The outliers in the left plot make it difficult to see what is going # on in the region with creat values smaller than about two even though the # majority of values lie there. # --> Solution: We re-run the above line setting returnseparate = TRUE, because # this allows to get the two ggplot plots separately, which can then be manipulated # to change the x-axis range in order to remove the outliers: ps <- plotPair(pair=c("creat", "hgb"), yvarname="futime", statusvarname = "death", data=mgus2compl, returnseparate = TRUE) # Change the x-axis range: library("ggplot2") ps[[1]] + xlim(c(0.5,2)) # Save the plot: # ggsave(file="mypathtofolder/FigureXY1.pdf", width=7, height=6) # We can, for example, also change the label sizes of the second plot: # With original label sizes: ps[[2]] # With larger label sizes: ps[[2]] + theme(axis.title=element_text(size=15)) # Save the plot: # library("ggplot2") # ggsave(file="mypathtofolder/FigureXY2.pdf", width=7, height=6) ## End(Not run)
Prediction with new data and a saved forest from divfor
.
## S3 method for class 'divfor' predict( object, data = NULL, predict.all = FALSE, num.trees = object$num.trees, type = "response", se.method = "infjack", quantiles = c(0.1, 0.5, 0.9), seed = NULL, num.threads = NULL, verbose = TRUE, ... )
## S3 method for class 'divfor' predict( object, data = NULL, predict.all = FALSE, num.trees = object$num.trees, type = "response", se.method = "infjack", quantiles = c(0.1, 0.5, 0.9), seed = NULL, num.threads = NULL, verbose = TRUE, ... )
object |
|
data |
New test data of class |
predict.all |
Return individual predictions for each tree instead of aggregated predictions for all trees. Return a matrix (sample x tree) for classification and regression, a 3d array for probability estimation (sample x class x tree) and survival (sample x time x tree). |
num.trees |
Number of trees used for prediction. The first |
type |
Type of prediction. One of 'response', 'se', 'terminalNodes', 'quantiles' with default 'response'. See below for details. |
se.method |
Method to compute standard errors. One of 'jack', 'infjack' with default 'infjack'. Only applicable if type = 'se'. See below for details. |
quantiles |
Vector of quantiles for quantile prediction. Set |
seed |
Random seed. Default is |
num.threads |
Number of threads. Default is number of CPUs available. |
verbose |
Verbose output on or off. |
... |
further arguments passed to or from other methods. |
This package is a fork of the R package 'ranger' that implements random forests using an
efficient C++ implementation. More precisely, 'diversityForest' was written by modifying
the code of 'ranger', version 0.11.0. Therefore, details on further functionalities
of the code that are not presented in the help pages of 'diversityForest' are found
in the help pages of 'ranger' (version 0.11.0). The code in the example sections of divfor
and tunedivfor
can be
used as a template for all common application scenarios with respect to classification,
regression and survival prediction using univariable, binary splitting. Some function
arguments adopted from the 'ranger' package may not be useable with diversity forests
(for the current package version).
Object of class divfor.prediction
with elements
predictions |
Predicted classes/values (only for classification and regression) |
unique.death.times |
Unique death times (only for survival). |
chf |
Estimated cumulative hazard function for each sample (only for survival). |
survival |
Estimated survival function for each sample (only for survival). |
num.trees |
Number of trees. |
num.independent.variables |
Number of independent variables. |
treetype |
Type of forest/tree. Classification, regression or survival. |
num.samples |
Number of samples. |
Marvin N. Wright
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Wager, S., Hastie T., & Efron, B. (2014). Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research 15:1625-1651.
Meinshausen (2006). Quantile Regression Forests. Journal of Machine Learning Research 7:983-999.
Prediction with new data and a saved interaction forest from interactionfor
.
## S3 method for class 'interactionfor' predict( object, data = NULL, predict.all = FALSE, num.trees = object$num.trees, type = "response", se.method = "infjack", quantiles = c(0.1, 0.5, 0.9), seed = NULL, num.threads = NULL, verbose = TRUE, ... )
## S3 method for class 'interactionfor' predict( object, data = NULL, predict.all = FALSE, num.trees = object$num.trees, type = "response", se.method = "infjack", quantiles = c(0.1, 0.5, 0.9), seed = NULL, num.threads = NULL, verbose = TRUE, ... )
object |
|
data |
New test data of class |
predict.all |
Return individual predictions for each tree instead of aggregated predictions for all trees. Return a matrix (sample x tree) for classification and regression, a 3d array for probability estimation (sample x class x tree) and survival (sample x time x tree). |
num.trees |
Number of trees used for prediction. The first |
type |
Type of prediction. One of 'response', 'se', 'terminalNodes', 'quantiles' with default 'response'. See below for details. |
se.method |
Method to compute standard errors. One of 'jack', 'infjack' with default 'infjack'. Only applicable if type = 'se'. See below for details. |
quantiles |
Vector of quantiles for quantile prediction. Set |
seed |
Random seed. Default is |
num.threads |
Number of threads. Default is number of CPUs available. |
verbose |
Verbose output on or off. |
... |
further arguments passed to or from other methods. |
Note that his package is a fork of the R package 'ranger' that implements random forests using an efficient C++ implementation. The documentation is in large parts taken from 'ranger', where some parts of the documentation may not apply to (the current version of) the 'diversityForest' package. Details on further functionalities of the code that are not presented in the help pages of 'diversityForest' are found in the help pages of 'ranger' (version 0.11.0).
Object of class interaction.prediction
with elements
predictions |
Predicted classes/values (only for classification and regression) |
unique.death.times |
Unique death times (only for survival). |
chf |
Estimated cumulative hazard function for each sample (only for survival). |
survival |
Estimated survival function for each sample (only for survival). |
num.trees |
Number of trees. |
num.independent.variables |
Number of independent variables. |
treetype |
Type of forest/tree. Classification, regression or survival. |
num.samples |
Number of samples. |
Marvin N. Wright, Roman Hornung
Hornung, R., Boulesteix, A.-L. (2022). Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects. Computational Statistics & Data Analysis 171:107460, <doi:10.1016/j.csda.2022.107460>.
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
Wager, S., Hastie T., & Efron, B. (2014). Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research 15:1625-1651.
Meinshausen (2006). Quantile Regression Forests. Journal of Machine Learning Research 7:983-999.
This data set contains 950 daily stock prices from January 1988 through October 1991, for ten aerospace companies.
The names of the companies are anonymised and the stock prices for one of these companies (company10
) were
flagged as the outcome variable. Thus, for this data set, both the outcome and the covariates were metric.
A data frame with 950 observations, nine covariates and one metric outcome variable
The variables are as follows: covariates: company1
, ..., company9
, outcome variable: company10
.
OpenML: data.name: stock, data.id: 223, link: https://www.openml.org/d/223/
Vanschoren, J., van Rijn, J. N., Bischl, B., Torgo, L. (2013). OpenML: networked science in machine learning. SIGKDD Explorations 15(2):49-60, <doi:10.1145/2641190.2641198>.
## Load data: data(stock) ## Dimension of data: dim(stock) ## First rows of data: head(stock)
## Load data: data(stock) ## Dimension of data: dim(stock) ## First rows of data: head(stock)
nsplits
and proptry
First, both for nsplits
and proptry
a grid of possible values may be provided,
where default grids are used if no grids are provided. Second, for each pairwise combination of
values from these two grids a forest is constructed. Third,
that pair of nsplits
and proptry
values is used as the optimized set of parameter
values that is associated with the smallest out-of-bag prediction error. If several pairs of
parameter values are associated with the same smallest out-of-bag prediction error, the
pair with the smallest (parameter) values is used.
tunedivfor( formula = NULL, data = NULL, nsplitsgrid = c(2, 5, 10, 30, 50, 100, 200), proptrygrid = c(0.05, 1), num.trees.pre = 500 )
tunedivfor( formula = NULL, data = NULL, nsplitsgrid = c(2, 5, 10, 30, 50, 100, 200), proptrygrid = c(0.05, 1), num.trees.pre = 500 )
formula |
Object of class |
data |
Training data of class |
nsplitsgrid |
Grid of values to consider for |
proptrygrid |
Grid of values to consider for |
num.trees.pre |
Number of trees used for each forest constructed during tuning parameter optimization. Default is 500. |
List with elements
nsplitsopt |
Optimized value of |
proptryopt |
Optimized value of |
tunegrid |
Two-dimensional |
ooberrs |
The out-of-bag prediction errors obtained for each pair of values considered for |
Roman Hornung
Hornung, R. (2022). Diversity forests: Using split sampling to enable innovative complex split procedures in random forests. SN Computer Science 3(2):1, <doi:10.1007/s42979-021-00920-1>.
Wright, M. N., Ziegler, A. (2017). ranger: A fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77:1-17, <doi:10.18637/jss.v077.i01>.
## Load package: library("diversityForest") ## Set seed to obtain reproducible results: set.seed(1234) ## Tuning parameter optimization for the iris data set: tuneres <- tunedivfor(formula = Species ~ ., data = iris, num.trees.pre = 20) # NOTE: num.trees.pre = 20 is specified too small for practical # purposes - the out-of-bag error estimates of the forests # constructed during optimization will be much too variable!! # In practice, num.trees.pre = 500 (default value) or a # larger number should be used. tuneres tuneres$nsplitsopt tuneres$proptryopt tuneres$tunegrid tuneres$ooberrs
## Load package: library("diversityForest") ## Set seed to obtain reproducible results: set.seed(1234) ## Tuning parameter optimization for the iris data set: tuneres <- tunedivfor(formula = Species ~ ., data = iris, num.trees.pre = 20) # NOTE: num.trees.pre = 20 is specified too small for practical # purposes - the out-of-bag error estimates of the forests # constructed during optimization will be much too variable!! # In practice, num.trees.pre = 500 (default value) or a # larger number should be used. tuneres tuneres$nsplitsopt tuneres$proptryopt tuneres$tunegrid tuneres$ooberrs
This data set describes 101 different biological species using 16 simple attributes,
where 15 of these are binary and one is metric (the number of legs). The outcome
"mammal vs. other" (type
) is binary.
A data frame with 101 observations, 16 covariates and one binary outcome variable
The variables are as follows:
hair
. factor. Presence of hairs (true = yes; false = no)
feathers
. factor. Presence of feathers (true = yes; false = no)
eggs
. factor. Does the species lay eggs? (true = yes; false = no)
milk
. factor. Does the species give milk? (true = yes; false = no)
airborne
. factor. Does the species fly? (true = yes; false = no)
aquatic
. factor. Does the species live in the water? (true = yes; false = no)
predator
. factor. Is the species a predator? (true = yes; false = no)
toothed
. factor. Presence of teeth (true = yes; false = no)
backbone
. factor. Presence of backbone (true = yes; false = no)
breathes
. factor. Does the species breathe with lungs? (true = yes; false = no)
venomous
. factor. Is the species venomous? (true = yes; false = no)
fins
. factor. Presence of fins (true = yes; false = no)
legs
. metric. Number of legs
tail
. factor. Presence of tail (true = yes; false = no)
domestic
. factor. Is the species domestic? (true = yes; false = no)
catsize
. factor. Is the species large? (true = yes; false = no)
type
. factor. Binary outcome variable - type of species ('mammal' vs. 'other')
The original openML dataset contains an additional variable animal
, which is removed in this
version of the data set. This variable provided the names of all species.
OpenML: data.name: zoo, data.id: 965, link: https://www.openml.org/d/965/
Vanschoren, J., van Rijn, J. N., Bischl, B., Torgo, L. (2013). OpenML: networked science in machine learning. SIGKDD Explorations 15(2):49-60, <doi:10.1145/2641190.2641198>.
Dua, D., Graff, C. (2019) UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/.
##' Load data: data(zoo) ##' Numbers of observations in the two classes: table(zoo$type) ##' Dimension of data: dim(zoo) ##' First rows of data: head(zoo)
##' Load data: data(zoo) ##' Numbers of observations in the two classes: table(zoo$type) ##' Dimension of data: dim(zoo) ##' First rows of data: head(zoo)