1. Introduction

In this tutorial, we’ll show a method for estimating the effects of the depth and the number of trees on the performance of a random forest.

2. Problem Statement

Random Forest is an ensemble of Decision Trees. We train them separately and output their average prediction or majority vote as the forest’s prediction.

However, we need to set the hyper-parameters that affect learning before training the trees. In particular, we need to decide on the number of trees (n) and their maximal depth (d_{\max}).
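For instance, in scikit-learn, these two hyper-parameters correspond to n_estimators and max_depth. Here's a minimal sketch, using a synthetic dataset purely for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A made-up classification dataset, just for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# n corresponds to n_estimators, d_max to max_depth
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X, y)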

In this article, we won’t deal with setting \boldsymbol{n} and \boldsymbol{d_{\max}} but with estimating their effects on the forest’s performance. We’re interested in finding out how strong their influence is, whether or not they interact, and what type of relationship they have with the performance score. Does it change linearly with d_{\max} and n, or is the dependence more complex? Which hyper-parameter is more important? Are both influential, or is one of them irrelevant?

3. The Effects Estimation Methodology

Let Y \equiv Y_{d_{\max}, n} be the test score of a random forest model. For example, we can use the forest’s accuracy or AUROC on the test set in classification tasks. In regression problems, we can use the mean squared error. 
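For instance, assuming we've already trained a forest and set aside X_test and y_test from a prior split, we could compute these scores with scikit-learn like so:

from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

# Classification: accuracy and AUROC (binary case shown)
y_pred = forest.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
auroc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

# Regression: the mean squared error of a fitted regressor
# mse = mean_squared_error(y_test, regressor.predict(X_test))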

To estimate the influence of d_{\max} and n on the forest, we need to check if its test scores vary with the changes in the hyper-parameters. So, we evaluate Y at their various values and infer their effects from the experimental results. But this raises the question: what is an effect in statistics?

3.1. On Effects and Beyond

There’s no universally accepted definition of a factor’s effect on the response variable’s values. Which one we go with depends on what we deem the most natural way to describe “an effect” and our capabilities to measure it empirically.

For example, let’s say we record the response Y at two levels of a factor A, one “low” (a_{low}) and the other “high” (a_{high}). We can define A’s effect on Y as the change in the mean value of Y as A increases from the low to the high value. Alternatively, we can use the medians instead of means or define the effect as the probability that a random value we draw from the distribution of Y \mid A = a_{high} is higher than a randomly selected value of Y \mid A = a_{low}.
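To make this concrete, here's how we might compute the three candidate definitions on two small samples of Y (the scores below are made up):

import numpy as np

y_low = np.array([0.71, 0.74, 0.73, 0.70])   # scores when A = a_low
y_high = np.array([0.78, 0.80, 0.77, 0.81])  # scores when A = a_high

effect_means = y_high.mean() - y_low.mean()            # difference in means
effect_medians = np.median(y_high) - np.median(y_low)  # difference in medians

# P(Y | A = a_high  >  Y | A = a_low), estimated over all pairs
effect_prob = (y_high[:, None] > y_low[None, :]).mean()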

In this article, we’ll go with the differences in mean values of \boldsymbol{Y} as we vary \boldsymbol{d_{\max}} and \boldsymbol{n}. The reason is that it’s the most common approach in practice, and the theory of designing experiments with effects defined this way is well-established.

3.2. The Bigger Picture: the Nuisance Factors

The variability of test scores isn’t due only to \boldsymbol{d_{\max}} and \boldsymbol{n}. The partitioning of data into training and test sets also affects performance. What’s more, the process of training forest models is stochastic in itself. So, we can get different outcomes by training the forest using the same data multiple times.

That means we should consider other sources of variability, not just the hyper-parameters d_{\max} and n whose effects we want to determine, and evaluate Y_{d_{\max}, n} at different levels of the other factors. These are called nuisance factors and represent the variability sources we acknowledge but whose values we don’t plan to control and analyze in our experiment. Let’s list a few possible candidates:

  • seed – the random seed;
  • size_{train} – a number \in (0, 1) showing how much data we use for training;
  • the function to measure the quality of a split (e.g., the entropy or Gini impurity);
  • the minimal number of objects in a leaf, etc.

To keep things simple, we won’t consider them all. We’ll use only \boldsymbol{seed} and \boldsymbol{size_{train}} as the nuisance factors.

So, our estimation procedure will be like this:

algorithm EstimateEffects(D, N, dataset):
    // INPUT
    //    D = the pairs (d, n) of the maximal depth and the number of trees
    //    N = the Graeco-Latin square of m^2 (seed, size_train) pairs
    //    dataset = the dataset
    // OUTPUT
    //    the estimated effects of d_max and n, together with plots

    Y <- make a storage for the results

    for (d, n) in D:
        for j <- 1 to m^2:
            Split the dataset into train and test sets using the seed and size_train from the j-th pair in N
            Train a random forest with n trees of maximal depth d on the train set
            Y[d, n, j] <- evaluate the forest on the test set

    Estimate the effects from Y and make plots
    Analyze the effects and the plots

    return the estimates and the plots
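As a rough guide, here's one way this pseudocode might translate to Python with scikit-learn. It's a sketch under a few assumptions: we pass features X and labels y in place of the generic dataset, use accuracy as the score, and take the list N of (seed, size_train) pairs from the Graeco-Latin square we construct in the next section:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def estimate_effects(D, N, X, y):
    """D holds the (d, n) pairs; N holds the m^2 (seed, size_train) pairs."""
    Y = {}
    for d, n in D:
        for j, (seed, size_train) in enumerate(N):
            # The seed controls both the split and the forest's randomness
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=size_train, random_state=seed)
            forest = RandomForestClassifier(
                n_estimators=n, max_depth=d, random_state=seed)
            forest.fit(X_tr, y_tr)
            Y[(d, n, j)] = accuracy_score(y_te, forest.predict(X_te))
    return Y  # ready for plotting and effect estimation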

3.3. Setting the Nuisance Factors

We should set the nuisance factors in a way that makes their effects on performance distinguishable from those of d_{\max} and n. To do so, we evaluate each combination of \boldsymbol{d_{\max}} and \boldsymbol{n} on the same block of nuisance value pairs. We can construct those pairs using a Graeco-Latin square. So, if we have m seeds: seed_1, seed_2, \ldots, seed_m and the same number of train set sizes: size_1, size_2, \ldots, size_m, we arrange them in an m \times m matrix so that all the pairs are different and each value appears exactly once in each column and each row.

For m=3, we’d have a square like this:

    [\begin{matrix} (seed_1, size_1) & (seed_2, size_3) & (seed_3, size_2) \\ (seed_2, size_2) & (seed_3, size_1) & (seed_1, size_3) \\ (seed_3, size_3) & (seed_1, size_2) & (seed_2, size_1) \end{matrix}]

In the language of the design of experiments theory, we’d say that we block for \boldsymbol{seed} and \boldsymbol{size_{train}}.

Further, we should choose the nuisance values randomly in a way that doesn’t bias the effects of the depth and the number of trees to specific ranges of the nuisance factors’ values. To do so, we can divide the ranges of seed ([0, MAX\_INT]) and size_{train} (e.g., (0.7, 0.8)) into m equally spaced intervals. Then, we randomly select a value from each interval.
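As a sketch, here's one way to draw the values and arrange them into a Graeco-Latin square in Python (the modular construction below is a standard one that's valid for odd m):

import numpy as np

rng = np.random.default_rng(42)
m = 3
MAX_INT = 2**31 - 1

# One random value per equally spaced interval of each factor's range
seed_edges = np.linspace(0, MAX_INT, m + 1)
seeds = [int(rng.integers(int(seed_edges[k]), int(seed_edges[k + 1])))
         for k in range(m)]
size_edges = np.linspace(0.7, 0.8, m + 1)
sizes = [float(rng.uniform(size_edges[k], size_edges[k + 1])) for k in range(m)]

# Cell (i, j) pairs seed (i + j) mod m with size (i + 2j) mod m, so all
# pairs are distinct and each value appears once per row and per column
N = [(seeds[(i + j) % m], sizes[(i + 2 * j) % m])
     for i in range(m) for j in range(m)]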

3.4. How Many Nuisance Values?

We explained how to arrange selected values into m \times m blocks. However, we first need to determine \boldsymbol{m}, the number of values we’ll work with.

That choice shouldn’t be arbitrary since the statistical power of the analysis depends on it. We can define the maximal acceptable width (w_{threshold}) of the 1-\alpha confidence interval (CI) around the means of Y for any choice of d_{\max} and n, with the confidence level 1-\alpha set to a usual value such as 80\%, 95\%, or 99\%. Then, we set m to the smallest value that keeps the width below the threshold.

We need the standard deviation \sigma of Y to calculate the width. If Y \in [0, 1] (which is the case if we use a normalized score such as accuracy), \sigma can be at most 1/2. Since each combination of d_{\max} and n gets evaluated on all m^2 nuisance pairs, we find the smallest m for which:

    [\frac{2 \sigma z_{\alpha/2}}{\sqrt{m^2}} = \frac{z_{\alpha/2}}{m} \leq w_{threshold}]

where z_{\alpha/2} is the appropriate critical value of the standard normal distribution. The value of m calculated this way will overestimate the number of values needed because the actual \sigma is almost certainly smaller than 1/2. If we can justify a tighter bound on \sigma, such as 1/4, we can apply the same method with it. To find such a \sigma, we rely on theory and the results of previous research. However, if they’re not available, we should conduct a preliminary experiment to estimate \sigma. Then, we use the estimate to find m.
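Here's a short sketch of this calculation (assuming SciPy for the normal quantile), with the worst-case \sigma = 1/2 as the default:

import math
from scipy.stats import norm

def smallest_m(w_threshold, sigma=0.5, confidence=0.95):
    """Smallest m for which 2 * sigma * z / m stays below the threshold
    (each (d_max, n) combination yields m^2 replicate scores)."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil(2 * sigma * z / w_threshold)

# For a 95% CI no wider than 0.05: ceil(2 * 0.5 * 1.96 / 0.05) = 40
print(smallest_m(0.05))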

3.5. Setting the \boldsymbol{d_{\max}} and \boldsymbol{n} Values

Much in this step depends on our assumptions and the scope of questions we want to answer. In practice, we’ll define \ell values of d_{\max}: d_1 < d_2 < \ldots < d_{\ell} and \ell values of n: n_1 < n_2 < \ldots < n_{\ell}.

If we’re interested only in the behavior of Y for some particular choices of d_{\max} and n, we use them as our d_i and n_i (i=1,2,\ldots, \ell). In that case, our conclusions will be restricted only to those values, and we won’t be able to justify extrapolation to other choices of d_{\max} and n.

On the other hand, if we want to generalize inference to any d_{\max} and n in the corresponding ranges, then we should select the d_i and n_i randomly and treat the factors as random variables.

3.6. The Main and Simple Effects

How many values we choose depends on the type of dependence we’d like to detect and the assumptions we make in modeling Y. In doing so, we distinguish between main and simple effects.

The main effect of a factor shows how the response \boldsymbol{Y} changes when we vary the factor irrespective of the other factor’s values. For instance, we estimate the main effect of d_{\max} by calculating the means of Y for d_1, d_2, \ldots, d_{\ell}. If we’re reasonably sure the dependence of Y on d_{\max} is at most quadratic, we need only three values of d_{\max}. In general, capturing a polynomial dependence of order k \in \{1, 2, \ldots\} requires k + 1 values, i.e., \ell = k + 1.

A simple effect illustrates how changing a factor affects \boldsymbol{Y} at a particular value of another factor or at a combination of values of two or more other factors. So, if the simple effects of d_{\max} at all the values of n are the same, we conclude that d_{\max} and n don’t interact.

These expectations and assumptions translate to a particular mathematical model of Y.

3.7. Formalizing Effects with Models

For example, let’s say we assume that the effects of d_{\max} and n are linear and that the two factors may interact. Our model could look like this:

(1)   \begin{equation*} Y_{d, n} = \beta_0 + \beta_{1} d + \beta_{2} n + \beta_{12} dn + Error_{d, n} \end{equation*}

where Y_{d, n} is a random variable modeling the test score of a random forest with n trees of maximal depth d, and Error_{d, n} is a zero-mean normal variable representing the variability in Y not accounted for by d_{\max} and n. Usually, we assume that the Error_{d, n} variables are identically distributed, but we can go for more complex models. The coefficients \beta_1, \beta_2, and \beta_{12} denote the effects. If any of them is close to zero (which we can check using a hypothesis test or by inspecting the CIs), we may conclude that the corresponding effect doesn’t exist.

Visually, \beta_1 = 0 would mean that the means of Y for the two choices of d_{\max} (d_{low} and d_{high}) are more or less the same and their confidence intervals overlap. If \beta_{12} = 0, we expect the line connecting the mean Y values at (d_{low}, n_{low}) and (d_{high}, n_{low}) to be parallel to the line connecting the means at (d_{low}, n_{high}) and (d_{high}, n_{high}).
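Here's a minimal matplotlib sketch of such an interaction plot, with made-up mean scores for the four corner combinations:

import matplotlib.pyplot as plt

# Hypothetical mean test scores at the four (d, n) corners
means = {("low", "low"): 0.81, ("high", "low"): 0.85,
         ("low", "high"): 0.83, ("high", "high"): 0.87}

for n_level, style in [("low", "o-"), ("high", "s--")]:
    ys = [means[("low", n_level)], means[("high", n_level)]]
    plt.plot([0, 1], ys, style, label=f"n = {n_level}")

plt.xticks([0, 1], ["$d_{low}$", "$d_{high}$"])
plt.ylabel("mean test score")
plt.legend()
plt.show()  # roughly parallel lines hint that beta_12 is close to zero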

3.8. Lower-Order Effects Are Usually Sufficient

If we want to examine super-linear dependencies and effects, we use \boldsymbol{\ell > 2} values of \boldsymbol{d_{\max}} and \boldsymbol{n}. However, we’ll quickly run into problems because we’ll train and test m^2 forests for each of the \ell^2 combinations of d_{\max} and n. Depending on the size of our dataset, that could take a lot of time.

To address this issue, we can adopt a less ambitious experimental plan. Instead of aiming to test for an order-13 relationship, for which we’ll need \ell=14, we can settle for a lower-order limit. Much of the time, linear and quadratic models are reasonably accurate approximations of the real-world processes scientists examine. In our case, we could say that the coefficients of d^3,  d^2 n, d n^2, n^3, and all higher-order terms are zero or so close to zero that their effects on Y_{d, n} are negligible.

Therefore, our final model could be:

(2)   \begin{equation*}  Y_{d, n} = \beta_0 + \beta_1 d + \beta_{11} d^2 + \beta_{2} n + \beta_{22} n^2 + \beta_{12} dn + Error \qquad Error \sim Normal(0, \sigma^2) \end{equation*}

3.9. The Results

So, with \ell=3, we’ll have three values each of d_{\max} and n. Let’s code them as L (low), M (middle), and H (high). Our results will form a table like this:

    [\begin{matrix} d_{\max} & n & j & Y \\ L & L &  1 & y_{LL1} \\ L & L & 2 & y_{LL2} \\ \vdots \\ L & L & m^2 & y_{LLm^2} \\ L & M & 1 & y_{LM1} \\ L & M & 2 & y_{LM2} \\ \vdots \\ L & M & m^2 & y_{LMm^2} \\ \vdots \\ H & H & 1 & y_{HH1} \\ \vdots \\ H & H & m^2 & y_{HHm^2} \end{matrix}]

where j represents a combination of seed and size_{train} and the subscripted ys are the recorded test scores. So, we replicate each block of m^2 nuisance pairs across all 3^2 combinations of d_{\max} and n, train a forest, evaluate its test score, and then use the results to make plots and estimate the \beta coefficients in the model (2).
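For instance, once the scores are collected, we could estimate the coefficients of model (2) by ordinary least squares, as in this sketch with made-up results:

import numpy as np

# Hypothetical (d, n, test score) rows; a real table has m^2 rows per (d, n)
rows = [(2, 50, 0.80), (2, 100, 0.82), (2, 150, 0.83),
        (5, 50, 0.85), (5, 100, 0.88), (5, 150, 0.88),
        (8, 50, 0.86), (8, 100, 0.89), (8, 150, 0.90)]
d, n, y = np.array(rows, dtype=float).T

# Design matrix for model (2): 1, d, d^2, n, n^2, d*n
X = np.column_stack([np.ones_like(d), d, d**2, n, n**2, d * n])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["b0", "b1", "b11", "b2", "b22", "b12"], beta.round(5))))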

4. Words of Caution

If our assumptions don’t hold, we may draw invalid conclusions. For instance, if we limit the model to linear main effects, we may conclude there’s no effect even though a quadratic relationship between d_{\max} and Y is strong:

(Figure: Quadratic Effects vs. Linear Models)

Further, how to define an effect is still an open question in the statistical analysis of algorithms. Using differences in means as effects makes an implicit assumption that the mean values are representative of the distributions of Y_{d, n}. That may not be the case with every random forest and dataset.

We proposed determining m based on the confidence interval width around a single mean. However, the confidence interval around the difference in means may be more appropriate.

In our final model, d_{\max} and n have fixed effects. That means that we shouldn’t generalize the conclusions to the ranges of values outside of those tested. If we want to do so, we should go with random-effects models. They treat the effects as random variables and allow extrapolation to unseen values.

We deliberately didn’t mention statistical significance, although it’s a common practice to conduct null-hypothesis significance tests (NHSTs). A growing body of literature criticizes the NHST framework. The main issue is that the subtleties of the NHST methods are prone to misinterpretation and confuse the majority of practitioners. That doesn’t mean that we shouldn’t use the NHSTs, but we should be careful and know what we’re doing.

Finally, we should keep in mind that fitting a forest may take a long time, so even if we go for a simpler model, estimating the effects could prove infeasible. In that case, we should reduce the size of the training set.

5. How to Estimate Effects: A Quick Summary

Here’s a quick list of steps to estimate the effects of d_{\max} and n:

  1. First, we should select the values of d_{\max} and n to evaluate.
    • If we assume that the relationships are linear, we need only 2 values for each parameter (d_{low} and d_{high}, n_{low} and n_{high}).
    • To detect a polynomial dependency of order k > 1, we need \ell = k + 1 values.
    • Setting \ell=3 or \ell=4 should suffice because higher-order effects are usually negligible.
  2. Then, we select the values of the nuisance factors we’ll block for.
    • We use the same number of values for each nuisance factor (m).
    • To find m, we take the smallest value that keeps the width of the CIs around the means below our chosen threshold.
  3. We arrange the values of the nuisance factors in a block and replicate it across all the pairs of the maximal depth and number of trees. This way, we get our experimental design.
  4. We train a random forest for each combination of values in the design and record the score on the test set.
  5. Now, we plot the mean scores for each depth d and the number of trees n in the design. We also visualize their confidence intervals.
  6. If all the confidence intervals overlap, we conclude that the corresponding effect is negligible.
  7. If the intervals don’t overlap, the shape of the line reveals the type of dependence: null, linear, quadratic, cubic, or of a higher order.
  8. We fit a regression model to the results (including the terms based on the plots and assumptions).
  9. The absolute values of the coefficients represent the corresponding effect sizes.
    • We can test them for significance but be aware that significance testing has flaws.

We can vary some steps if we need, but the procedure to estimate the effects of \boldsymbol{d_{\max}} and \boldsymbol{n} on a random forest’s performance will look more or less like the one above.

6. Conclusion

In this article, we presented a methodology for estimating how the depth and the number of trees affect the performance of a random forest. We defined the hyper-parameters’ effects through differences in means, but there may be more appropriate effect definitions.

