What are random forests?

    Random forests [1] perform very similarly to boosting on many problems, but they are easier to train and tune. Individual trees are noisy, which is why bagging comes in handy: averaging many trees reduces the variance. Because each tree in a bagged ensemble is identically distributed, the bias of the bagged ensemble is the same as the bias of an individual tree; only the variance can be reduced.

    The main idea of random forests is to improve on bagging's variance reduction by reducing the correlation between the trees the ensemble is built upon. This is achieved by growing each tree with a random selection of the input variables at every split.
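    To make this precise, here is a standard result for averages of identically distributed variables: if each tree has variance σ² and the pairwise correlation between trees is ρ, the variance of the average of B trees is

        \mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2

    As B grows, the second term vanishes, so the remaining variance is governed by the correlation ρ; the random selection of input variables lowers ρ and therefore the variance of the ensemble.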

    Random forests, and tree-based models in general, have little need for feature normalization. Random forests often work well without complicated setup and provide more robust results than a single decision tree. They work very well for tabular data but not for images, signals, or text. They can be computationally heavy depending on how many trees they are made up of, and are obviously heavier than a single decision tree. They are also not as easy to interpret as a single decision tree.

    Scikit-learn has a random forest classifier [2] and a random forest regressor [3].
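    A minimal usage sketch with scikit-learn; the iris data set and the hyperparameter values are placeholders for illustration:

        from sklearn.datasets import load_iris
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        # Toy tabular data set; any (X, y) pair works the same way.
        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_train, y_train)
        print(clf.score(X_test, y_test))  # mean accuracy on the held-out split

    RandomForestRegressor is used the same way for regression targets.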

    The learning algorithm

    Each tree in the ensemble is trained on its own bootstrap sample of the training set, the technique known as bagging. Instead of considering all possible features of the data set, we only consider a random subset at each split, typically √p of the p available features for classification (p/3 for regression). A sketch of this procedure is given below.
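    The following sketch of the training loop assumes scikit-learn's DecisionTreeClassifier handles the per-split feature subsampling via max_features; fit_forest is a name made up for this example:

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def fit_forest(X, y, n_trees=100, random_state=0):
            """Train n_trees decision trees, each on a bootstrap sample
            of the rows and with a random subset of features per split."""
            rng = np.random.default_rng(random_state)
            n_samples = X.shape[0]
            trees = []
            for _ in range(n_trees):
                # Bagging: sample rows with replacement.
                idx = rng.integers(0, n_samples, size=n_samples)
                # max_features="sqrt" draws a fresh random feature
                # subset of size sqrt(p) at every split.
                tree = DecisionTreeClassifier(
                    max_features="sqrt",
                    random_state=int(rng.integers(2**31 - 1)),
                )
                tree.fit(X[idx], y[idx])
                trees.append(tree)
            return trees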

    Prediction

    For regression we use the average of the individual tree predictions as the output.

    For classification we use majority voting or averaging of the predicted class probabilities as the output.
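    Continuing the sketch above (trees is the list returned by the hypothetical fit_forest, or regression trees for the regression case), the prediction step is just aggregation:

        import numpy as np

        def predict_forest_regression(trees, X):
            # Average the individual tree outputs.
            return np.mean([tree.predict(X) for tree in trees], axis=0)

        def predict_forest_classification(trees, X):
            # Soft voting: average the per-tree class probabilities and
            # pick the most probable class. Hard majority voting over
            # tree.predict(X) is the other common option.
            probs = np.mean([tree.predict_proba(X) for tree in trees], axis=0)
            return probs.argmax(axis=1)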

    Hyperparameters

    We can choose how many trees the model should consist of. More trees result in a slower but usually more accurate model, because we take advantage of the fundamental idea of ensembles: averaging many decorrelated decision trees. We can also choose how many features to consider when building new tree nodes. The standard hyperparameters for decision trees apply here as well. A configuration sketch is given below.
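    A configuration sketch using scikit-learn's RandomForestClassifier; the values shown are illustrative, not recommendations:

        from sklearn.ensemble import RandomForestClassifier

        model = RandomForestClassifier(
            n_estimators=500,      # number of trees: slower, usually more accurate
            max_features="sqrt",   # features considered at each split
            max_depth=None,        # standard decision tree hyperparameters apply too
            min_samples_leaf=1,
            n_jobs=-1,             # trees are independent, so training parallelizes
            random_state=0,
        )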

    References