What is a decision tree?

    A decision tree [1] is a classic learning model, well suited to binary classification. The tree is built from questions: each internal node represents a question, and the path from the root to a node encodes the answers given so far. Each non-terminal node has two children, corresponding to answering "no" or "yes". The questions correspond to features, and the value being predicted is called the label. Decision trees can output probabilities as well.
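
    As a minimal illustration of this structure, here is a hand-built toy tree in Python (the fruit features and labels are invented for the example):

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class Node:
            """A node in a binary decision tree."""
            feature: Optional[str] = None   # question asked here (None at a leaf)
            no: Optional["Node"] = None     # child for answering "no"
            yes: Optional["Node"] = None    # child for answering "yes"
            label: Optional[str] = None     # prediction stored at a leaf

        def predict(node: Node, example: dict) -> str:
            # Walk from the root, answering the question at each internal
            # node, until a leaf is reached; its label is the prediction.
            while node.label is None:
                node = node.yes if example[node.feature] else node.no
            return node.label

        # A tiny hand-built tree: first ask "round?", then "sweet?".
        tree = Node(feature="round",
                    no=Node(label="banana"),
                    yes=Node(feature="sweet",
                             no=Node(label="lemon"),
                             yes=Node(label="apple")))

        print(predict(tree, {"round": True, "sweet": True}))  # -> apple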

    Scikit-learn provides decision tree models [2][3][4].
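
    For instance, a minimal usage sketch with scikit-learn's DecisionTreeClassifier (the toy data below is made up):

        from sklearn.tree import DecisionTreeClassifier

        # Toy data: two binary features per example, binary labels.
        X = [[0, 0], [0, 1], [1, 0], [1, 1]]
        y = [0, 0, 0, 1]

        clf = DecisionTreeClassifier(max_depth=2, random_state=0)
        clf.fit(X, y)

        print(clf.predict([[1, 1]]))        # hard class prediction
        print(clf.predict_proba([[1, 1]]))  # class probabilities (see above)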

    The learning algorithm

    The learning algorithm can be simplified to:

    • select the feature F that gives the best split of the data into subsets
    • create a (sub)tree with F as the root
    • recurse on each subset

    To avoid an infinite recursion, there are base cases that stop the splitting, such as the subset being (nearly) homogeneous or a maximum depth being reached.
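
    Put together, the recursion could look like the following sketch (the nested-dict representation and the Gini-based scoring are choices made for this example; the split criteria are discussed in the next section):

        from collections import Counter

        def gini(labels):
            # Gini impurity of a subset (see the next section).
            n = len(labels)
            return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

        def best_feature(examples, labels, features):
            # The best split is the one with the lowest size-weighted impurity.
            def score(f):
                yes = [l for ex, l in zip(examples, labels) if ex[f]]
                no = [l for ex, l in zip(examples, labels) if not ex[f]]
                if not yes or not no:            # feature separates nothing
                    return float("inf")
                n = len(labels)
                return (len(yes) * gini(yes) + len(no) * gini(no)) / n
            return min(features, key=score)

        def build_tree(examples, labels, features, depth=0, max_depth=5):
            # Base cases keep the recursion finite: a homogeneous subset,
            # no features left to ask about, or the maximum depth reached.
            if len(set(labels)) == 1 or not features or depth >= max_depth:
                return {"label": Counter(labels).most_common(1)[0][0]}

            f = best_feature(examples, labels, features)
            yes = [(ex, l) for ex, l in zip(examples, labels) if ex[f]]
            no = [(ex, l) for ex, l in zip(examples, labels) if not ex[f]]
            if not yes or not no:                # no feature splits the data
                return {"label": Counter(labels).most_common(1)[0][0]}

            rest = [g for g in features if g != f]
            return {"feature": f,
                    "no": build_tree([e for e, _ in no], [l for _, l in no],
                                     rest, depth + 1, max_depth),
                    "yes": build_tree([e for e, _ in yes], [l for _, l in yes],
                                      rest, depth + 1, max_depth)}

        X = [{"round": True, "sweet": True},
             {"round": True, "sweet": False},
             {"round": False, "sweet": True}]
        y = ["apple", "lemon", "banana"]
        print(build_tree(X, y, ["round", "sweet"]))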

    Best split

    A good split divides a set into homogeneous subsets (subsets with little variation). There are a couple of techniques for measuring the homogeneity of a set: for classification the most popular are information gain and the Gini score, while for regression one can use the variance (preferring subsets with small variance).
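
    Concretely, these measures could be implemented as follows (a sketch; weighting subsets by their size is the standard convention):

        import math
        from collections import Counter

        def entropy(labels):
            # Shannon entropy: 0 for a pure subset, maximal for a 50/50 mix.
            n = len(labels)
            return -sum((c / n) * math.log2(c / n)
                        for c in Counter(labels).values())

        def gini(labels):
            # Gini score: chance of mislabeling a random element if labels
            # were drawn from the subset's own class distribution.
            n = len(labels)
            return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

        def information_gain(parent, subsets):
            # Parent entropy minus the size-weighted entropy of the subsets;
            # larger gain means the split produced more homogeneous subsets.
            n = len(parent)
            return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

        def variance(values):
            # For regression: a subset with small variance is homogeneous.
            mean = sum(values) / len(values)
            return sum((v - mean) ** 2 for v in values) / len(values)

        print(information_gain(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]]))  # 1.0
        print(gini(["a", "a", "b", "b"]))                                        # 0.5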

    Drawbacks

    Decision trees tend to overfit and therefore generalize poorly, which is why they are usually not very useful on their own. However, when combined in ensembles, e.g. in the popular random forest model, they perform well.
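
    As a quick illustration with scikit-learn (the dataset is chosen only for the example):

        from sklearn.datasets import load_breast_cancer
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_breast_cancer(return_X_y=True)

        # A single fully grown tree tends to overfit; an ensemble of trees
        # (a random forest) usually generalizes better on the same data.
        tree = DecisionTreeClassifier(random_state=0)
        forest = RandomForestClassifier(n_estimators=100, random_state=0)

        print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
        print("forest:", cross_val_score(forest, X, y, cv=5).mean())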

    References