What is an ensemble?

    Ensembles [1] are machine learning models that combine several other models, and they often reach higher accuracy than any of their members. Suppose an ensemble is built from many different classifiers, each with an accuracy of 0.6, and that the errors of the models are independent. Because of this diversity the classifiers complement each other, and a majority vote over them is correct with a higher probability than any single classifier. In practice, however, it is often not very realistic to assume that the errors are independent.
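
    To see why independence matters, a short calculation helps. The snippet below is a minimal sketch (the accuracy of 0.6 and the ensemble size of 11 are illustrative assumptions) that computes the probability that a majority vote of independent classifiers is correct, using the binomial distribution.

        from math import comb

        def majority_vote_accuracy(n_classifiers: int, accuracy: float) -> float:
            """Probability that a majority vote of n independent classifiers,
            each correct with the given probability, is correct."""
            majority = n_classifiers // 2 + 1
            return sum(
                comb(n_classifiers, k) * accuracy**k * (1 - accuracy) ** (n_classifiers - k)
                for k in range(majority, n_classifiers + 1)
            )

        # Assumed numbers: 11 independent classifiers, each with accuracy 0.6.
        print(majority_vote_accuracy(11, 0.6))  # ~0.75

    With these numbers the majority vote reaches roughly 0.75, well above the 0.6 of each member; when the errors are correlated, the gain shrinks.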

    An ensemble could use the average of all the different models' predictions as the final prediction. It could also use a technique called stacking, where a separate meta-model is trained on the base models' predictions to produce the final one (both strategies are sketched below).

    Scikit-learn has many useful methods for managing ensembles [2].
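
    As an illustration, here is a minimal sketch of both strategies using scikit-learn's ensemble module: VotingRegressor averages the base models' predictions, while StackingRegressor trains a final estimator on top of them. The toy data and the choice of base models are assumptions made for the example.

        from sklearn.datasets import make_regression
        from sklearn.ensemble import StackingRegressor, VotingRegressor
        from sklearn.linear_model import LinearRegression, Ridge
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeRegressor

        X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        base_models = [
            ("linear", LinearRegression()),
            ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
        ]

        # Averaging: the final prediction is the mean of the base models' predictions.
        averaging = VotingRegressor(estimators=base_models).fit(X_train, y_train)

        # Stacking: a meta-model (here a ridge regression) learns how to combine them.
        stacking = StackingRegressor(
            estimators=base_models, final_estimator=Ridge()
        ).fit(X_train, y_train)

        print("averaging R^2:", averaging.score(X_test, y_test))
        print("stacking  R^2:", stacking.score(X_test, y_test))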

    Training

    One idea is called bagging. In that approach we take the original training set and sample from it, giving us a number of new training sets. The sampling is normally done with replacement, which may result in one instance ending up many times in a new sample, or not at all. Another useful technique, often called the random subspace method, creates new training sets by randomly picking subsets of the features, usually without replacement. We can also combine both ideas, sampling instances and features at the same time; all three variants are sketched below.
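
    A sketch of the three sampling schemes with scikit-learn's BaggingClassifier follows; all hyperparameter values are illustrative assumptions, not recommendations. Bootstrapping the instances gives bagging, sampling only the features gives the random subspace scheme, and doing both combines the two ideas.

        from sklearn.datasets import make_classification
        from sklearn.ensemble import BaggingClassifier
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=500, n_features=20, random_state=0)

        # Bagging: bootstrap-sample the instances (with replacement).
        bagging = BaggingClassifier(
            DecisionTreeClassifier(), n_estimators=50, bootstrap=True, random_state=0
        )

        # Random subspaces: keep all instances, sample the features
        # (features are drawn without replacement by default).
        subspaces = BaggingClassifier(
            DecisionTreeClassifier(), n_estimators=50,
            bootstrap=False, max_samples=1.0, max_features=0.5, random_state=0
        )

        # Combined: sample both instances and features.
        combined = BaggingClassifier(
            DecisionTreeClassifier(), n_estimators=50,
            bootstrap=True, max_samples=0.8, max_features=0.5, random_state=0
        )

        for name, model in [("bagging", bagging), ("subspaces", subspaces), ("combined", combined)]:
            # Training accuracy, shown only to exercise the API.
            print(name, model.fit(X, y).score(X, y))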

    Boosting

    There are two popular boosting algorithms, AdaBoost and gradient boosting, with AdaBoost being the more popular. Both are supported by Scikit-learn [3] [4] [5]. AdaBoost works by giving the instances that were misclassified in one iteration a higher weight before the next one. Unlike with bagging, we may overfit the model if we keep adding more trees to a boosting algorithm.
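
    The sketch below, with illustrative settings, fits both scikit-learn classes and uses GradientBoostingClassifier's staged_predict to watch held-out accuracy as trees are added, which is one way to spot the overfitting mentioned above.

        from sklearn.datasets import make_classification
        from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=1000, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        # AdaBoost: reweights misclassified instances before each new iteration.
        ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
        print("AdaBoost test accuracy:", ada.score(X_test, y_test))

        # Gradient boosting: each new tree fits the errors of the ensemble so far.
        gb = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

        # Test accuracy per number of trees: if it starts falling, we are overfitting.
        for i, y_pred in enumerate(gb.staged_predict(X_test), start=1):
            if i % 50 == 0:
                print(f"{i} trees: test accuracy {accuracy_score(y_test, y_pred):.3f}")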

    References