Random Forests are an increasingly popular machine learning algorithm. They perform well in a wide variety of learning and prediction problems despite being exceedingly easy to implement. Random forests can be used for both classification and regression problems.
This post describes the intuition behind how Random Forests work, as well as their advantages and disadvantages.
At a high level, the algorithm usually works as follows:
1. Take a bootstrap sample of the training data, i.e., a random sample drawn with replacement.
2. Train a decision tree on the bootstrap sample, considering only a random subset of the features at each split.
3. Repeat (1) and (2) B times, where B is the number of trees (also called bags). Depending on the nature of the data, the ideal number of trees ranges from 100 to several thousand.
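To make the loop concrete, here is a minimal from-scratch sketch built on SKlearn's DecisionTreeClassifier. The function names are illustrative, not part of any library, and it assumes X and y are NumPy arrays with non-negative integer class labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, random_state=0):
    # Steps (1) and (2), repeated n_trees times.
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        # (1) Bootstrap sample: draw rows with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        # (2) Fit a tree that considers a random subset of features at each split.
        trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # Predict by majority vote across the trees.
    votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)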
Because every tree in the forest is trained independently of the others, training parallelizes easily: SKlearn will allow you to use all the cores in your computer to train multiple trees at the same time, and the algorithm is also available in Spark.
The main way to overfit the model is to grow trees with too many branches. Increasing the number of trees does not increase the risk of overfitting; instead, an increasing number of trees tends to decrease the amount of overfitting.
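In SKlearn, for example, the max_depth parameter caps how many levels of branches each tree can grow (the value 10 below is purely illustrative):

from sklearn.ensemble import RandomForestClassifier

# Limiting tree depth guards against overfitting; adding more trees is safe.
model = RandomForestClassifier(n_estimators=500, max_depth=10)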
We can also avoid cross-validation by using the out-of-bag score. This is available as an argument in Spark and SKlearn and provides nearly identical results to N-fold cross-validation.
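In SKlearn, for instance, passing oob_score=True exposes the estimate as the oob_score_ attribute after fitting (this sketch assumes X and y already hold your training data):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, oob_score=True)
model.fit(X, y)
print(model.oob_score_)  # out-of-bag accuracy, a stand-in for cross-validation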
Gradient Boosted Machines tend to perform better under both of the following conditions:
Since predicting a new observation requires running the observation through every tree, Random Forests are often too slow for real-time prediction.
When very few of the features carry any signal, SVM and Naive Bayes tend to perform better than Random Forests. One of the reasons for this is that each tree only has access to a random subset of the features by default. If very few features are of any importance, most trees will miss the important features entirely.
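In SKlearn, the size of that random subset is controlled by the max_features parameter, so one mitigation is to raise it (the 0.5 below is illustrative):

from sklearn.ensemble import RandomForestClassifier

# Let each split consider half of the features rather than the default sqrt.
model = RandomForestClassifier(max_features=0.5)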
Below is an example of using a Random Forest with 200 trees:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=200, oob_score=True, verbose=1,
                                  random_state=2143, min_samples_split=50,
                                  n_jobs=-1)
fitted_model = model_rf.fit(X, y)
The oob_score argument tells the classifier to return our out-of-sample (out-of-bag) error estimate. The min_samples_split=50 argument tells the classifier to only split a branch of the tree if the current node has at least 50 observations. n_jobs=-1 tells SKlearn to use all the cores on my machine.
An example of using Random Forests in Spark can be found here. Here is the same example as in SKlearn:
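A minimal sketch with the pyspark.ml API, assuming a running SparkSession and a DataFrame train_df (the name is illustrative) with an assembled "features" vector column and a "label" column:

from pyspark.ml.classification import RandomForestClassifier

# Same settings as the SKlearn example: 200 trees, a node-size floor, a fixed seed.
rf = RandomForestClassifier(numTrees=200, minInstancesPerNode=50, seed=2143)
model = rf.fit(train_df)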
Note that Spark's minInstancesPerNode parameter plays the same role as the min_samples_split argument in SKlearn.