Random Forest

Sandun Dayananda
Apr 9, 2024


Mathematical and machine learning approach


Understanding Random Forests

Random Forest is a powerful and versatile machine learning algorithm that grows many decision trees and combines them into a “forest”. It is an ensemble learning method, in which the predictions of multiple models are aggregated to improve overall performance.

The Mathematics Behind Random Forests

The fundamental idea behind a Random Forest is to combine the predictions made by many decision trees into a single model. Individually, predictions made by decision trees may not be accurate, but combined, they can provide a more accurate and stable prediction.
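To make “more stable” concrete, here is the standard bagging variance argument (general background, not something derived in this article): if the forest averages B trees whose individual predictions each have variance σ² and pairwise correlation ρ, then

\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2

As B grows, the second term vanishes, so averaging many trees cuts variance, and it helps most when the trees are decorrelated. That is exactly what the “random” parts of a Random Forest are for.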

**Check my article on Decision Trees to learn more

Decision Trees

A decision tree is built on the entire dataset, using all the features/variables of the data. The idea is to split the data so that the resulting groups are as homogeneous as possible. A decision tree can use various measures to choose its splits; two of them are Gini impurity and entropy.

Gini impurity: if we select two items from a population at random, then in a perfectly pure population the probability that they belong to the same class is 1, and the Gini impurity is 0. It is used with categorical target variables such as “Success” or “Failure”, and trees built with it (as in CART) perform only binary splits.
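For concreteness, the Gini impurity of a node with class proportions p_1, …, p_K is G = 1 - Σ p_k²: 0 for a pure node, and larger the more mixed the node is. A minimal sketch (gini_impurity is an illustrative helper of mine, not a scikit-learn function):

from collections import Counter

def gini_impurity(labels):
    # G = 1 - sum(p_k^2) over the class proportions p_k
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(['Success'] * 10))                   # 0.0: a pure node
print(gini_impurity(['Success'] * 5 + ['Failure'] * 5))  # 0.5: maximally mixed for two classes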

Entropy: Entropy is a measure of the randomness in the data. In other words, it quantifies the impurity present in the dataset.

When we split our data into two regions and then split it again, the entropy of the system decreases. We want to decrease the entropy even more because the lower the entropy, the less disordered our data, and the better our model.
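To see why splitting reduces disorder: entropy is H = -Σ p_k log2(p_k) over the class proportions, and a split is good when the sample-weighted entropy of the child regions is lower than the parent’s (the drop is the information gain). A minimal sketch, where entropy and information_gain are illustrative helpers of mine, not library functions:

import math
from collections import Counter

def entropy(labels):
    # H = -sum(p_k * log2(p_k)) over the class proportions p_k
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    # entropy drop achieved by splitting `parent` into `left` and `right`
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ['A'] * 5 + ['B'] * 5
print(information_gain(parent, ['A'] * 5, ['B'] * 5))  # 1.0: a perfect split removes all disorder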

Random Forests

In a Random Forest, we grow multiple trees as opposed to a single tree, each one trained on a bootstrap sample of the data and considering only a random subset of features at each split (this is what the bootstrap=True and max_features='sqrt' arguments in the code below control, and it is what keeps the trees decorrelated). To classify a new object based on its attributes, each tree gives a classification, and the forest chooses the class with the most votes across all the trees; for regression, it takes the average of the outputs of the different trees.
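Here is a tiny sketch of just that aggregation step, with made-up per-tree outputs rather than a trained model:

from collections import Counter

# toy per-tree outputs; in a real forest these would come from model.estimators_
tree_class_votes = ['setosa', 'versicolor', 'setosa', 'setosa', 'versicolor']
tree_regression_outputs = [2.9, 3.1, 3.0, 3.3, 2.8]

# classification: the class with the most votes across the trees wins
majority_class = Counter(tree_class_votes).most_common(1)[0][0]
print(majority_class)  # 'setosa'

# regression: the forest predicts the average of the trees' outputs
print(sum(tree_regression_outputs) / len(tree_regression_outputs))  # 3.02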

Here is a basic example of how you can create a Random Forest model with Python’s scikit-learn library:

**you can see the full code in my GitHub repo

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import plot_tree
import eli5
from eli5.sklearn import PermutationImportance

# load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create the model: 100 trees, bootstrap sampling, sqrt(n_features) candidates per split
model = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features='sqrt')

# fit on the training data
model.fit(X_train, y_train)

# predict for the test set
y_pred = model.predict(X_test)

# calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy}')

# get the first decision tree from the forest to inspect its structure
first_tree = model.estimators_[0]

# visualize the tree
plt.figure(figsize=(15, 10))
plot_tree(first_tree,
          filled=True,
          rounded=True,
          class_names=iris.target_names,
          feature_names=iris.feature_names)
plt.show()

# create a PermutationImportance object on the model and compute importances
perm = PermutationImportance(model, random_state=1).fit(X_test, y_test)

# show feature importances (renders as a table when run in a Jupyter notebook)
eli5.show_weights(perm, feature_names=iris.feature_names)

This script plots the first decision tree in the Random Forest. Each node in the tree shows the feature and threshold used to split the data, the Gini impurity, the number of samples reaching the node, the distribution of those samples across classes, and the majority class.


Please note that visualizing the entire Random Forest is rarely practical: because of the large number of trees typically involved, the combined picture is complex and does not provide much insight.

For more detailed visualizations and analysis of Random Forest models, consider using tools like eli5, SHAP, or treeinterpreter. These tools can provide insights into feature importances and the prediction paths followed by specific samples.
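As one illustration, here is a minimal SHAP sketch, assuming the shap package is installed and reusing model, X_test, and iris from the script above (not part of the original script; note the handling for the two return shapes different shap versions use):

import shap

# build a tree-specific explainer for the trained forest
explainer = shap.TreeExplainer(model)

# per-sample, per-feature contribution values for the test set;
# older shap versions return a list with one (n_samples, n_features) array per class,
# newer ones return a single (n_samples, n_features, n_classes) array
shap_values = explainer.shap_values(X_test)
class_0 = shap_values[0] if isinstance(shap_values, list) else shap_values[:, :, 0]

# summarize which features push the first class's predictions hardest
shap.summary_plot(class_0, X_test, feature_names=iris.feature_names)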

Here I have used eli5, a Python library with built-in support for several ML frameworks.

Using eli5, I have visualized the feature importances in the model: the eli5.show_weights function displays the importance of each feature used by the trained Random Forest model.

In this context, a feature refers to a column in the dataset, or a property of the data. For the Iris dataset, the features are ‘sepal length’, ‘sepal width’, ‘petal length’, and ‘petal width’.

The PermutationImportance object perm has been fitted on the test data with the trained model. It computes each feature’s importance by shuffling that feature’s values and measuring how much the model’s score drops: the bigger the drop, the more the model relies on that feature.

When you call eli5.show_weights(perm, feature_names=iris.feature_names), it displays a table of features ranked by their importance in making predictions with the Random Forest model.

Features that are more important in predicting the target variable will have higher weights. This can help you understand which features are driving the model’s predictions.
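If you would rather stay inside scikit-learn itself, sklearn.inspection.permutation_importance implements the same idea (a standard scikit-learn function, though not used in the script above). A minimal sketch, reusing model, X_test, and y_test from earlier:

from sklearn.inspection import permutation_importance

# shuffle each feature n_repeats times and record the drop in the test-set score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=1)

# report the mean and standard deviation of the score drop per feature
for name, mean, std in zip(iris.feature_names, result.importances_mean, result.importances_std):
    print(f'{name}: {mean:.3f} +/- {std:.3f}')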
