Random Forests: At a Glance

Luke Harrison
6 min read · Mar 4, 2022

A “random forest” is an algorithm built from trees. In fact, a random forest is an ensemble of decision trees: the forest makes its prediction by combining the predictions of each individual tree. We inject randomness into the training of each tree to reduce the correlation between them. This matters because highly correlated trees tend to make the same mistakes, so their combined answer can be precise yet consistently wrong; de-correlating the trees means their individual errors tend to cancel out when their votes are combined. In this article I plan to explore the general use, history, and ethics of this algorithm.
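The core idea — many de-correlated trees voting together — can be sketched in a few lines. This is a toy illustration of the mechanism, not the full algorithm; it builds each tree on a random bootstrap sample and combines them by majority vote:

```python
# Toy sketch: train several decision trees on random bootstrap samples,
# then combine their predictions by majority vote.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

trees = []
for _ in range(10):
    # Each tree sees a different random sample of the rows (a bootstrap),
    # which helps keep the trees de-correlated from one another.
    idx = rng.integers(0, len(X), len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# The "forest" predicts by majority vote across its trees.
votes = np.stack([tree.predict(X) for tree in trees])
forest_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Real random forests (like scikit-learn's, used later in this article) add more randomness than this sketch, but the vote-of-many-trees structure is the same.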

A Little History:

The general method of random forests is to create a number of decision trees, each trained with a different random seed, and then combine them into a forest whose trees make predictions as a group. This method was first proposed by Tin Kam Ho of AT&T Bell Laboratories in 1995. He was interested in the speed at which decision trees could be executed and sought a way to scale them to a higher level of complexity without losing generalization accuracy on unseen data.

In 1997, Yali Amit and Donald Geman expanded on this further. In a paper titled “Shape Quantization and Recognition with Randomized Trees,” they introduced the idea of searching over a random subset of the available decisions when splitting a node, drawing on a virtually infinite family of trees even with a smaller dataset. They argued that although more rigid and complex structures could produce more precise results, it was precisely because their trees were primitive and simple that the method was more stable. Finally, Thomas G. Dietterich introduced the idea of randomized node optimization, in which the decision at each node is selected by a randomized procedure rather than by deterministic optimization.
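Amit and Geman's idea of considering only a random subset of candidate splits survives directly in modern libraries. In scikit-learn it is the `max_features` parameter; a quick sketch of the two extremes, using the Iris data that appears later in this article:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Consider only sqrt(n_features) randomly chosen features at each split
# (Amit & Geman's randomization, and scikit-learn's default for classification).
rf_random = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
).fit(X, y)

# Consider every feature at every split: less randomness,
# so the trees end up more correlated with one another.
rf_all = RandomForestClassifier(
    n_estimators=100, max_features=None, random_state=0
).fit(X, y)
```

On a tiny dataset like Iris the two behave similarly; the difference matters on datasets with many features, where the random subset keeps the trees diverse.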

What is it good for?

Because the random forest is a predictive algorithm, it is often used in situations where pattern recognition and analysis are important. Not surprisingly, two areas where this matters a great deal are banking and e-commerce. In the banking system, random forests are used to predict the actions of borrowers in an effort to root out those most likely to commit fraud. In less severe cases, they are also used to determine whether a borrower is likely to make repayments. On a macro scale, economists use random forests to predict the likelihood of systemic banking crises in an effort to prevent them. The algorithm is seen as one of the few that is less susceptible to overfitting and to skewing by outliers.

As for e-commerce, I’m sure you’ve come across a recommended product once or twice while shopping online. That recommendation system is likely using a random forest. Random forest regression examines previous decisions by customers and companies to make predictions for the purposes of sales. Companies are often unconcerned with the circumstances surrounding an individual sale, and the generalizing nature of the random forest lets them optimize recommendations for the largest subset of potential buyers.
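A hypothetical sketch of what that regression might look like — the feature names and data here are entirely made up for illustration, not taken from any real recommendation system:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Fabricated behavioural features for 200 imaginary customers:
# [past purchases, pages viewed, days since last visit] (all scaled to 0-1).
X = rng.random((200, 3))

# Fabricated target: a purchase-likelihood score loosely tied to the features.
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] - 0.2 * X[:, 2] + rng.normal(0, 0.05, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Score a very engaged customer vs. a mostly inactive one.
engaged = model.predict([[0.9, 0.8, 0.1]])
inactive = model.predict([[0.1, 0.1, 0.9]])
```

The forest learns the broad pattern (engaged customers score higher) without needing any model of *why* a given customer buys — which is exactly the generalizing behavior described above.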

Some insight:

We’ve talked a fair bit about the benefits of using random forests, but the method does have a few caveats. While it’s incredibly useful for generalization and prediction, it provides no descriptive information about those predictions. It has no way of detailing why or how the data involved is related. This is by design, but it should still be noted. While accurate, the information gained from using this algorithm can be less than useful in cases where you want to consider outliers and the correlations between data points. Additionally, when it comes to regression, the algorithm cannot make predictions outside the range of the target values it was trained on.
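That last limitation is easy to demonstrate. A quick sketch: a forest trained on targets between 0 and 20 can only ever average the training targets it has seen, so it cannot extrapolate beyond them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on a simple linear relationship: x in [0, 10], so y = 2x is in [0, 20].
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2 * X.ravel()

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The true value at x = 50 is 100, but each tree can only return an
# average of training targets, so the forest's prediction stays near 20.
prediction = model.predict([[50.0]])
```

A linear regression would extrapolate here without trouble; the random forest is capped at the edge of its training data.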

An Ethical Conundrum…

Ethically speaking, random forests can be considered a hot topic, especially in the medical community. While the use of algorithms like this in banking could be considered negative by some, in healthcare, random forests are a matter of life and death.

Random forests are frequently used in the medical field for the purposes of diagnosis, disease prevention, and health insurance billing. The algorithm provides doctors with pattern recognition tools that are unbiased when reviewing a patient’s medical history and highly accurate when predicting potential health complications. This has saved lives in some cases, but in many more cases, it has resulted in higher premiums and denials. Some doctors and many patients argue that using machines to predict patient health and determine healthcare costs is unethical. The algorithms generalize heavily and largely ignore outliers. For a patient with a rare genetic disorder, extenuating or extraordinary circumstances, or a thin medical history, this could and has led to misdiagnosis, price gouging, and coverage termination. While this tool has helped many, due to the unique nature of human physiology it is still being debated whether the benefits of its use outweigh the negatives. People rarely live lives similar enough to justify the hyper-generalization of their circumstances. Today, in most situations outside of risk assessment, random forest classification is used in conjunction with human diagnosis and personal diagnosticians. In risk assessment, however, it is still more often than not the only tool implemented or required to negatively impact an individual’s healthcare costs. I believe that with human intervention, this is an excellent tool that benefits humanity, but it also needs strict regulation, much like the rest of the insurance industry.

Let’s have a look:

Now that might have been a little intense, but let’s take a quick, simple look at what coding a random forest even looks like. We’ll be replicating a tutorial using the scikit-learn Iris dataset.

First, let’s import the dataset.

#Import scikit-learn dataset library 
from sklearn import datasets
#Load dataset
iris = datasets.load_iris()

Next, we create a DataFrame.

# Creating a DataFrame of given iris dataset. 
import pandas as pd
data = pd.DataFrame({
    'sepal length': iris.data[:, 0],
    'sepal width': iris.data[:, 1],
    'petal length': iris.data[:, 2],
    'petal width': iris.data[:, 3],
    'species': iris.target
})
data.head()

Next we separate the columns into features and labels, and then split those into our training and test sets.

# Import train_test_split function
from sklearn.model_selection import train_test_split

X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]  # Features
y = data['species']  # Labels

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # 70% training and 30% test

Next we train the model on the training set and perform predictions on the test set.

#Import Random Forest model
from sklearn.ensemble import RandomForestClassifier
#Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)
#Train the model using the training sets
clf.fit(X_train, y_train)
#Perform predictions on the test set
y_pred = clf.predict(X_test)

Finally, we check how often the classifier’s predictions match the test labels.

#Import scikit-learn metrics module for accuracy calculation 
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Conclusion:

Random forests are a fantastic algorithm with an interesting development history and very mainstream adoption. It would be fair to assume that the way we use them will continue to evolve. Over the next few years, it would be reasonable to expect the algorithm to be further developed and its use cases to expand.

Sources:

Donges, Niklas. “A Complete Guide to the Random Forest Algorithm.” Built In, 17 Sept. 2021, builtin.com/data-science/random-forest-algorithm.

Khalilia, Mohammed. “Predicting Disease Risks from Highly Imbalanced Data Using Random Forest — BMC Medical Informatics and Decision Making.” BioMed Central, 29 July 2011, bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-11-51.

“Sklearn.Datasets.Load_iris.” Scikit-Learn, scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html.

Wikipedia contributors. “Decision Tree.” Wikipedia, 20 Feb. 2022, en.wikipedia.org/wiki/Decision_tree.
