A Beginner's guide to Regression Trees using Sklearn | Decision Trees
1. Greedy nature of decision trees
2. Equation of Regression Tree
3. Predictions in Regression Trees
4. Prediction using stratification of feature space
5. Disadvantages of predicting using stratification
6. Predicting using Tree Pruning
7. Regression Tree analysis using Sklearn
8. Finding the relation between Tree depth and Mean Square Error
Greedy nature of decision trees
Let's start with a Striker salary dataset example. To simplify the calculations, let's assume that we only have three features in the dataset, Experience, Goals last season, and Salary, and Salary is the feature that we want to predict. On this data, we decided to fit a decision tree model. After initially analyzing the data, we found that Experience was the feature that most clearly divided the whole dataset into different parts. So, rather than thinking ahead in time, we directly choose Experience as the root node at this point. We will discuss the above example in much more detail during this post.
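To see this greedy choice in action, here is a minimal sketch with made-up striker numbers (the dataset values and the resulting 4.5-year cutoff below are illustrative assumptions, not real data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical striker data: columns are [Experience (years), Goals last season]
X = np.array([[1, 18], [2, 5], [3, 20], [6, 7], [8, 22], [10, 9]])
y = np.array([40, 55, 60, 200, 260, 240])  # Salary (made-up numbers)

# A depth-1 tree makes exactly one greedy split
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)

# For this toy data, feature index 0 (Experience) is picked for the
# root split, with a threshold of 4.5 years
print(stump.tree_.feature[0], stump.tree_.threshold[0])
```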
Equation of Regression Tree
We already know the equation of the linear regression model, which is the equation of a straight line. We also have an equation for the regression tree.
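In its general form, the tree splits the feature space into M regions R_1, ..., R_M and, for every observation falling in region R_m, predicts a constant c_m (the mean of the training responses in that region):

```latex
f(x) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}\{ x \in R_m \}
```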
Predictions in Regression Trees
There are a few different ways in which we can make predictions with a regression tree. In this post, we are going to discuss:
- Prediction using Stratification of feature space
- Prediction using Tree Pruning
Prediction using stratification of feature space
As we have already discussed, while fitting trees we almost always use the greedy approach. We will use the same approach here as well. At any given step of node (branch feature) determination, we want to reduce the value of the Residual Sum of Squares (RSS). Mathematically, the value is given by:
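```latex
\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2
```

where \hat{y}_{R_j} is the mean response of the training observations that fall in region R_j.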
hitter's dataset. First, we plot the whole dataset onto a plan. For simplicity let’s consider that all the points are confined in a rectangle. In regression trees, at each point, we have to predict two values.
- Value of the branch feature (node).
- The cutoff value at which the reduction in RSS is minimum.
After analyzing the data, we found that Age was dividing the region in a better way, so the branch feature (root node) that we chose was Age. We also found that a cutoff value of 4.5 years gave optimal results. We then apply the same steps recursively. We now have two regions: R1, containing the salaries of hitters with experience < 4.5 years, and R2, with experience >= 4.5 years. Next, we choose from these two available regions and try to divide them further. The only other available feature is the number of hits. We found a clear distinction between the salaries of hitters with 117+ hits last year in region R2, so we split R2 on that cutoff. Finally, we chose to leave R1 as it is, since there was no clear advantage in dividing it further. This is what our regression tree will look like.
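As a rough sketch of that structure, the snippet below fits a depth-2 tree on invented hitters-style numbers (the data is an illustrative assumption) and prints the splits with sklearn's export_text:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical hitters-style data: columns are [Age (years of experience), Hits last year]
X = np.array([[2, 120], [3, 60], [4, 130], [5, 80], [7, 125], [9, 90], [10, 140]])
y = np.array([110, 110, 110, 400, 700, 420, 720])  # Salary (made-up numbers)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Prints the tree as text: for this toy data the root splits on Age at 4.5,
# and only the Age > 4.5 branch is split again, on Hits (mirroring the example above)
print(export_text(tree, feature_names=["Age", "Hits"]))
```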
In general, at every split we choose the predictor Xj and the cutoff value s that lead to the greatest reduction in RSS; that is, we solve:
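```latex
\min_{j,\, s} \left[ \sum_{i:\, x_i \in R_1(j,s)} \left( y_i - \hat{y}_{R_1} \right)^2 \;+\; \sum_{i:\, x_i \in R_2(j,s)} \left( y_i - \hat{y}_{R_2} \right)^2 \right]
```

where R_1(j,s) = \{X \mid X_j < s\} and R_2(j,s) = \{X \mid X_j \ge s\} are the two regions produced by the split.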
Disadvantages of predicting using stratification
- This algorithm might produce good results on the training data, but it is likely to overfit, leading to poor test-set performance. This happens because the tree becomes too complex.
Machine learning models try to find a sweet spot between overfitting and underfitting, so that the error is minimized while the model still generalizes to unseen data.
Predicting using Tree Pruning
Tree pruning isn't used only for regression trees; we make use of it in classification trees as well. As the word itself suggests, the process involves cutting the tree back into smaller parts. We can do pruning in two ways (a short sketch of both follows the list):
- Pre-pruning or early stopping
- Post-pruning
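Here is a minimal sketch of both approaches in sklearn, using synthetic data (the hyperparameter values are arbitrary choices for illustration, and the post-pruning alpha should really be tuned by cross-validation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, just for illustration
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Pre-pruning (early stopping): constrain growth while the tree is being built
pre_pruned = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10).fit(X, y)

# Post-pruning: grow a full tree, then cut it back via cost-complexity pruning.
# cost_complexity_pruning_path returns the effective alphas along the pruning
# path; a mid-path alpha is picked here purely as an example.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)

# Both pruned trees end up much shallower than an unconstrained fit
print(pre_pruned.get_depth(), post_pruned.get_depth())
```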
Regression Tree analysis using Sklearn
Let's import the required libraries:
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
```
We will use the boston data, which is the house pricing data of the Boston region. (Note that load_boston was deprecated and removed in scikit-learn 1.2, so you need an older version to run this as-is, or you can swap in another dataset such as fetch_california_housing.) You can learn more about the data by running:
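```python
# Print the dataset's bundled description (features, units, source)
print(load_boston().DESCR)
```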
```python
# Load the data and hold out a third of it for testing
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```
```python
# Fit an unconstrained regression tree and evaluate it on the test set
clf = DecisionTreeRegressor()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(mean_squared_error(y_test, predictions))           # MSE
print(np.sqrt(mean_squared_error(y_test, predictions)))  # RMSE
```
Finding the relation between Tree depth and Mean Square Error
We can train the model with different depth values using a simple Python loop and then plot the relation between them.
```python
# Refit the tree at every depth up to the full tree's depth
# and record the test-set MSE for each
mses = []
tree_depths = range(1, clf.tree_.max_depth + 1)
for depth in tree_depths:
    d_tree_reg = DecisionTreeRegressor(max_depth=depth)
    d_tree_reg.fit(X_train, y_train)
    tree_predictions = d_tree_reg.predict(X_test)
    mses.append(mean_squared_error(y_test, tree_predictions))

plt.figure(figsize=(10, 6))
plt.grid()
plt.plot(tree_depths, mses)
plt.xlabel("Tree Depth")
plt.ylabel("Mean Square Error")
```
From the plot, we can see that at a tree depth of around 6, we get the best results and the minimum Mean Square Error.