# A simple mathematical guide to classification Trees using sklearn | Decision Trees

**Updated on:** April 11, 2020 · 15 mins read

1. Classification Error Rate for Classification Trees

2. Entropy in Classification tree

3. Information Gain in classification trees

4. Gini Index in Classification Trees

5. Gini Gain in Classification Trees

6. Using Entropy and Information Gain to create Decision tree nodes

7. Using Gini Index and Gini Gain to create Decision tree nodes

8. Classification Trees using Sklearn

As we know from our previous discussion on Regression Trees, tree algorithms are greedy in nature: they choose the best split available now rather than a split that would lead to a better tree later.
Also, in contrast to the regression tree model, where the predicted response of any given node is the mean of all the observations in the region, in a classification tree the predicted response is the most commonly occurring class among the training observations in the region.
For more details, read the earlier post, *A Beginner's guide to Regression Trees using Sklearn | Decision Trees*.

The following measures can be used to evaluate a split:

- Classification Error rate
- Entropy
- Gini Index

## Classification Error Rate for Classification Trees

The classification error rate is simply the fraction of the training observations in a region that do not belong to the most common class. Mathematically,
$E = 1 - \max_{k} p(k)$

where $p(k)$ is the proportion of training observations in the $m$th region that are from the $k$th class.

The classification error rate is generally not used because it is not sufficiently sensitive to changes in node purity for tree-growing. Therefore, `Entropy` or `Gini index` is used instead.
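As a rough illustration of the formula above, the error rate of a region can be computed from its class labels. The function name `classification_error` is a hypothetical helper introduced for this sketch, not part of sklearn.

```python
from collections import Counter


def classification_error(labels):
    """Fraction of observations that do not belong to the most common class."""
    counts = Counter(labels)
    # Size of the largest class in the region
    most_common_count = counts.most_common(1)[0][1]
    return 1 - most_common_count / len(labels)


# A region with 6 "Yes" and 2 "No" observations: E = 1 - 6/8
print(classification_error(["Yes"] * 6 + ["No"] * 2))  # 0.25
```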
## Entropy in Classification tree

Entropy is a measure of the amount of uncertainty (randomness) in the data. The higher the uncertainty, the higher the entropy. The value of entropy is `zero` when there is no uncertainty in an event, for example when tossing a coin that has heads on both sides.
Mathematically, entropy is given by
$H(S) = \displaystyle \sum_{x \in X} p(x) \log_2 \frac{1}{p(x)}$

where $p(x)$ is the proportion of occurrences of event $x$.

For example, for a simple toss of a fair coin, the probability of each outcome is `1/2`, so the entropy is $\frac{1}{2} \log_2 2 + \frac{1}{2} \log_2 2 = 1$.
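The entropy formula can be sketched in a few lines of Python. The helper name `entropy` is an assumption made for this example; it takes a list of observed outcomes and uses their proportions as $p(x)$.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of outcomes or class labels."""
    total = len(labels)
    # p(x) * log2(1 / p(x)) with p(x) = count / total
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )


print(entropy(["H", "T"]))  # fair coin: 1.0
print(entropy(["H", "H"]))  # coin with heads on both sides: 0.0
```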
## Information Gain in classification trees

Information gain is the value gained for a given set `S` when some feature `A` is selected as a node of the tree.
When selecting a node during tree generation, we want to maximize the information gain at that point.
Information gain is the change in entropy before and after selecting a given feature as a node of the tree.
Mathematically, it is given as,
$IG(S, A) = H(S) - H(S, A)$

$IG(S, A) = H(S) - \displaystyle \sum_{x \in A} P(x) \, H(S_x)$

where $H(S)$ is the entropy of the entire set, and $\sum_{x \in A} P(x) \, H(S_x)$ is the entropy after splitting on feature $A$. Here $S_x$ is the subset of $S$ in which $A$ takes the value $x$, and $P(x)$ is the proportion of observations that fall in that subset.

In other words, information gain is the amount of entropy removed by adding a node to the tree.
We will discuss it further while building a tree using information gain.
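A minimal sketch of this calculation, assuming the feature values and class labels are given as two parallel lists. The helper names `entropy` and `information_gain` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    return sum(c / total * math.log2(total / c) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(S) minus the weighted entropy of the subsets created by the feature."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


labels = ["Yes", "Yes", "No", "No"]
# A feature that separates the classes perfectly removes all the entropy
print(information_gain(["a", "a", "b", "b"], labels))  # 1.0
# A feature that tells us nothing about the class removes none
print(information_gain(["a", "b", "a", "b"], labels))  # 0.0
```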

## Gini Index in Classification Trees

This is the default criterion of the sklearn `DecisionTreeClassifier`, which tries to minimize it. It is used to quantify the quality of a split at any given moment of node selection. Mathematically, the Gini index is given by,
$G = \displaystyle \sum_{k=1}^{K} P(k)(1 - P(k))$

where $P(k)$ is the proportion of training instances with class $k$.

The minimum value that the Gini index can have is 0. For example, a coin with heads on both sides gives a Gini index of 0:

$1 \cdot (1 - 1) + 0 \cdot (1 - 0) = 0$

The Gini index also tells us about the purity of a node. The purer the node, the lower the value of the Gini index.
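The Gini index formula can be sketched the same way. The helper name `gini_index` is an assumption for this example and is not an sklearn function.

```python
from collections import Counter


def gini_index(labels):
    """Gini index: sum over classes of P(k) * (1 - P(k))."""
    total = len(labels)
    return sum((c / total) * (1 - c / total) for c in Counter(labels).values())


print(gini_index(["H", "H"]))  # pure node (two-headed coin): 0.0
print(gini_index(["H", "T"]))  # evenly mixed node: 0.5
```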
## Gini Gain in Classification Trees

Just as we have information gain in the case of entropy, we have Gini gain in the case of the Gini index. It is the reduction in Gini index obtained when a feature is chosen as a node of the decision tree. We will take an example to understand these terms in a little more detail. Let’s consider the following data source,

| Outlook  | Temperature | Humidity | Wind   | Played |
|----------|-------------|----------|--------|--------|
| Sunny    | Hot         | High     | Weak   | No     |
| Sunny    | Hot         | High     | Strong | No     |
| Overcast | Hot         | High     | Weak   | Yes    |
| Rain     | Mild        | High     | Weak   | Yes    |
| Rain     | Cold        | Normal   | Weak   | Yes    |
| Rain     | Cold        | Normal   | Strong | No     |
| Overcast | Cold        | Normal   | Strong | Yes    |
| Sunny    | Mild        | High     | Weak   | No     |
| Sunny    | Cold        | Normal   | Weak   | Yes    |
| Rain     | Mild        | Normal   | Weak   | Yes    |
| Sunny    | Mild        | Normal   | Strong | Yes    |
| Overcast | Mild        | High     | Strong | Yes    |
| Overcast | Hot         | Normal   | Weak   | Yes    |
| Rain     | Mild        | High     | Strong | No     |
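Written in the same form as the information gain formula, Gini gain can be sketched as below. The symbols $GG$, $GI$ and $S_x$ are notation assumed here for illustration: $GI(S)$ is the Gini index of the whole set, $S_x$ is the subset of $S$ in which feature $A$ takes the value $x$, and $P(x)$ is the proportion of observations in that subset.

$GG(S, A) = GI(S) - \displaystyle \sum_{x \in A} P(x) \, GI(S_x)$
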

## Using Entropy and Information Gain to create Decision tree nodes

### Calculate the overall entropy

```
Total yes cases = 9
Total No cases = 5
Total Cases = 14
```

$H(S) = \frac{9}{14}\log_{2}\frac{14}{9} + \frac{5}{14}\log_{2}\frac{14}{5}$

$= 0.940$

With two classes, entropy ranges from 0 to 1; the closer the value is to 1, the more randomness there is in the data.

Now that we have the total entropy, let’s calculate the information gain obtained by choosing each feature separately. Let’s start with the `Wind` feature.
$IG(S, Wind) = H(S) - \displaystyle \sum_{x \in Wind} P(x) \, H(S_x)$

```
Total Weak wind cases = 8
Total Strong wind cases = 6
Total Cases = 14
```

#### Entropy for Weak wind

$H(S_{weak}) = \frac{6}{8} \log_{2}\frac{8}{6} + \frac{2}{8} \log_{2}\frac{8}{2}$

$= 0.811$

#### Entropy for Strong wind

$H(S_{strong}) = \frac{3}{6} \log_{2}\frac{6}{3} + \frac{3}{6} \log_{2}\frac{6}{3}$

$= 1.00$

Total Information Gain can be calculated as follows,
$IG(S, Wind) = H(S) - P(S_{weak}) * H(S_{weak}) - P(S_{strong}) * H(S_{strong})$

$= 0.940 - \frac {8}{14} (0.811) - \frac{6}{14}(1.00)$

$= 0.048$

Similarly, we calculate `IG` for the other features and select the one that produces the highest value of `IG` as the node.
We then repeat this process on every branch created until leaf nodes are reached.
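The selection step can be sketched in plain Python on the table above. This is only an illustration of choosing the root node, not how sklearn builds its trees; the names `ROWS`, `FEATURES`, `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter, defaultdict

FEATURES = ["Outlook", "Temperature", "Humidity", "Wind"]
# Rows of the table above: the four features followed by the "Played" label
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cold", "Normal", "Weak", "Yes"),
    ("Rain", "Cold", "Normal", "Strong", "No"),
    ("Overcast", "Cold", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cold", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    return sum(c / total * math.log2(total / c) for c in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of all labels minus the weighted entropy after splitting on a column."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[column]].append(row[-1])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[-1] for row in rows]) - weighted


gains = {name: information_gain(ROWS, i) for i, name in enumerate(FEATURES)}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")

# The feature with the highest information gain becomes the root node
best = max(gains, key=gains.get)
print("Root node:", best)
```

On this table the gain for `Wind` comes out at about 0.048, matching the hand calculation, and `Outlook` has the highest gain, so it would be chosen as the root.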
## Using Gini Index and Gini Gain to create Decision tree nodes

### Calculate the overall Gini impurity

```
Total yes cases = 9
Total No cases = 5
Total Cases = 14
```

$GI(S) = \frac{9}{14}(1 - \frac{9}{14}) + \frac{5}{14}(1 - \frac{5}{14})$

$= \frac{45}{98} \approx 0.459$

Now that we have the total Gini impurity, let’s calculate the Gini gain obtained by choosing each feature separately.
Let’s start with the `Wind` feature.
```
Total Weak wind cases = 8
Total Strong wind cases = 6
Total Cases = 14
```

#### Gini Index for Weak wind

$GI(S_{weak}) = \frac{6}{8}(1 - \frac{6}{8}) + \frac{2}{8}(1 - \frac{2}{8})$

$= \frac{3}{8}$

#### Gini Index for Strong wind

$GI(S_{strong}) = \frac{3}{6}(1 - \frac{3}{6}) + \frac{3}{6}(1 - \frac{3}{6})$

$= \frac{1}{2}$

Total Gini Gain can be calculated as follows,
$GG(S_{wind}) = GI(S) - \frac{8}{14} * GI(S_{weak}) - \frac{6}{14} * GI(S_{strong})$

$GG(S_{wind}) = \frac{45}{98} - \frac{8}{14} \cdot \frac{3}{8} - \frac{6}{14} \cdot \frac{1}{2}$

$= \frac{3}{98} \approx 0.0306$

Similarly, we calculate `GG` for the other features and select the one that produces the highest value of `GG` as the node.
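As a sketch of that comparison, the Gini gain of each feature can be computed directly from the (Yes, No) counts read off the table. The helper names `gini_from_counts` and `gini_gain` and the `SPLITS` dictionary are assumptions made for this example.

```python
def gini_from_counts(counts):
    """Gini index of a node given the number of observations in each class."""
    total = sum(counts)
    return sum((c / total) * (1 - c / total) for c in counts)


def gini_gain(parent_counts, child_counts):
    """Gini index of the parent minus the weighted Gini index of its children."""
    total = sum(parent_counts)
    weighted = sum(
        sum(child) / total * gini_from_counts(child) for child in child_counts
    )
    return gini_from_counts(parent_counts) - weighted


PARENT = (9, 5)  # (Yes, No) counts for the whole table
# (Yes, No) counts for each value of each feature, read off the table
SPLITS = {
    "Outlook": [(2, 3), (4, 0), (3, 2)],      # Sunny, Overcast, Rain
    "Temperature": [(2, 2), (4, 2), (3, 1)],  # Hot, Mild, Cold
    "Humidity": [(3, 4), (6, 1)],             # High, Normal
    "Wind": [(6, 2), (3, 3)],                 # Weak, Strong
}

gini_gains = {name: gini_gain(PARENT, children) for name, children in SPLITS.items()}
for name, gain in gini_gains.items():
    print(f"{name}: {gain:.4f}")

# The feature with the highest Gini gain becomes the root node
best_feature = max(gini_gains, key=gini_gains.get)
print("Root node:", best_feature)
```

With these counts `Wind` gives a Gini gain of about 0.0306 and `Outlook` gives the highest value, so both criteria pick the same root for this table.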
This is the basic understanding of Classification Trees.
## Classification Trees using Sklearn

We will use sklearn to train a model on the breast cancer dataset.

### Import everything

```
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# IPython/Jupyter magic, not valid in a plain Python script.
# Uncomment it when running in a notebook.
# %matplotlib inline
```

### Load data and split it into training and testing data

```
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```

### Initialize the model with default criterion (Gini Index)

```
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
```
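To see what the fitted tree looks like, one option is to print its size and its top-level rules with `export_text` from `sklearn.tree`. This is a sketch rather than part of the original walkthrough; `random_state=0` on the classifier and `max_depth=2` for the printout are choices made here for reproducibility and brevity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

# Default criterion is "gini"; random_state is fixed here only for reproducibility
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

print("Criterion:", clf.criterion)
print("Depth:", clf.get_depth(), "Leaves:", clf.get_n_leaves())

# Text dump of the first levels of the tree; deeper branches are truncated
rules = export_text(clf, feature_names=list(data.feature_names), max_depth=2)
print(rules)
```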

### Predict on the test data and check the reports

```
predictions = clf.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```
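For a binary problem like this one, the confusion matrix can be unpacked into its four cells to make the report easier to read. The sketch below assumes the same data split as above and fixes `random_state=0` on the classifier for reproducibility; in this dataset class `0` is malignant and class `1` is benign, so "positive" here means benign.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
predictions = clf.predict(X_test)

# Rows are true classes, columns are predicted classes, in label order [0, 1]
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

# Accuracy is the share of predictions on the diagonal of the matrix
accuracy = (tp + tn) / (tn + fp + fn + tp)
print("Accuracy:", accuracy)
```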

### Using Entropy as the criterion for the classification model

```
clf_entropy = DecisionTreeClassifier(criterion='entropy')
clf_entropy.fit(X_train, y_train)
predictions_entropy = clf_entropy.predict(X_test)
print(confusion_matrix(y_test, predictions_entropy))
print(classification_report(y_test, predictions_entropy))
```
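A single train/test split can favour either criterion by chance, so one way to compare them more fairly is cross-validation. The following is a sketch using `cross_val_score`; the 5 folds and `random_state=0` are assumptions made for this example, not settings from the walkthrough above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

scores = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    # Mean accuracy over 5 cross-validation folds
    scores[criterion] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{criterion}: {scores[criterion]:.3f}")
```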
