Decision Tree

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represents classification rules.


It has decision blocks and terminating blocks where some conclusion has been reached. The right and left arrows coming out of the decision blocks are known as branches, and they can lead to other decision blocks or to a terminating block.

Desicion Trees

  • Pros: Computationally cheap to use, easy for humans to understand learned results, missing values OK, can deal with irrelevant features
  • Cons: Prone to overfitting
  • Works with: Numeric values, nominal values



Check if every item in the dataset is in the same class:
If so return the class label
find the best feature to split the data
split the dataset
create a branch node
for each split
call createBranch and add the result to the branch node
return branch node

Information Gain

We choose to split our dataset in a way that makes our unorganized data more organized. One way to organize this messiness is to measure the information. The change in information before and after the split is known as the information gain. The split with the highest information gain is your best option. The measure of information of a set is known as the Shannon entropy, or just entropy for short. Its name comes from the father of information theory, Claude Shannon.

Entropy is defined as the expected value of the information. the information for symbol x is defined as
$$l(x) = log_2p(x)$$
where p(x) is the probability of choosing this class.

To calculate entropy, you need the expected value of all the information of all possible values of our class. This is given by
where n is the number of classes.