A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represents classification rules.

It has decision blocks and terminating blocks where some conclusion has been reached. The right and left arrows coming out of the decision blocks are known as branches, and they can lead to other decision blocks or to a terminating block.

#### Desicion Trees

• Pros: Computationally cheap to use, easy for humans to understand learned results, missing values OK, can deal with irrelevant features
• Cons: Prone to overfitting
• Works with: Numeric values, nominal values

createBranch

#### Information Gain

We choose to split our dataset in a way that makes our unorganized data more organized. One way to organize this messiness is to measure the information. The change in information before and after the split is known as the information gain. The split with the highest information gain is your best option. The measure of information of a set is known as the Shannon entropy, or just entropy for short. Its name comes from the father of information theory, Claude Shannon.

Entropy is defined as the expected value of the information. the information for symbol x is defined as
$$l(x) = log_2p(x)$$
where p(x) is the probability of choosing this class.

To calculate entropy, you need the expected value of all the information of all possible values of our class. This is given by

where n is the number of classes.