So now let's get back to decision trees for classification. Recall that our first step is to select a feature and ask a question with a true or false answer to start growing our tree. That splits the data into two subsets, depending on whether we answer true or false to the given question. We then continue splitting on our available features to create further subsets of our data. And what we need to think about is, when can we actually stop splitting and declare we've made a decision? If we think back to our tennis example, we can split there by the outlook of the weather, temperature or humidity levels, and strong or weak wind. Then after we split, perhaps the subset of sunny, mild weather and weak wind would lead to a subset where everyone plays tennis. Or with a split of rainy versus not rainy, the rainy subset alone may already determine that no one will play tennis on that day. So one idea for determining when we should stop splitting is to continue splitting until the leaves are pure. That means every leaf, once we've done all our splits, is a collection of only one class: either everyone in that subset plays tennis, or everyone in that subset does not play tennis, and there aren't any leaves with a combination of yeses and nos. Another idea is that we can set a maximum depth and then stop, or prune, our tree at that depth. When we say stop, we mean we don't make any more splits; prune means that we build the tree all the way out and then start pruning it, cutting off some of the leaves to get down to our predetermined maximum depth. At that depth, we don't allow the tree to grow any further, and as a result, the leaves are probably not going to be pure in this case.
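As a minimal sketch of those stopping rules, here is one way they might look in Python. The helper names `should_stop` and `majority_class` are illustrative, not from any particular library, and labels are assumed to be simple "yes"/"no" strings:

```python
from collections import Counter

def majority_class(labels):
    # For an impure leaf, our best guess is the class with the majority.
    return Counter(labels).most_common(1)[0][0]

def should_stop(labels, depth, max_depth):
    # Stop either when the leaf is pure (only one class remains)
    # or when we have reached the predetermined maximum depth.
    is_pure = len(set(labels)) == 1
    return is_pure or depth >= max_depth
```

With a maximum depth set, a leaf that stops while still mixed would be assigned its majority class by `majority_class`.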
In that case, we can make our best guess by determining which class in that subset is dominant, which one has the majority, and assign that class to the leaf. A third idea is that we can keep going until a predefined performance metric is achieved; for example, once the classification reaches a certain accuracy, we stop. In practice, that first idea will often overfit: if we allow the tree to keep going until all of our leaves are pure, then we will often overfit our training set. That is, the splits, the questions that we ask at each node, are usually going to be tailor-made for that training set and may not generalize well. Those last two ideas, pruning at a maximum depth or stopping once we reach a certain accuracy metric, are in order to avoid overfitting. So how exactly do we find the right splits, the right true or false questions to ask in order to get to the subsets we'd ultimately like? We need a way of evaluating all the possible splits. Once we have that way of evaluating splits, we can then use greedy search to find the best split. Greedy search here means that at every step, we find the best split regardless of what happened in prior steps or what would happen in future steps. So how do we find that split? Enter information theory. We're going to find the split that induces the biggest information gain, which we will continue to elaborate on over the next couple of slides. But how is that information gain defined mathematically? Let's go ahead and take a look. We'll start by first trying classification error, just seeing if we can decrease the overall average error when we split our data. In other words, we're computing 1 minus the overall accuracy, where accuracy is just how many we got correct over how many there are in total. So at the top node, the majority vote would have been yes, they will play tennis.
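The greedy search described above can be sketched as follows. This is a simplified illustration, assuming each feature is already a true/false question; the function names `classification_error`, `information_gain`, and `best_split` are hypothetical, not from any standard library:

```python
from collections import Counter

def classification_error(labels):
    # 1 minus accuracy, where we predict the majority class of the node.
    return 1 - Counter(labels).most_common(1)[0][1] / len(labels)

def information_gain(parent, left, right):
    # Drop in weighted classification error after the split.
    n = len(parent)
    weighted = (len(left) / n) * classification_error(left) \
             + (len(right) / n) * classification_error(right)
    return classification_error(parent) - weighted

def best_split(rows, labels):
    # Greedy search: try every true/false question, keep the biggest gain,
    # ignoring prior and future splits.
    best = None
    for feature in rows[0]:
        left = [lab for row, lab in zip(rows, labels) if row[feature]]
        right = [lab for row, lab in zip(rows, labels) if not row[feature]]
        if not left or not right:
            continue  # question doesn't actually split the data
        gain = information_gain(labels, left, right)
        if best is None or gain > best[1]:
            best = (feature, gain)
    return best

# Tiny made-up example: "sunny" separates the classes perfectly.
days = [
    {"sunny": True,  "windy": True},
    {"sunny": True,  "windy": False},
    {"sunny": False, "windy": True},
    {"sunny": False, "windy": False},
]
played = ["yes", "yes", "no", "no"]
feature, gain = best_split(days, played)
```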
We have a classification error of 1 minus 8/12, 1 minus our accuracy, since we would have predicted 4 of those 12 incorrectly, so we have an error rate of about 0.33 here. Before stepping down to the leaves, I want to introduce some vocabulary. The top node we will consistently call the parent, whereas the nodes underneath, such as the Yes/No 2 and 2 that we have on the left and the Yes/No 6 and 2 that we have on the right, will be the children. So we'll have the parent on top and the children underneath, and children will be used interchangeably with leaves as well. Now, stepping down to the leaves, stepping down to the children, we see a classification error of 0.5 on the left side. So we actually have some information loss rather than information gain there, but it's on a smaller fraction of our data points compared to what we see in the right leaf. On the right side, we see that we reduce our error on that larger subset from 0.3333 to 0.25. Now, we want to measure the overall change using the weighted average of our two new classification errors, where each weight is equal to the proportion of the parent node that ended up in each of the two child nodes. So we weight each classification error according to how much of the parent node went into the subset of each of the left and right child nodes: on the left side we get 4/12 times 0.5, and on the right side we get 8/12 times 0.25. Once we multiply those out and add them together to get the weighted average, we find that overall we have not had any information gain from our new split. So in this example, assuming this is the best split we had left, which is possible, we would claim that we should not split the node any further. The problem here is that we are forced to stop well short of a point where all of our leaves would be homogeneous, such that all would play or all would not play in each of our child leaves.
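The arithmetic in this worked example can be checked directly. The counts below, 8 yes and 4 no at the parent, splitting into a 2-and-2 left child and a 6-and-2 right child, come from the example; exact fractions are used to avoid rounding:

```python
from fractions import Fraction as F

parent = ["yes"] * 8 + ["no"] * 4   # 8 yes, 4 no
left   = ["yes"] * 2 + ["no"] * 2   # 2 and 2
right  = ["yes"] * 6 + ["no"] * 2   # 6 and 2

def error(labels):
    # Classification error: 1 minus the majority-class fraction.
    counts = {c: labels.count(c) for c in set(labels)}
    return 1 - F(max(counts.values()), len(labels))

parent_err = error(parent)  # 1 - 8/12 = 1/3, about 0.33

# Weight each child's error by its share of the parent's points:
# 4/12 * 1/2 + 8/12 * 1/4 = 1/6 + 1/6 = 1/3.
weighted = F(len(left), len(parent)) * error(left) \
         + F(len(right), len(parent)) * error(right)

gain = parent_err - weighted  # 0: no information gain from this split
```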
With this possibility in mind, in our next video we will begin to discuss error metrics that actually allow us to continue splitting until we have perfectly homogeneous nodes. All right, I'll see you there.