So decision trees tend to have high variance; that is, they tend to overfit. Since they don't make any strong assumptions, such as linearly separable classes, they tend to find structures that explain the training set too well. Small changes in the data can end up greatly affecting the predictions, and the model doesn't generalize outside of that training set.

So one solution is to impose a max depth to help prune our trees, which allows the tree to only make a certain number of splits. Let's say we set our max depth equal to 2; we would now have a pruned tree that only has a depth of two. We've cut off some leaf nodes, and we instead assign as the predicted class the majority class of that higher-up node. But we can keep pruning, and it doesn't have to be defined by the depth of the tree; we don't have to say max depth equals 2. There are other ways to prune our tree. One criterion is a threshold on classification error: something like, if this leaf correctly classifies 95 percent of its samples, then that is enough and we shouldn't do any further splitting. Another type of pruning is to set a threshold on information gain, so we need to gain a certain amount of information to continue splitting. Or we can require a minimum number of rows in a subset, below which we no longer allow further splits. You can imagine that if a node is down to three rows, we wouldn't want to split it further into two leaf nodes with two rows in one and one in the other. We just predict the majority class for those three rows, given that we require a minimum of three samples within each leaf.

Some strengths of decision trees that we want to keep in mind here: since a tree is a sequence of questions and answers, it turns out to be very easy to interpret as well as to visualize, assuming we don't have too large of a tree. Too large a tree can get a bit more complex, but even then we can look at the nodes near the root, where the larger decisions are made. As mentioned earlier, in business, interpretability of a solution is a huge advantage. With a decision tree, for example, go back to our customer churn example: if we subset our data to customers who pay over a certain amount per month, and then within that subset find another subset that uses very few gigabytes per month given our thresholds, so they pay a lot but use very little, maybe those customers are more likely to churn. That's easy to explain to a higher-up or to whoever's helping you make this decision. Decision trees are also pretty easy to use with various types of data; the algorithm will fairly easily turn any of your features into binary features on its own. Then, as opposed to distance-based or linear algorithms, there's no scaling required. Scaling would simply change the question asked at each node, but the ordering of those values, if we scaled them, would remain the same, so it has no impact on creating those splits.

So how do we do this in practice in sklearn? As usual, first we import the class: from sklearn.tree, we import our DecisionTreeClassifier. We then create an instance of the class, so we initiate that DecisionTreeClassifier with its hyperparameters and set it equal to DTC, and we have our different tree parameters here. Some of these hyperparameters help regularize the model.
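Here is a minimal sketch of that instantiation, assuming the specific values (ten features per split, a max depth of five, a minimum of three samples per leaf, and a small impurity-decrease threshold) are illustrative choices rather than the exact settings from the lab:

```python
from sklearn.tree import DecisionTreeClassifier

# Create an instance of the class with hyperparameters that help regularize the tree.
DTC = DecisionTreeClassifier(
    criterion='gini',            # splitting criterion: 'gini' or 'entropy'
    max_features=10,             # only consider 10 features at each split
    max_depth=5,                 # prune by capping how deep the tree can grow
    min_samples_leaf=3,          # require at least 3 rows in each leaf
    min_impurity_decrease=0.01,  # require a minimum gain to keep splitting (value is illustrative)
)
```

Each keyword maps onto one of the pruning ideas above: depth limits, a minimum leaf size, and a minimum gain before a split is allowed.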
First, our splitting criterion here is Gini, and we have the option of Gini versus entropy. We can also set the maximum number of features to look at when doing a split. This helps reduce overfitting a bit, since the tree isn't allowed to look at every single feature; it can only look at 10 features each time it makes a split. Then you see that we also set the max depth of the tree equal to five, so that we can't create a tree more than five layers deep.

We then fit the instance on the data and predict the expected values, same as before: we call that initiated model DTC, we call DTC.fit on our training data, and then we come up with our predictions on our holdout set. Then, as we've done with other algorithms, we want to tune those hyperparameters using cross-validation. And if you want to do regression, rather than DecisionTreeClassifier you just use DecisionTreeRegressor, which has many of the same hyperparameters; the main difference is that the outcome variable will be continuous rather than falling into certain classes. A sketch of this full workflow appears just after the recap below.

So to recap the section on decision trees: we went through a brief overview of classification problems, where the models we've looked at so far were either linear, such as logistic regression and support vector machines, or took a lot of computational power to create a prediction, like the k-nearest neighbors algorithm. We then introduced the concept of the decision tree and how to use it for classification. We got an understanding of how we should split our decision tree, using entropy and information gain to help us decide where our splits should actually be. Finally, we discussed the importance of pruning decision trees to address overfitting, since decision trees are very prone to overfitting. At this point, we'll now dive into our lab and gain a deeper understanding of using decision trees in practice. I'll see you there.
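As promised above, here is a minimal sketch of the fit, predict, tune, and regressor steps. It uses synthetic data from make_classification as a stand-in for the lab's dataset, and the train/test split and grid values are illustrative assumptions, not the course's exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic stand-in data; in the lab you would use the course dataset instead.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the instance on the training data, then predict on the holdout set.
DTC = DecisionTreeClassifier(criterion='gini', max_features=10, max_depth=5)
DTC.fit(X_train, y_train)
y_pred = DTC.predict(X_test)

# Tune the pruning hyperparameters with cross-validation (grid values are illustrative).
param_grid = {'max_depth': [3, 5, 7, None], 'min_samples_leaf': [1, 3, 5]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)

# For a continuous outcome, swap in the regressor; many hyperparameters carry over.
DTR = DecisionTreeRegressor(max_depth=5)
```

Note that GridSearchCV refits the best model on the full training set by default, so grid.predict(X_test) would give predictions from the tuned tree.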