This lecture's about the tradeoffs and the different components of building a machine learning algorithm. So remember, from the previous lecture that we talked about the different components of building a machine learning algorithm. We started off with a question. So this is the question we are trying to answer, and we try to make it as concrete as possible. Then we collect some data. The data that we're going to be using to build our machine learning algorithm and to the, apply that algorithm later on. From that data we compress it into a set of features that we're going to use to predict with. And then we feed those features into an algorithm which we'll then use to apply to predict things. In my experience, it's been that the question is the most important component of building a good machine learning algorithm, getting a very concrete and specific question where you can collect data. The next most important step is to be able to actually collect data. Frequently the question that you actually really want to answer, the data won't be available, or the data will be only available in a sort of tangential. Form. Then creating features is an important component in that if you don't compress the data in the right way you might lose all of the relevant and valuable information. And finally, in my experience it's been the algorithm is often the least important part of building a machine learning algorithm. It can be very important depending on the exact modality of the type of data that you're using. For example, image data and voice data can require certain kinds of prediction algorithms that might not necessarily be as. Important for different kinds of prediction. So an important point that I think is worth hammering home is driven was actually first quoted by John Tukey, which is that the combination of some data and and aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. In other words, an important component of knowing how to do prediction is to know when to give up, when the data that you have is just not sufficient to answer the question that you're trying to answer. The key point that you need to remember. When making that decision is garbage in garbage out. In other words if you have bad data that you collected. Or data that isn't very useful for performing predictions. No matter how good your machine learning algorithm is. You'll often get very bad results out. In general the easiest thing to predict is when you have data on the exact same thing that you're trying to predict. In other words when you're. In the Netflix prize they were trying to predict new movie ratings, how people would rate movies and they were trying to make that prediction on the basis of old movie ratings. And so that's a very direct relationship, you can imagine how old movie ratings would be. Governed by similar process as the new movie ratings. In my area one thing that you might do is collect molecular measurements such as gene expression and that's a that's a measure of something that's going on in your body. And you might use that to try to predict how people respond. To disease or to other different medical conditions. Again they're sort of a, you can imagine a pathway from the gene expression data we've collected, the molecular information we've collected, a value to your disease status. In general it depends, what is a good prediction, what are the features that you would collect to be able to predict. But in general the closer that you could get to similar types of data, the closer that you can do to perform a high prediction of accuracy. This is a link here to a video that talks about the, unreasonable effectiveness of data. In other words, that getting a lot more data can often beat out getting much better statistical or machine learning models in almost every case. And the most important step in developing a machine learning algorithm after you define a question is collecting the right data. Data that's relevant for the question you're trying to answer. The features also matter. So the important properties of good features are that they compress the data in a way that make it possible to compute your prediction, or your machine learning algorithm, in a very simple way. And maybe are more interpretable than the data that you've collected that's the raw data that might be complicated. They're, the good features retain all the relevant information while compressing the data, and that can be a hard balance to strike. They're also usually created with expert application knowledge, and this is something that there's a debate in the community about whether it's better to create features automatically or whether it's better to use expert domain knowledge. And in general it seems that the, expert domain knowledge can help quite a bit in many, many applications and so should be consulted when building a features for machine learning algorithm. Some common mistake are trying to automate feature selection in a way that doesn't allow for you to understand how those features are actually. Being applied to make good predictions. Black box predictions can be very useful, they can be very accurate but they can also change on a dime if we're not paying attention to how those features actually do predict the outcome. Not paying attention to data specific quirks is another problem that can come up so in some cases. The function of a particular set of data set might be that there's outlines if there's weird behaviors of specific features and not understanding those can cause problems. And obviously throwing away information unnecessarily is not a good idea. You can automate feature selections sometimes with care, in fact there's a whole area of research that's been dedicated to this so what I would call sort of semi supervised learning or deep learning. And so the idea in this paper was to try to use, to try to discover features of YouTube videos. That could later be used in prediction algorithms. So for example they were able to create a very accurate predictor of whether you were a cat in a video or not. Based on a bunch of features they sort of collected in an unsupervised way. In other words they filtered through the data in a way to identify those features that might be useful for later predictive algorithms. But even when they did this, they went back and looked at those features and tried to figure out why they would be predictive and so for example these features, this feature here makes it very clear why it would be a good predictor for a cat. Because you can kind of see the image of a cat there in the video. Feature that they've collected. Algorithms matter less than you'd think, and this can be a bit of a source of surprise and frustration for some people. So this is an, a table where they actually tried predicting a variety of different prediction tasks, for example, sort of a segmentation task, predicting votes in the US House of Representatives, and predicting wave forms, and a bunch of other different prediction tasks. And so they did it with two different ways, first they used something called linear discriminant analysis which is sort of a very basic early predictor you might learn. And then they also tried for each data set to find the absolute best prediction algorithm they could have and then this table shows the prediction error of these two different approaches. And you can see that the best prediction error is always a little bit better than the linear discriminant error. But, it's actually not that far off. So, for example, here in this Pima prediction, it was 0.19 was the, 0.197 was the error with the best method. And 0.22 for linear discriminate analysis. So it seems like using the same method over and over again did make the prediction error worse, but not incredibly worse. And so that's very typical of many applications. That. Using a very, sensible approach will get you a very large weight of solving the problem. And then, getting the absolute best method can improve, but it often doesn't improve that much over, sort of, most good sensible methods. Some issues to consider when building a machine learning algorithm are there, there's different components to it. First, it might, you might need it to be interpretable. In other words, if you're going to be showing this machine learning algorithm to doctors or, in my case. Or in the case of a web company you might want to show this to the CEO. You want your predictor to be interpretable so they can understand it. Part of being interpretable is often being simple, in other words, being very easy to explain is often better than being really complicated and hard to understand. And often you have trade offs with respect to accuracy, so getting those two qualities can reduce your accuracy, and the question is how much do they. Reduce your accuracy. Is it too much to give up the interpretability and the simplicity that you gain? You also want it to be scalable and fast. By fast I mean it's very easy to, build the model, it's very easy to test the model in small samples. Scalable means it's easy to apply to a large data set. Whether that's because it's very very fast or whether it's because it's parallelizable across multiple samples for example. So in the general prediction it's about accuracy tradeoffs. So the idea is sometimes you will trade off a lot of accuracy or a little bit of accuracy for interpretability. Sometimes you'll trade it off for speed or simplicity or scalability. But in general, what you're trying to do is find that right balance between those, those other features of the prediction algorithm that might be important to you, and how accurate you need it to be. Interpretability often matters, so for example, in medical studies, this is a paper that studied how some physicians like better prediction rules that look like this, they look a little bit like decision trees. In other words. You start off and you say if total cholesterol is above a particular value and you smoke, then you get a certain outcome. So those sort of if/then statements can be very interpretable to some people, and so that's about the reason why people like things like decision trees. And that can matter more or less depending on what your, the purpose of your algorithm is. But interpretablity prevents things like the Google flu trends problem where people didn't really understand how the features were directly tied to the rate of flu. And so if you don't understand how the features are working, you can get caught when the algorithm starts to, have problems or fail. Scalability also matters. A very interesting component of this Netflix challenge is that the Netflix prize was a million dollar challenge, and they had lots of teams compete, and several teams produced analytic algorithms that were much better at predicting than Netflix original algorithm. But it turns out they actually never implemented the final algorithm on their production machinery, and the reason was. That the algorithm didn't scale very well. It was a blend of many, many different machine learning algorithms and it took a long time to commute, compute particularly on the huge data sets that Netflix was collecting. And so they actually went with something that was a little less accurate, but quiet a bit more scalable. So that tells a lesson of accuracy isn't always the best. And only decision maker when trying to build prediction algorithms. Although that's often the one that gets focused on the most.