Hi everyone, and welcome to our lecture on mathematical modeling. In this chapter, we're going to take some of the theory that we learned in prior courses and apply it to model some real-life situations using our knowledge of pre-calculus. First, before we get into modeling and different functions, let's talk about what a model is. Think about anything that you've modeled before, in this or any other class. What we're really talking about is a specific type of function. But there's more to it than just writing f of x equals something, or maybe even a multivariable function. There are a couple of things about modeling that we want to talk about, and we have a nice image here from the Stepping Stones Project at Indiana University. Modeling is usually a cycle; it's not a one-and-done approach. Very rarely does it ever happen that way. This is again very broad, and the specifics might change a little bit from problem to problem. But the modeling cycle begins by asking a question about something of interest, something that you want to learn more about, something that you want to know. I wonder what happens if I change one variable? If I change the input, what happens? Of course, we then want to gather some data relevant to the question. We're going to see a lot of scatterplots, and a lot of other things like Excel spreadsheets are going to come in here. We want to conjecture a model that describes the data, and then test the accuracy of this model. You're going to have a model, and you're going to predict things. Usually you test it against some other gathered data. Sometimes these are called your training data and your test data. Then we want to leave with some modifications. We want to change some things, and we always want to make our model better, keep testing it, and keep updating it as we learn more information. This cycle is repeated until we really feel we have the best understanding that we can of whatever we are trying to study.
You can see here we're going to grab a real-world problem. There it is. We want to simplify it under some assumptions. Often, when you start, you start with a simple model, and we're going to do that in this chapter, and we get a real model. We're going to test it against some data. We try to abstract it, and we're going to calculate some things. Often this is done with a computer, although you certainly can do it by hand. We're going to reach some conclusions, always within our parameters and framework. We're going to interpret the results, which is extremely important. What are the units? Does the answer make sense? Then we keep adjusting as we go. But the main topic in the sections to come is the relationship between two variables. One of the main themes that you're going to see is that one variable can be strongly influenced by other variables, and, if you've taken a little stats course, you know there can be variables you don't even consider in your model. Sometimes they're called confounding variables or lurking variables; there are other pieces that go into it. At least with the known variables, we're going to see what the relationship is between them. Most data studies look at more than one variable. We're going to keep things simple for these intro examples and just study two, but you can certainly study more than two. Everything in multivariable data analysis builds on the study of two. Understanding two variables really well, how to graph them, and understanding the tools will allow you to then extrapolate and do more complicated modeling. Here are the principles that are going to guide our work. Whenever we get information, of course, we always want to use our prior background, and we want to look at summary statistics. You want to go through all your calculations. We're going to look for overall patterns in the data.
We're going to look for deviations from the pattern, and when we find patterns that are regular, there's usually a way to describe them briefly using some model. What you see here on the screen is a scatterplot of Old Faithful eruptions. This is a geyser in Yellowstone National Park in the US, and this shows on the x-axis the eruption duration, how long it lasts in minutes, and on the y-axis the waiting time between eruptions. One thing they say about Old Faithful is that it is very, very regular. You can see here lots of data has been graphed. If I handed you a table, a spreadsheet of this information, your eyes might glaze over. It's very, very difficult for us to find patterns in data by just looking at tables. There are many ways to do this, but the most common way to display the relationship between two numerical variables is with a scatterplot. The first thing you want to say about a scatterplot is that it shows the relationship between two numerical variables from the same dataset. Just some subtle things here about scatterplots. Sometimes you see scatterplots done horribly wrong. A scatterplot is between two variables, so it's a nice way to do a two-variable comparison. It's very important that both variables come from the same dataset; you don't grab some dataset over here and another, completely different one over there and try to graph them together. It's the same dataset, and it's important that you graph numerical variables. You can't graph what are called categorical variables, like letter grades or something like that. You really want two numerical, or quantitative, variables on the axes of a scatterplot. Always plot the explanatory variable, if there is one, on the horizontal axis, and as a reminder, we usually call the explanatory variable x. The output variable, or the response variable, as it's sometimes called, will be on the y-axis.
Once in a while, you're just trying to look at data between two things where there's not necessarily an explanatory and response distinction. In that case, either variable can go on the horizontal axis. From looking at the scatterplot, you can see that there are clusters in the data, and we see a little line here that's trying to trace what is maybe the middle of this data. But you can see there's some pattern going on, and if I just showed you a table with all these things scattered about, my guess is that the pattern would not be as obvious as it is in this very nice scatterplot. I mentioned before that you can also do things with multivariable datasets. I just want to show an example of what a very fancy scatterplot would look like. This is using not only an x, y, and z axis, but it's using color to designate a fourth variable. They can get fancy, but I'm only going to show you this. We're not going to really do much with this in this course, but you can see that they can get complicated. Mastering the two-dimensional case and all the pieces that go into it will allow you to do the fancy stuff like the one on the screen here, which is a three-dimensional scatterplot with color. I want to provide a couple of examples here of scatter plots just to show that they can look very, very different, even in the two-variable case. Whenever you get different scatter plots, they may have lots of data points, or they may have a few data points. In general, when you're working with scatter plots and you're graphing things, you always want to apply the usual strategies of data analysis, and in particular, scatter plots give you an overall pattern for the data. What you should look at is where the data is clustered and whether there are some spots that are off the line. What's going on with those? You want to look at those and study them separately and see if they truly are deviations.
So we describe scatter plots and their overall patterns using the direction, form, and strength of a relationship. These are the things you're going to hear me say over and over again. One of the first deviations that we're going to study is outliers. These are usually defined as individual values that fall outside the overall pattern. These guys will take some extra study on our part. The graphs that I'm showing here on the screen are scatter plots. I think the direction is pretty clear on the first one: there are lots and lots of data points that are clearly increasing. So as x increases, the y value increases. You can see a nice diagonal plot, and that's pretty much the same in the second one as well. In the third one, the pattern is less clear. It's a little more random, but there's a nice cluster up on the top right of the graph, where the x value is very positive and the y value is pretty positive. So there are lots of data points there. Then here's another one. This particular one is about piano practice: the time spent practicing and the number of incorrect notes played. It makes sense that as you increase your hours practicing, you're going to decrease the number of incorrect notes. So we would say that this direction is negative, and this gets into the idea of association. Keep these different pictures in your mind. Some people always have one default picture of a scatter plot in their mind. But I want to define two words here. We say two variables are positively associated when above-average values of one tend to accompany above-average values of the other. In this particular case we have positive association. For positive association, you have your x-axis, you have your y-axis, and the points tend to slope up and to the right. So you get a graph that usually goes up. This is positive association.
Two variables are said to be negatively associated when above-average values of one tend to accompany below-average values of the other. With negative association, we tend to have a scatter plot that, for the most part (no hard and fast rules here), slopes downward as you move from left to right. These are not the only ways to have scatter plots. There are lots of different scatter plots; if you can think of it, you can graph it, and you can find a data set that fits it. You can certainly have a scatter plot that starts to curve. You can have a curved relationship, and then you can have different strengths of relationship. In the example I drew for positive association, I drew the dots pretty close together, but for negative association, let me just draw some more dots randomly down the page here. Now we would say that the first one is a strong positive association and this one is a little weaker; it doesn't follow as clear a pattern. So we can get into strong, weak, positive, and negative, and we can mix them all up. Let me give an example. For centuries, people have tried to associate brain size with intelligence. So here is some data. I'm going to give you the brain size, in units of 10,000 pixels of a screen, of six individuals. We have brain sizes 100, 90, 95, 92, 88, and 106. Their intelligence, that is, their IQ score, is measured as 140, 90, 100, 135, 80, and 103. It's just a nice data set, obviously very small, too small to be anything meaningful, but enough to make a point. If you stare at this very small table, do you notice any patterns? It should not be obvious to you what's happening here. But what I want to do, and we're going to show you how to do this in a lab, is point out that from this data, from any data set, you can make a scatter plot. So here on the screen you see a scatter plot of these numbers.
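As a quick aside, the definitions of positive and negative association can be checked by hand on the six data points just listed. The sketch below is only one informal way to do that, not a method from the lecture: it counts how many individuals sit on the same side of both averages and how many sit on opposite sides. The function name `count_association` is made up for this illustration.

```python
from statistics import mean

def count_association(xs, ys):
    """Count points on the same side of both averages vs. opposite sides.

    Hypothetical helper for illustration only. A point exactly at an
    average is counted in neither group.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    same_side = 0
    opposite_side = 0
    for x, y in zip(xs, ys):
        product = (x - x_bar) * (y - y_bar)
        if product > 0:        # both above average, or both below
            same_side += 1
        elif product < 0:      # one above average, the other below
            opposite_side += 1
    return same_side, opposite_side

# The six individuals from the lecture's example.
brain_size = [100, 90, 95, 92, 88, 106]
iq = [140, 90, 100, 135, 80, 103]

same_side, opposite_side = count_association(brain_size, iq)
print(same_side, opposite_side)  # prints: 4 2
```

Four of the six individuals fall on the same side of both averages, which leans toward positive association, but two do not, so the tendency is far from clean.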
We have a nice title across the top, our x- and y-axes are labeled, and we get a scatter plot. Now I can visualize and see the pattern and the association. Let's just look at this graph for a second and say what we see. The dots are graphing the data set, giving us a little visual. You can see they move up and to the right. So as your brain gets a little bigger, you have a little more intelligence, based on this six-point data set. This is just for this data set. So we would say it has a positive association. However, would you say this is a strong association? You can see the dots don't necessarily follow a line; they are scattered about. There's a fair amount of distance between these dots. So it's a weak association. But you can see, just in this very small example, that it's difficult to get any form from the table. For the human eye, it's better to look at a scatter plot and try to look at associations. You can imagine if I'd handed you a spreadsheet of 1,000, 10,000, or 100,000 data points, that's extremely difficult to study. So we want to look at scatter plots. The next topic of discussion is correlation. A scatter plot displays the direction, form, and strength of the relationship, but straight-line patterns are particularly important. We're going to look for things that follow the general shape of a line, positive or negative. This is a simple pattern, and honestly it's simple, but it's quite common. A straight-line relationship is strong if the points lie close to a straight line, and we say it's weak if the points are scattered about with lots of space around them. Our eyes are just not very good at judging how strong a relationship is. So what we'd like to have is some mathematical measure of how strong the relationship is. We have this, and it is called correlation.
Correlation is a measure that describes the direction and strength of a linear relationship, or straight-line relationship, between two quantitative variables, and it is usually denoted as r. Now calculating r, this number which is going to tell us how strong the relationship is, is usually a lot of work and best done by a computer. If I have n points, I'll give you the formula here. I'm just going to write it down. You never need to memorize this, but I want you to see the formula. So we have n data points, and we'll list them as a table. You can imagine you have your x variable and your y variable, and you have your x_1, x_2, dot dot dot, and your y_1, y_2, dot dot dot, whatever they are, and you have n of them. So this goes all the way down to x_n and y_n. This formula for correlation, maybe you've seen it in a statistics course before. It goes a little bit beyond the scope of the class, but you should know what it means. One over n minus one is the number in front, and then we're going to add, and I'm going to use a Greek letter here to denote this, which I'll explain in a second. Inside the sum we have x_i minus x bar over s_x, times y_i minus y bar over s_y. Let me explain some of these things here. There are a lot of symbols on this page that are probably worth going over. The sum runs from i equals one to n. So this first thing, the Greek letter that you see on the screen, that is a great Greek letter to know. This is a capital sigma, and it means we're going to add. So you calculate each of the terms in parentheses, multiply them, and add the products up over each pair, x_1, y_1, x_2, y_2, all the way down to x_n, y_n. The bars over the letters are your averages. Sometimes this is called the mean. So that's what x bar is, and of course there's a y bar on the other side. The x variables should really have an i as a subscript. So these are your values, and you calculate this for x_1 first. So it's x_1 minus the average, divided by s_x. So what are s_x and s_y? Maybe you took a statistics course and you know what these are.
These are your standard deviations. So s_x is the standard deviation of all your x values and s_y is the standard deviation of all your y values. Now I could get into what that is; there is another formula to calculate it. Maybe you've seen it before, maybe you haven't. But for our purposes, don't worry about it. We're more interested in the interpretation of r, and we're going to use some calculator to get r for us. But I want you to see the formula and realize that it's going to feel a little statistics-like for a minute here. So this is the idea of correlation. This number r is going to be positive or negative, but, and this is the important part, it's going to measure how strong the linear relationship is. So we are going to use some calculator to get r. So imagine that r, the correlation, the straight-line correlation, the linear correlation, is going to be given. So you find this and you have some number, usually a decimal. So the question is, what do you do with it? How do I use this thing? First off, if r is positive, which we could write as r greater than zero, then we have a positive association between the variables. Similarly, if r is negative, which we'll write as r less than zero, then we have a negative association between the variables. Now appreciate that for a minute. Let's say our scatter plot is really bad, or there's too much data, whatever. We can mathematically calculate some number, and it will tell us: if it's a positive number, we have positive association; if it's a negative number, we have negative association. The second way to interpret this correlation, this number r, is that it's always going to be between negative one and one. So it's always a number between negative one and one. Algebraically, you can write that as negative one less than or equal to r, less than or equal to one.
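To make the spoken formula concrete, here is a minimal sketch that computes r exactly as described: standardize each x and each y with its mean and standard deviation, multiply the pairs, add them up, and divide by n minus one. The function name `correlation` is an assumption chosen for this illustration, and the data are the six brain-size and IQ values from the lecture's small example. It uses `statistics.stdev`, which is the sample standard deviation (the one that divides by n minus one), matching the formula.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r = 1/(n-1) * sum of ((x_i - x_bar)/s_x) * ((y_i - y_bar)/s_y).

    Illustrative helper, not a library function.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)   # the "bar" values: averages
    s_x, s_y = stdev(xs), stdev(ys)     # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# The six individuals from the lecture's example.
brain_size = [100, 90, 95, 92, 88, 106]
iq = [140, 90, 100, 135, 80, 103]

r = correlation(brain_size, iq)
print(round(r, 3))
```

For this data set r comes out to roughly 0.38: positive, but well short of one, which agrees with reading the scatter plot as a weak positive association.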
Now, how many numbers are there between negative one and one? There are a lot, infinitely many decimals. But I always think of this as a scale. So you have the number line, you have one over here on the right, you have negative one on the left, and you have, like on those old-timey scales, some arrow or something like that, which can point anywhere. Whatever r is pointing to, there are lots of numbers on this interval between negative one and one. The closer r is to one or negative one, the closer the points lie to a straight line. The extreme values, r equals one or negative one, only occur when the points in the scatter plot lie exactly along a straight line. So if you move closer to the edge, closer to one or closer to minus one, if you have something like 0.9999 or negative 0.9999, then this is a strong linear relationship. A value near positive one gives you a strong positive association, and a value near negative one says it's a strong negative association. If you have a value like 0.001 or 0.01, something extremely close to zero on the number line, we have no linear association. So we are getting at a way to compare using this correlation: we have strong positive, strong negative, or none, and anything in between is perfectly possible. So for example, if I give you r equals 0.994, well, that's going to be strong because it's close to one, and it's going to be positive. So we have a strong positive straight-line pattern, and that'll be a graph that we'll see in a second. Or perhaps I give you r equals 0.002. Well, this will be a weak to no linear, that is, straight-line, pattern or association. If I gave you r equals negative 0.892, well, it's close to negative one, so it's pretty strong, and it's negative as well. You can imagine anything you want. Usually three or four decimal places is more than enough to tell us what we are looking for. It's pretty common to see three or four of them.
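The reading of r as a position on a scale from negative one to one can be sketched as a small function. Note that the lecture gives no numeric cutoffs between "weak" and "strong"; the thresholds below (0.1, 0.5, and 0.8) and the function name `describe_r` are assumptions made up purely for illustration, and different textbooks draw these lines in different places.

```python
def describe_r(r, none_cutoff=0.1, weak_cutoff=0.5, strong_cutoff=0.8):
    """Turn a correlation r into a rough verbal description.

    The three cutoffs are hypothetical defaults, not standard values.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    size = abs(r)  # distance from zero measures strength
    if size < none_cutoff:
        return "weak to no linear association"
    direction = "positive" if r > 0 else "negative"  # sign gives direction
    if size >= strong_cutoff:
        strength = "strong"
    elif size >= weak_cutoff:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} linear association"

# The three example values mentioned in the lecture.
for value in (0.994, 0.002, -0.892):
    print(value, "->", describe_r(value))
```

With these assumed cutoffs, the three values from the lecture come out as strong positive, weak to none, and strong negative, matching the spoken interpretation.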
Let's look at an image with a couple of scatter plots that give examples of this relationship. We're going to use the letter r. A couple of things here. The first scatter plot is three points. The line slopes downhill, and the points all line up on that line, so it's strong. That's r equals negative one. You only get negative one if the points land exactly on a line. In this next scatter plot, the points are moving downward. They have a somewhat weaker association. They're not all on the line. We're going to get some value of r that's near negative one, but it's not going to equal negative one. Same thing here: if I have an r value between zero and one, it's positive; the points are off the line, but they're moving in the general direction, moving upwards. If I'm exactly on the line, then we're going to have r equal to positive one. If I have a random scatter plot, which again is perfectly normal, or something that maybe follows a parabola or some other function, then we're going to have r at or near zero, which is saying that there's no linear relationship. This is another way to interpret r based on its decimal value and what you can expect from the scatter plot. There are a couple of things to mention as we interpret r. If the units are changed on the x or y variables, r does not change. Think about that for a minute. It doesn't matter if I measure the temperature in Celsius or Fahrenheit. Yes, I'll get a different graph, but the relationship between temperature and, say, pressure or something else does not change. The correlation is, we say, independent of the units, which is quite nice. That's nice to know. It doesn't matter. You don't have to convert anything to some standard units. There are other ways to interpret correlation. Correlation ignores the distinction between the explanatory and the response variables. Remember, the explanatory variable is your x-variable, and the response goes on your y-axis. What does this mean?
If you graph these things in reverse, if you swap which set of data goes on the x-axis and which goes on the y-axis, the correlation does not change. It doesn't matter. As far as correlation is concerned, there's no wrong variable to put on the x-axis. That's another nice thing about correlation. It's nice too when you're using a computer: there's no wrong way to feed the data in, and you won't get a different value of r than someone else. The third thing I want to point out is that correlation measures the strength of straight-line association between two variables. This is important: it's only for straight-line relationships. You may have a beautiful scatter plot that perhaps follows a parabola. There's clearly a relationship, maybe some squared relationship, between the two variables. If you test this for a straight line, then correlation's going to say, "No, r is about 0." There's no straight line. The wrong way to interpret that is to say, "Well, since there is no straight-line relationship, there's no relationship at all between these variables." That's not true. Clearly, there's a parabola or some other graph; it's just not a straight line. Just watch out for that interpretation as we go. Last but not least, correlation is strongly influenced by outliers. This is a little word of caution. If you have data and you have outliers in your scatter plot, then they are going to affect the value of r. This is like an average or something like that. If you and Bill Gates are sitting at a bar, the average income of everybody there is that of a millionaire. Just be careful, be careful, be careful. If you have outliers in your scatter plot, you might want to deal with them separately. Perhaps you can take them out of the data if you can explain them for whatever reason, but just know that r is influenced by outliers.
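The four properties just described (units don't matter, swapping x and y doesn't matter, only straight-line patterns are measured, and outliers have a big influence) can each be checked numerically. The sketch below is one way to do it. Every number in it is made up for illustration, and `correlation` is an illustrative helper that follows the formula described earlier in the lecture, not a function the lecture itself provides.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r via standardized values (illustrative helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# 1. Units: the same temperatures in Celsius and Fahrenheit (made-up data).
celsius = [0, 10, 20, 30, 40]
response = [3, 5, 4, 8, 9]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
r_celsius = correlation(celsius, response)
r_fahrenheit = correlation(fahrenheit, response)

# 2. Swapping which variable plays the role of x and which plays y.
r_swapped = correlation(response, celsius)

# 3. A perfect parabola, symmetric about zero: a clear relationship,
#    but not a straight-line one.
parabola_x = [-3, -2, -1, 0, 1, 2, 3]
parabola_y = [x * x for x in parabola_x]
r_parabola = correlation(parabola_x, parabola_y)

# 4. Outliers: five points with no linear trend, then one extreme point added.
flat_x = [1, 2, 3, 4, 5]
flat_y = [2, 1, 3, 1, 2]
r_without_outlier = correlation(flat_x, flat_y)
r_with_outlier = correlation(flat_x + [20], flat_y + [20])

print("Celsius vs Fahrenheit:", r_celsius, r_fahrenheit)
print("x and y swapped:", r_swapped)
print("parabola:", r_parabola)
print("without / with outlier:", r_without_outlier, r_with_outlier)
```

In this made-up data the first three values of r agree up to rounding error, the parabola gives essentially zero even though y is completely determined by x, and a single extreme point moves r from about zero to above 0.9.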
In summary, the key takeaway here is that there are many kinds of relationships between variables and there's more than one way to measure them. One of the things you'll see, used correctly or incorrectly, is this value r; the correlation is common. Some computers give it by default, but it has its limitations, as we just saw. There are many pros to it. Units don't matter. It doesn't matter which variable goes on the x-axis and which goes on the y-axis. It's nice to compute. This is why it gets spit out a lot. But the wrong thing to do is just use it blindly. You want to make sure it's used only for quantitative variables. It doesn't make sense if we're talking about something like gender or favorite color; that doesn't work. You just want to be sure it's used correctly. The other piece here is that it's not a complete description. Just because someone hands it to you and tells you that their relationship is strong positive or strong negative, there might be more things going on than just that. As always, and hopefully you've heard this before, correlation does not imply causation. Be careful when you're presenting your data, showing your scatter plot, and you say, "Because of x, y is a result." X causes y? No, no. All this does is show that the two variables are correlated; it doesn't necessarily imply that one variable causes the other. Just be careful how you interpret and present your data. All right. This video was just an introduction to the idea of modeling and the idea of correlations and associations. We're going to calculate r. In the next video, we're going to work with some data sets. We're going to make some scatter plots. I look forward to it. I'll see you then.