>> So last week we introduced you to the mechanics of regression analysis, and this week we're going to look at it in a little more detail. We're going to look at some interpretations of regression analysis using particular data sets, and think about some of the problems that arise in interpreting regression results. There are really two ways you can think about a regression. One is as a way of using one variable, X, to explain another variable, Y. The other is as a way of using the relationship between X and Y to forecast what the value of Y is going to be in the future. This week we're going to focus mainly on the first of those, thinking about the causal explanation that we might advance for the relationship between X and Y. In later weeks, we're going to come back to the issue of forecasting based on regression analysis. In terms of the sports data, we're going to look at maybe one of the most fundamental relationships, which is the relationship between the players and the performance of the team. Obviously the players are the most important input, and one way of representing the players as inputs into the performance of the team is to think of how much the players are paid. There is a labour market for players, and better players tend to get higher salaries. So you would think that teams with higher wages would tend to be the more successful teams, and we're going to examine whether that's true. We're going to look at four different leagues: the NBA, the English Premier League, Major League Baseball and the NHL. Today we're going to start off by looking at that relationship in the NBA, so let's get started. The first thing we're going to do is, as ever, import the basic packages we need to carry out the statistical tasks, and then we're going to load the data.
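As a rough sketch of that first step, the snippet below imports pandas and loads a small table. The lecture does not name the data file, so an in-memory CSV stands in for it here, and the column names (team, season, salaries, wpc) and all values are assumptions made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the course data file: the real file name and
# column names are not given in the lecture, so these are assumptions.
raw = io.StringIO(
    "team,season,salaries,wpc\n"
    "A,2012,60.0,0.55\n"
    "B,2012,40.0,0.45\n"
    "A,2013,70.0,0.60\n"
    "B,2013,50.0,0.40\n"
)

# Load the data into a DataFrame, one row per team per season
nba = pd.read_csv(raw)
print(nba.head())
```

With a real file, the same pd.read_csv call would take the file path instead of the in-memory buffer.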
So we're going to use NBA data here, and we're going to look at the total amount spent on wages and relate that to the win percentage of teams in the NBA regular season. Let's look at the data that we've got. You can see here a very simple description: we have 210 observations in our data, running from 2012 to 2018. And if we want to see some more information, we use .info(), which tells us the kind of objects we have in our data. The first thing I want to do is look at the total salaries paid out in any one season. We do that using .groupby and .sum to generate a value of total salaries paid out in each season. And this immediately brings to our attention a very important point, which is that there's significant salary inflation in the NBA: between 2012 and 2018, the total salary payout more than doubled. Of course, it would be crazy to say that the quality of the players in the NBA more than doubled. This has more to do with the inflation in the value of television broadcast rights and so forth, and the capacity of the players to negotiate a share of that increasing revenue. That has an implication for how we think about performance. If you're going to use salaries to buy better players, then in 2012 it was in a sense cheaper to buy win percentage than it was in 2018, again because of the inflation in total revenue, not because of anything that changed about the players. We want to take that into account in our analysis, and the way we're going to do that is to put the total salary value back into our data frame. We're going to use pd.merge to put that data back into our NBA frame, so that we can see the total salary in each season. And then we're going to create a variable called relsal, which means the salaries of the team in question relative to the total salaries paid out in that season.
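The steps just described (season totals with groupby and sum, a merge back into the frame, then the relative salary) might look like the following sketch. The data are made up, and the column names team, season, salaries, wpc and allsal are assumptions; only relsal is named in the lecture.

```python
import pandas as pd

# Made-up numbers for three hypothetical teams over two seasons
nba = pd.DataFrame({
    "team": ["A", "B", "C", "A", "B", "C"],
    "season": [2012, 2012, 2012, 2013, 2013, 2013],
    "salaries": [50.0, 30.0, 20.0, 90.0, 60.0, 50.0],
    "wpc": [0.60, 0.50, 0.40, 0.65, 0.45, 0.40],
})

# Column types and non-null counts
nba.info()

# Total salaries paid out by all teams in each season
sumsal = (
    nba.groupby("season")["salaries"]
    .sum()
    .reset_index()
    .rename(columns={"salaries": "allsal"})
)

# Put the season total back alongside each team-season row
nba = pd.merge(nba, sumsal, on="season", how="left")

# Team salary as a share of all salaries paid that season
nba["relsal"] = nba["salaries"] / nba["allsal"]
print(nba)
```

Because relsal is a share, it sums to 1 within each season, which removes the effect of league-wide salary inflation from year to year.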
So what's important is not how much money you spend, but how much more or less you spend than your rivals in the competition. So we create the variable relsal, and let's now do a simple plot of relsal against win percentage. You can see here we have relsal along the X axis and win percentage along the vertical axis, and you can see the dots are fairly spread out. But there's a clear trend towards higher win percentages with higher relsal, that is, with higher spending relative to the total spending in the league. The line that you can see there is actually a regression line, so we can see already that there's going to be a positive relationship between salaries and win percentages. So let's look at that by running a regression, and we're just going to run the simplest regression we can. We're going to have win percentage on the left hand side, that's the Y variable we're going to explain, and relsal on the right hand side, that's the variable that's going to do the explaining. Here we get a simple regression output. The thing we're focusing on is the coefficient on relsal, which you can see on this line. Let's talk about the values: we've got a coefficient of 11.3 and a standard error of 1.78. When we take the ratio, we get a T statistic of 6.6, which gives us a P value of 0.000. In other words, relsal is highly significant in explaining win percentages for teams across the seasons. Now, it doesn't explain that much of the total variation. If we look at the R squared up here, we can see that it's only about 17%, so it's by no means the only factor that we should consider. But nonetheless, our first preliminary analysis suggests that it's relatively important. And if we ask how important, if we think about how win percentage is affected by relsal, we can do a rough calculation based on the 11.3 coefficient.
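A simple regression of this kind can be sketched with the statsmodels formula interface. This is a minimal illustration on simulated data, not the lecture's data set: the column names wpc and relsal and the slope of 10 used to generate the data are assumptions, so the printed numbers will not match the output discussed here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated team-seasons: salary shares between 2% and 5%, and a win
# percentage that rises with the share plus random noise (assumed values).
rng = np.random.default_rng(0)
n = 210
relsal = rng.uniform(0.02, 0.05, size=n)
wpc = 0.5 + 10.0 * (relsal - relsal.mean()) + rng.normal(0, 0.1, size=n)
df = pd.DataFrame({"relsal": relsal, "wpc": wpc})

# Win percentage (Y) on the left, relative salary (X) on the right
model = smf.ols("wpc ~ relsal", data=df).fit()
print(model.summary())

# The T statistic is the coefficient divided by its standard error
coef = model.params["relsal"]
se = model.bse["relsal"]
t_stat = coef / se
print(coef, se, t_stat, model.pvalues["relsal"], model.rsquared)
```

For the scatter plot with a fitted line, seaborn's regplot (for example sns.regplot(x="relsal", y="wpc", data=df)) is one option, though the lecture does not say which plotting call was used.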
Roughly speaking, the 11.3 coefficient says that increasing a team's share of total salaries by one percentage point will increase win percentage by about 0.11, that is, 11 percentage points. Now, that's a very large increase in spending. Remember, the share of total salaries ranges somewhere between 2% and 5% for a team, so in that sense increasing the share by one percentage point is a very large increase in salaries. Going from 2% to 5% will increase win percentage from about 33% to 66%. In other words, that's taking you from the bottom of the league to the top of the league. So being at the top of the salary spending tree is likely to put you near the top of the win percentage tree, and being near the bottom of the salary spending is likely to put you near the bottom of the league. What we have to ask ourselves is whether we find this a plausible description, and whether there is a danger that we might have missed something else. This raises the question of what's called omitted variable bias. We have an estimate of the relationship, but maybe it's overstated, or even possibly understated, because we've left something else out and that something else is correlated with our variable. That would mean that we've mis-estimated the relationship. Really, the only way to deal with omitted variable bias is to find other variables and include them in the regression, to see whether they change our estimate of the coefficient. So let's do that now, and we're going to pick a particular variable to add. That is what's called the lagged dependent variable, which in this case is your win percentage in the previous season. It's reasonable to think that there's some continuity in the performance of teams from season to season, since players are typically under contract for more than one year. Therefore we might expect that last season's win percentage will be correlated with this season's win percentage. We can test that by including last season's win percentage in our regression, and we're going to do that now.
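The back-of-the-envelope calculation from the simple regression can be checked with a few lines of arithmetic. The only inputs are the 11.3 coefficient and the 2% to 5% range of salary shares quoted in the lecture; the variable names are just for illustration.

```python
coef = 11.3  # coefficient on relsal from the simple regression

# relsal is a share of league salaries, so one percentage point is 0.01
change_per_point = coef * 0.01
print(round(change_per_point, 3))  # change in win percentage, as a fraction

# Moving from the bottom (2%) to the top (5%) of the salary share range
low_share, high_share = 0.02, 0.05
total_change = coef * (high_share - low_share)
print(round(total_change, 3))
```

The first figure is roughly 11 percentage points of win percentage per point of salary share, and the second is roughly the 33 point gap between a 33% and a 66% win percentage.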
If we just sort the teams, we can see each team by year, and we want to create a variable which is just the previous season's value for each team. Python does that very nicely for us: we use the command shift, and when we apply .shift(1), that creates a variable which is the previous value in the series. In this case, we're going to apply that to win percentage, and we're doing that by team. When we create that, you can see here that for each team we've now added the win percentage for the previous season. Notice, of course, that in the first season in our data, 2012, there is no lagged value, because we don't have the data for 2011. Okay, so let's see what that data looks like. If you want to look at the whole data set, you can open up the window, but now let's run our regression including the lagged dependent variable. So we've added WPC lag, and indeed that turns out to be highly significant. We can see the coefficient is 0.6 and the standard error is 0.06, so the T statistic is 9 and the P value is 0, so that's a highly significant variable. What's also important to note is that the coefficient on relsal has now fallen from 11 to just 2, and the standard error is 1.86. So the T statistic is about 1, and that implies that relsal is statistically insignificant. So not only could we say we should have included the lagged dependent variable in our regression, but this now suggests that relsal is not a significant variable in determining team performance. Well, that's a salutary lesson in what happens if you omit relevant variables. But we should also consider whether we've really gone as far as we should. Maybe there are other variables we haven't included. Just as omission of a relevant variable can bias our estimates upwards, so omission of variables can bias our estimates downwards. So now let's look at another type of variable, one which takes each team itself into account. All the teams here are clearly different.
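Creating the lagged win percentage by team can be sketched as below. The data are invented, and the column names team, season, wpc and wpc_lag are assumptions (the lecture only refers to the new variable as "WPC lag").

```python
import pandas as pd

# Made-up win percentages for two hypothetical teams over three seasons
nba = pd.DataFrame({
    "team": ["A", "B", "A", "B", "A", "B"],
    "season": [2012, 2012, 2013, 2013, 2014, 2014],
    "wpc": [0.60, 0.40, 0.55, 0.45, 0.50, 0.52],
})

# Sort so that each team's seasons are in chronological order
nba = nba.sort_values(["team", "season"]).reset_index(drop=True)

# Within each team, shift win percentage down by one row, so each season
# carries the previous season's value; the first season has no lag (NaN)
nba["wpc_lag"] = nba.groupby("team")["wpc"].shift(1)
print(nba)
```

Grouping by team before shifting matters: a plain shift on the whole column would carry one team's last season into the next team's first season. In a formula-based statsmodels regression such as "wpc ~ relsal + wpc_lag", rows with a missing lag should be dropped by default, so the first season falls out of the estimation sample.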
They have different histories and cultures, and so we might think that those histories and cultures will play a role. We can allow for that by including something called a fixed effect. If we allow in our regression for a dummy variable, which is 1 for the team in question and 0 otherwise, we can allow for the fact that each team is different, and that might then alter our estimates of the coefficients we've already calculated, if these fixed effects are really significant. Adding fixed effects in Python is really, really simple. The variable here is team, and each team has a separate name, so we just add C(team), with a capital C, to the regression, and that will now estimate a fixed effect for each team in our regression. Of course, we still have relsal and lagged win percentage in our regression. If we run that now, you can see we have a lot more coefficients, because we have a coefficient for each team in the data. You can see that some of these fixed effects are significant and some of them are insignificant, but overall these fixed effects play a role in determining the outcome. One way to see that is that the R squared has now gone up quite considerably, so fixed effects definitely seem to matter. How has the addition of fixed effects affected the variable that we're really interested in, which is relsal? Well, look here at the bottom: you can now see that the coefficient on relsal has increased to just under 5. Remember, in our previous regression it was only 2. It's now statistically significant: you can see the P value is 0.02, and remember our threshold for statistical significance is typically 0.05, so this tells us that relsal really did matter. So first we overestimated it, second we seem to have underestimated it, and now we might think that this is probably a reasonable guess as to what the effect of relsal really is. And we can then use that to calculate the impact of relsal on performance.
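Team fixed effects through C(team) can be illustrated with the statsmodels formula interface on simulated data. Everything below is an assumption for illustration (six hypothetical teams, invented team effects and slope), and the lagged win percentage is left out to keep the sketch short, so it is not the regression run in the lecture.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for i in range(6):                       # six hypothetical teams
    team_effect = 0.05 * (i - 2.5)       # invented team-specific shift
    for season in range(2012, 2019):
        relsal = rng.uniform(0.02, 0.05)
        wpc = 0.5 + team_effect + 5.0 * (relsal - 0.035) + rng.normal(0, 0.03)
        rows.append({"team": f"T{i}", "season": season,
                     "relsal": relsal, "wpc": wpc})
df = pd.DataFrame(rows)

# Without fixed effects: one intercept shared by all teams
pooled = smf.ols("wpc ~ relsal", data=df).fit()

# With fixed effects: C(team) adds a dummy for every team except one
# reference team, whose level is absorbed by the intercept
fe = smf.ols("wpc ~ relsal + C(team)", data=df).fit()

print(fe.params)
print(pooled.rsquared, fe.rsquared)
```

Comparing the two R squared values and the two relsal coefficients mirrors the comparison made in the lecture between the regressions with and without fixed effects.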
So here's the first example, then, of how you might think of running regression analysis in practice. It points out how you might think about looking for causal variables, then about omitted variables, and also about how to cope with heterogeneity, the fact that all of the teams are different, with a focus on how these changes affect the estimate of the coefficient you're really interested in, which is relsal in our data. What we're going to do for the rest of this week is really repeat the same exercise for three more leagues, to get a sense of how we should proceed in practice when we're conducting regression analysis.