If you want to describe the relation between two categorical variables, you can summarize the observed frequencies in a contingency table. And provided that the observations come from a representative random sample, these relative frequencies can be interpreted in terms of joint and marginal, as well as conditional, probabilities. In this video, I will explain how to calculate these statistics, but also show that this information is not sufficient to make a general statement about the relatedness between the two variables in the population.
Not so long ago, I had a most remarkable conversation with someone at a dinner party. His name was Paul. I'm sorry that I will digress a little bit at the start of this video, but I promise it won't take long before I get back on the main track. Paul was an art historian who explained to me that he was super excited that evening, because he had just been commissioned to investigate paintings from the Renaissance and Baroque periods. And not just on a dull topic, but to study the use of different objects, or artefacts, in still life paintings, like flowers and fruits. The study of old still life paintings may not sound exciting to everyone, but it certainly was to Paul. He explained how he was going to start by looking at all sorts of symbology, as well as the geographical and cultural context in which the paintings were created. He would also check his expectation that different topics were fashionable in different periods. In particular, the early Renaissance, the late Renaissance, and the Baroque. His plan was to start by distinguishing still life paintings with only fruit, with only flowers, and with mixed objects. Now, it was my turn. At the risk of sounding a bit pedantic, I asked Paul whether he was familiar with the term contingency table. He looked at me questioningly, and a bit irritated that I had so suddenly changed the topic. Next, I checked whether the term conditional probability sounded familiar. And he returned the same empty gaze.
I quickly assured him that I was in fact not changing the topic of our conversation, but could explain these terms in a matter of minutes. And that it would help him enormously in getting results for his investigation of still life paintings. I took a napkin and started to sketch. Let's assume you've gone through several collection catalogues and have studied a good number of paintings from the relevant art periods. I'm just going to use small numbers in the example here. Then, to investigate your question of a possible relation between the object that's painted and the period, you need this simple table where you count the number of paintings with fruit in each of the periods, and the same for the number of paintings with flowers and with mixed items. You make sure that each painting is only counted once, so it would not occur, for example, in the flowers column as well as the mixed items column, but only in one. In your study, you think that the period has an influence on the objects that are being painted. Then the period is considered the explanatory variable and the object the response variable. It is then customary to put the explanatory variable in the rows and the response variable in the columns. You can sum over each column to get the total number of paintings for each type of object, and over each row to get the total number of paintings per period. The numbers in the center of the table are called joint frequencies and those in the margins are the marginal frequencies. Because the number of paintings varies per period, it's not that easy to compare the different periods in this way. You should standardize the numbers per row. The way to do this is to divide the counts within a row by the corresponding row total. This gives the conditional proportions for the response variable within each row. When you do this, it's good to show or mention that the fractions in these rows sum to one, and also give the total number per row, so that you can always get the joint frequencies back.
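The napkin table can be sketched in a few lines of Python. The counts below are invented purely for illustration (the transcript does not give the numbers that were on the napkin), and the variable names are hypothetical. The sketch computes the marginal frequencies and the conditional proportions per row, and reports each row total so that the joint frequencies can always be recovered.

```python
# Hypothetical counts, invented for illustration only; these are NOT the
# numbers from the napkin, which the transcript does not give.
periods = ["Early Renaissance", "Late Renaissance", "Baroque"]  # explanatory variable (rows)
objects = ["fruit", "flowers", "mixed"]                         # response variable (columns)
counts = [
    [12, 5, 3],    # joint frequencies for the early Renaissance
    [10, 10, 5],   # joint frequencies for the late Renaissance
    [6, 15, 9],    # joint frequencies for the Baroque
]

# Marginal frequencies: totals per period (rows) and per object (columns).
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Conditional proportions: divide each count by the total of its own row.
conditional = [[c / total for c in row] for row, total in zip(counts, row_totals)]

for period, props, total in zip(periods, conditional, row_totals):
    cells = ", ".join(f"{obj}: {p:.2f}" for obj, p in zip(objects, props))
    print(f"{period:<18} {cells}  (sum = {sum(props):.2f}, row total = {total})")
print("column totals:", dict(zip(objects, col_totals)), "overall total:", n)
```

Multiplying a row's conditional proportions by its row total gives back the joint frequencies, which is why the row totals are printed alongside the proportions.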
Now, you can clearly see in my example that indeed in the early Renaissance, more fruits were being painted, whereas in the Baroque, flowers and mixed displays were more popular. The late Renaissance shows an intermediate pattern. I asked Paul whether he had an idea how to proceed to show that there is some sort of relation between the two variables. He suggested calculating the correlation coefficient. I confirmed that it's indeed possible to calculate the correlation coefficient for this type of data, but also pointed out that categorical data are different from ordinal or numerical data, in the sense that the ordering of the categories is arbitrary. We could have equally well chosen to interchange the columns of fruits and flowers, and then the numerical pattern would be very different. A measure of association should, however, still give the same value. I continued to explain that instead of stating that association between variables is measured, you could also say that the dependence is measured. What is painted depends partially on the period in which it was painted. And you could equally well look at the inverse of dependence, independence. In the contingency table, two variables are independent if the conditional distributions for one of the variables are the same for each category of the other variable. And the variables are dependent, or associated, if these conditional distributions are not the same. The best approach is to compare the conditional distribution in each row against the marginal distribution, which is the weighted average of the rows. These statements apply to a population. But in this case, Paul would, of course, not measure the complete population of paintings from these periods. Many paintings have, for example, not survived the ages, and Paul can't visit every museum. I explained that if he took a different sample from his population, the numbers in the table would differ.
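The comparison of each row's conditional distribution against the marginal distribution can be sketched as follows. The counts are again invented for illustration and all names are hypothetical. The sketch also checks that the marginal distribution equals the average of the row distributions when each row is weighted by its share of the sample.

```python
# Hypothetical counts, invented for illustration only (rows: periods, columns: objects).
periods = ["Early Renaissance", "Late Renaissance", "Baroque"]
objects = ["fruit", "flowers", "mixed"]
counts = [
    [12, 5, 3],
    [10, 10, 5],
    [6, 15, 9],
]

row_totals = [sum(row) for row in counts]
n = sum(row_totals)

# Marginal distribution of the response variable: column totals divided by n.
marginal = [sum(col) / n for col in zip(*counts)]

# Conditional distribution of the response variable within each row.
conditional = [[c / total for c in row] for row, total in zip(counts, row_totals)]

# Deviation of each conditional distribution from the marginal distribution.
# If the variables were independent in this table, every deviation would be zero.
deviations = [[p - m for p, m in zip(row, marginal)] for row in conditional]

# The marginal distribution is the average of the rows, weighted by row share.
weighted_avg = [
    sum(row_totals[i] / n * conditional[i][j] for i in range(len(counts)))
    for j in range(len(objects))
]

print("marginal:", [round(m, 3) for m in marginal])
for period, devs in zip(periods, deviations):
    print(f"{period:<18}", [round(d, 3) for d in devs])
```

Swapping two columns only reorders the entries in each list; the size of the deviations is unchanged, which is the property a measure of association for categorical data should have.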
Therefore, finding a difference between the conditional distributions for the different art periods does not, by itself, provide evidence against independence. These differences have to be sufficiently large, relative to the sampling error. The way to investigate this systematically is by conducting a hypothesis test. The null hypothesis in the test is that the two variables are independent, while the alternative hypothesis is that the variables are dependent. Now, if the null hypothesis were true, there would be something interesting about the contingency table. In that case, we could estimate the joint probabilities on the basis of the marginal probabilities. This is in fact the very definition of independence for random variables. So if art period and the objects that are painted are independent, this is the probability table that we expect. Each joint probability follows from the multiplication of the corresponding marginal probabilities. By multiplying each of the probabilities by the total number of observations, we get the frequencies. This equation shows that there's a shortcut: we can also directly calculate the expected frequency in each cell by multiplying the corresponding marginal frequencies and dividing by the total sample size. So now we know what to expect if the null hypothesis were true, and we can measure the difference between the actual observations and their expected values. If the difference is big, it will lead to a rejection of the null hypothesis. It was getting late. I finished my explanation to Paul by emphasizing that the differences between observed and expected values will be different each time you take a different sample. But fortunately, you wouldn't have to take multiple samples to reach a conclusion, because the hypothesis test I introduced would still allow you to make a solid statement on the basis of a single sample, while taking the uncertainty due to sample variability into account.
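The shortcut for the expected frequencies, row total times column total divided by the total sample size, can be sketched like this. The counts are invented for illustration and the names are hypothetical. The sketch stops at the cell-by-cell differences and does not compute a test statistic.

```python
# Hypothetical observed counts, invented for illustration only
# (rows: periods, columns: fruit, flowers, mixed).
observed = [
    [12, 5, 3],
    [10, 10, 5],
    [6, 15, 9],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence, each joint probability is the product of the two
# marginal probabilities: (row_total / n) * (col_total / n).
# Multiplying by n gives the shortcut: row_total * col_total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Differences between observed and expected frequencies, cell by cell.
differences = [
    [o - e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

for exp_row, diff_row in zip(expected, differences):
    print([round(e, 2) for e in exp_row], [round(d, 2) for d in diff_row])
```

The expected table has the same row and column totals as the observed table, so the differences in every row and every column sum to zero; only their size says anything about dependence.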
We agreed to continue our conversation later that week to see how this hypothesis test would work and if it would, in fact, be helpful for Paul in his study.
I'll try to summarize my explanations to Paul for you. The relative frequencies at which two categorical variables co-occur are conveniently summarized in a contingency table. If the observations come from a representative random sample, these relative frequencies can be interpreted in terms of joint and marginal probabilities. Conditional probabilities can be calculated as well. A difference between conditional and marginal probabilities is an indication that the variables may be dependent. Another way to measure the dependence is by comparing the observed frequencies with those that are expected if the variables were independent. This result can be generalised by using a hypothesis test whose null hypothesis is that the variables are independent. It would use the combined information on the differences between the observed and expected frequencies as the test statistic, and if this overall difference were big, the null hypothesis would be rejected.