We begin by discussing the simplest learning problem, which is that of estimating a set of parameters. And the key building block for parameter estimation is what's called maximum likelihood estimation, so we're first going to define that and then build on it in the context of more interesting problems. So, let's go back to the simplest possible learning problem, where we are generating data from a biased coin, which is a Bernoulli distribution: the probability that X takes the value one is defined by some parameter theta, and the probability that it takes the value zero is therefore one minus theta. And let's imagine that we're given access to a data set D, which is a bunch of instances x1 up to xM that are sampled IID from the distribution P. Let's remind ourselves what IID means. The first I means that the tosses are independent of each other, and the ID means that they're identically distributed, which means that they're all sampled from the same distribution P. And our goal is to take D and reconstruct theta. So, let's think about this as a probabilistic graphical model; in fact, it's a probabilistic graphical model that we've seen before, when we first talked about plate models. Here we have a parameter theta, and a bunch of replicates of the same experiment, which are these x's over here, which are sampled from theta independently. And if you think about this as the graphical model written over here, we can see that each xi depends on theta, and that they're conditionally independent of each other once theta is known. So, they're independent and identically distributed given theta. And if you think about this as a graphical model, it had better have CPDs, and the CPD for the m-th toss given theta is that x takes the value x1 with probability theta and the value x0 with probability one minus theta, and that's a completely legitimate CPD. It's a CPD whose parent is theta, which is the parent of the x variable; theta is a continuous variable, but we've dealt with those before. So, now that we've specified this as a probabilistic graphical model, we can go back and think about how maximum likelihood estimation would work. The goal is to find the parameter theta, which lies in the interval [0,1], that predicts D well. So, the quality of a particular parameter theta is how well it predicts D, which can also be viewed as how plausible D is given that particular value of theta. So, let's try and understand what that means. We can ask: what is the probability of the data D that we saw, given a particular value of theta? And since the instantiations xi are conditionally independent given theta, we can write this as the product over all of the M instances of the probability of the m-th instance given theta. And so now, let's think about what that is in the context of a concrete example. Let's assume that we have tossed this coin five times and we have three heads and two tails, or three ones and two zeros. So, if we actually write down what this probability function looks like, we can see that it's going to be the probability of getting heads given theta, times the probability of getting tails given theta, times the probability of the second tail, times the probability of heads, times another probability of heads. And the probability of the first head given theta is theta, then one minus theta for the tail, one minus theta, theta, theta.
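As an illustrative sketch (not part of the lecture), here is how one might evaluate this likelihood numerically in Python. The function and variable names are my own, and the five tosses mirror the heads-tails-tails-heads-heads example above.

```python
import numpy as np

def bernoulli_likelihood(theta, data):
    """Likelihood of an IID Bernoulli sample: the product over m of P(x_m | theta)."""
    m_h = sum(data)            # number of ones (heads)
    m_t = len(data) - m_h      # number of zeros (tails)
    return theta ** m_h * (1.0 - theta) ** m_t

# The five tosses from the example: heads, tails, tails, heads, heads.
data = [1, 0, 0, 1, 1]

# Evaluate the likelihood on a grid of candidate parameters theta in [0, 1].
thetas = np.linspace(0.0, 1.0, 101)
likelihood = bernoulli_likelihood(thetas, data)
print(thetas[np.argmax(likelihood)])   # prints 0.6, i.e. three heads out of five tosses
```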
Or theta to the power of three times one minus theta to the power of two, and that is exactly the function that is drawn over here. Now, if we're looking for the theta that predicts D well, we just defined that as the theta that maximizes this function; if you draw a line from this maximum down to the bottom, you can see that the function is maximized at a point near 0.6, which, not surprisingly, is the same as the three heads that we saw over the five total tosses. But generalizing that, let's assume that we have observed MH heads and MT tails, and we want to find the theta that maximizes the likelihood. Just as in the simple example that we had, this likelihood function is going to have theta appearing MH times and one minus theta appearing MT times, and so that's going to give us a likelihood function that looks just like the likelihood function that we saw on the previous slide. And if we think about how we can go about maximizing such a function, we usually take the following steps. First, it's convenient to think not about the likelihood but rather about what's called the log-likelihood, denoted by a lowercase l, which is simply the log of this expression over here, and that has the benefit of turning a product into a summation. That gives us a simpler optimization objective, but one that has the exact same maximum. And we can now go ahead and do the standard thing for maximizing a function like this, which is differentiating the log-likelihood and solving for theta. That's going to give us an optimum which is exactly what we would expect: it's the fraction of heads among the total number of coin tosses, MH over MH plus MT, and that's the maximum of the log-likelihood function and therefore of the likelihood as well. Now, an important notion in the context of maximum likelihood estimation, which is also important when we develop it further, is the notion of a sufficient statistic. When we computed theta in the coin toss example, we defined the likelihood function as an expression that has this form. And notice that this expression didn't care about the order in which the heads and the tails came up; it only cared about the number of heads and the number of tails. That was sufficient in order to define the likelihood function, and therefore sufficient for maximizing the likelihood function. And so, in this case, MH and MT are what's known as sufficient statistics for this particular estimation problem, because they suffice in order to reconstruct the likelihood function and therefore to optimize it. More generally, a function s of the data set is a sufficient statistic if it's a function from instances to a vector in R^k with the following property: for any two data sets D and D prime, and any possible parameter theta, if s of D is equal to s of D prime, then the likelihood functions for those two data sets are the same. So, what is s of D? s of D is the sum of the sufficient statistics over all of the instances in D. So, let's make this a little bit more crisp. What we're trying to do is look at a bunch of data sets, for example all sequences of coin tosses, and summarize each data set using a smaller, more compact notion, a set of statistics that throws out aspects of the data that are not necessary for reconstructing its likelihood function.
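Here is a minimal sketch of this idea, again with illustrative names of my own choosing: the sufficient statistics MH and MT, the log-likelihood they define, and the closed-form maximizer MH / (MH + MT) obtained by setting the derivative to zero.

```python
import numpy as np

def bernoulli_suff_stats(data):
    """Sufficient statistics of the Bernoulli model: the counts (M_H, M_T)."""
    m_h = int(np.sum(data))
    return m_h, len(data) - m_h

def bernoulli_log_likelihood(theta, m_h, m_t):
    """log L(theta : D) = M_H * log(theta) + M_T * log(1 - theta)."""
    return m_h * np.log(theta) + m_t * np.log(1.0 - theta)

def bernoulli_mle(m_h, m_t):
    """Closed-form maximizer: theta* = M_H / (M_H + M_T)."""
    return m_h / (m_h + m_t)

m_h, m_t = bernoulli_suff_stats([1, 0, 0, 1, 1])
theta_star = bernoulli_mle(m_h, m_t)                  # 0.6
# Sanity check: the log-likelihood at theta* is at least as large as at nearby values.
assert all(bernoulli_log_likelihood(theta_star, m_h, m_t) >=
           bernoulli_log_likelihood(t, m_h, m_t)
           for t in (0.4, 0.5, 0.7))
```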
So, let's look at the multinomial example, which is the generalization of the Bernoulli example that we had before. Assume that what we have is a set of measurements of a variable X that takes on k possible values. Then, in this case, the sufficient statistics, just like before where we had the number of heads and the number of tails, are the number of times each of the k values came up, so we have M1, M2, up to Mk. So, for example, if you're looking for sufficient statistics for a six-sided die, you're going to have M1 up to M6, representing the number of times that the die came up one, two, three, four, five, and six. And what is the sufficient statistic function in this case? Remember, it's a tuple of dimension k, where k is the number of different values of the variable, and the sufficient statistic for the value xi is a vector with a one only in the i-th position and zeros everywhere else. So, if we sum up s of x[m] over all instances in the data set, we're going to get a vector where we only get a one contribution in the i-th position when the m-th data instance takes that particular value, and so that's going to be M1, M2, up to Mk. And this is a sufficient statistic because the likelihood function can then be reconstructed as the product over i of theta i raised to the power Mi, where theta i is the parameter for X equals xi. Let's look at a different example: the sufficient statistics for a Gaussian distribution. So, as a reminder, this is a one-dimensional Gaussian distribution that has two parameters, mu, which is the mean, and sigma squared, which is the variance, and it can be written in the following form, which is one that you've seen before. We can rewrite the exponent in the following way: we basically expand the quadratic term in the exponent, and we end up with a likelihood function that has minus x squared times a term, plus x times a term, minus a constant term. And the sufficient statistics for a Gaussian can now be seen to be x squared, x, and one, because when we multiply P of x over multiple data cases, we end up adding up the x squared values for the different data cases, adding up the x's for the different data cases, and the last term just counts the number of data cases. So, s of the data set D is going to be the sum over m of x[m] squared, the sum over m of x[m], and M, the number of data cases, and from that we can reconstruct the likelihood function. How do we now perform maximum likelihood estimation? So, as we talked about, we want to choose theta so as to maximize the likelihood function, and if we just go ahead and optimize the functions that we've seen on the previous slides, then for the multinomial the maximum likelihood estimate turns out to be simply a fraction: for the value xi, it's the fraction of times xi appears in D, that is, Mi over M, which again is a perfectly natural estimate to use. For a Gaussian, we end up with the following as the maximum likelihood estimate: the mean is the empirical mean, that is, the average over all of the data cases, and the standard deviation is the empirical standard deviation. So, to summarize, maximum likelihood estimation is a very simple principle for selecting among a set of parameters given a data set D.
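The same pattern can be sketched for the multinomial and Gaussian cases. The code below is an illustration under my own naming and encoding choices (values of X encoded as 0 through k-1, NumPy for the arithmetic): it computes the maximum likelihood estimates directly from the sufficient statistics described above.

```python
import numpy as np

def multinomial_mle(data, k):
    """The sufficient statistics are the counts M_1..M_k;
    the MLE of theta_i is the fraction M_i / M."""
    counts = np.bincount(data, minlength=k)   # values of X encoded as 0..k-1
    return counts / counts.sum()

def gaussian_mle(x):
    """The sufficient statistics are sum(x), sum(x**2), and M; the MLE mean is
    the empirical mean and the MLE variance is the empirical variance."""
    m = len(x)
    s1, s2 = np.sum(x), np.sum(np.square(x))
    mu = s1 / m
    sigma_sq = s2 / m - mu ** 2
    return mu, sigma_sq

print(multinomial_mle(np.array([0, 2, 2, 1, 0, 2]), k=3))   # [1/3, 1/6, 1/2]

rng = np.random.default_rng(0)
print(gaussian_mle(rng.normal(loc=2.0, scale=1.5, size=10_000)))  # roughly (2.0, 2.25)
```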
We can compute that maximum likelihood estimate by summarizing a data set in terms of sufficient statistics, which are typically considerably more concise than the original data set D. And so, that provides us with a computationally efficient way of summarizing a data set so as to perform the estimation. And it turns out that for many parametric distributions that we care about, the maximum likelihood estimate has an easy-to-compute closed-form solution given the sufficient statistics.