[MUSIC] In the last module we discussed k-means as an algorithmic approach to clustering. Now, in the rest of the course, we're going to turn to probabilistic model-based approaches to clustering and compare and contrast them with k-means. So why are we interested in a probabilistic approach in the first place? Well, to motivate this we can return to our task where we're interested in learning a user's preferences over a set of topics. To do this, we talked about clustering documents into some set of topics and then using the user's feedback on the articles they read, combined with this clustering structure, to learn that user's preferences over these topics. But we're rarely in situations where the data are clearly separable. So, for example, in this scenario imagine there's an article where it's not really clear: is this article similar to the other articles about world news, or is it more similar to the other articles about science? Maybe it's somewhere in between. But let's say, for argument's sake, that it's a little bit closer to the science topic, this orange cluster. Does that mean, when the user said that they did not like this article, that we should only include that information when we update what we know about the user's preference for Cluster 4? Or should it also somehow weigh into what we think the user likes about Cluster 2? So what we see is that these hard assignments made by k-means don't really tell the full story. Instead, what we'd like to do is somehow capture the uncertainty about whether this article is assigned to Cluster 2 or to Cluster 4, and then use that uncertainty to extract extra information from the data set. So, for example, we can think about doing fancier things when we go to learn the user's preferences over a set of topics. And there are other limitations to k-means as well.
So remember that when k-means goes to assign a data point to a cluster, it only takes the cluster centers into consideration. So what that means is it's taking nothing about the clusters' shapes into account. You can think of this as being equivalent to assuming that the clusters all have spherically symmetric shapes, and that they're all equal-sized spheres. So what that means is that all dimensions get weighted equally, and it's the same weighting in all of these different clusters when we go to compute that score. So the only thing that differentiates between these clusters is where those clusters are located, their cluster centers. You could, however, modify k-means to use weighted Euclidean distance, or scaled Euclidean distance, instead of just the vanilla form of Euclidean distance that we're showing here. But there are two things to note about this. One is that it still assumes that all of the clusters have the same axis-aligned elliptical shape, meaning that when we're weighing the importance of the different dimensions, it's exactly the same weighting in each of the different clusters. And the other is that the set of weights has to be prespecified. As a result, k-means struggles in the following cases. One is if the clusters have different spreads, especially if they're not very well separated. So, for example, take the following case where there are these two little orange and blue clusters, and one really big green cluster. And when I say big, I don't mean big in terms of the number of data points in the cluster, I mean in terms of the spread of points within that cluster. And what the k-means algorithm is going to do in this case, if we fix the cluster centers at these three locations here and look at our Voronoi tessellation, that's what these lines are drawing, is say that everything in this region gets assigned to this orange cluster.
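To make the scaled Euclidean distance mentioned above concrete, here's a minimal sketch; the vectors and weights are made-up illustrative values, not anything from the lecture's figures. Note that the same prespecified weight vector is applied for every cluster, which is exactly the limitation being described.

```python
import numpy as np

def scaled_euclidean(x, center, weights):
    """Weighted (scaled) Euclidean distance: sqrt(sum_j w_j * (x_j - c_j)^2).
    With all weights equal to 1, this reduces to vanilla Euclidean distance."""
    diff = np.asarray(x) - np.asarray(center)
    return np.sqrt(np.sum(weights * diff ** 2))

x = np.array([1.0, 2.0])
c = np.array([0.0, 0.0])

# Equal weights: ordinary Euclidean distance, sqrt(1 + 4) = sqrt(5)
print(scaled_euclidean(x, c, np.array([1.0, 1.0])))

# Down-weight the second dimension: sqrt(1 + 0.25 * 4) = sqrt(2)
print(scaled_euclidean(x, c, np.array([1.0, 0.25])))
```

Because `weights` is fixed ahead of time and shared across all clusters, this variant still assumes every cluster has the same axis-aligned elliptical shape.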
And everything in this region here gets assigned to this blue cluster, and only the things here get assigned to this green cluster. So what's happened is it's taken observations that appeared in this green cluster and made assignments of these data points to the wrong clusters, to the orange or to the blue cluster. Another issue is when you have overlapping clusters, because again, if we think about fixing our cluster centers here, k-means is simply going to split these clusters up as follows. And what happens now isn't necessarily incorrect labeling. I mean, it is incorrect to take these blue points here and assign them to the orange cluster, but it's not really clear what should be done in these situations: if you see a data point here without knowing ahead of time whether it's from one cluster or the other, you don't have any other distinguishing information. Really, in these regions, I'll just shade them in here, you should have uncertainty about whether the observation belongs to the green cluster or the orange cluster. But k-means is making hard assignments. K-means makes hard assignments when clearly there's uncertainty. And the other thing that we said k-means does not account for is if the clusters have different shapes, or different orientations of those shapes, especially if it's not known ahead of time what the weightings on the different dimensions should be. So, for example, if we look at this observation right here, k-means says, well, the distance to this blue cluster is smaller. So this observation will get assigned to the blue cluster, and it's not taking into consideration that this green cluster has a lot of spread along this axis.
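The hard-assignment behavior just described can be sketched in a couple of lines; the centers and the query point here are made-up stand-ins for the "tight blue cluster" and "spread-out green cluster" in the figure, not actual lecture data.

```python
import numpy as np

# Two hypothetical cluster centers: a tight "blue" cluster at the origin
# and a "green" cluster at x = 4 with (unmodeled) large spread along x.
centers = np.array([[0.0, 0.0],   # blue center
                    [4.0, 0.0]])  # green center

# A point slightly closer to blue in plain Euclidean distance, even though
# green's spread might make green the better explanation.
x = np.array([1.8, 0.0])

# k-means hard assignment: nearest center wins; cluster shape is ignored.
dists = np.linalg.norm(centers - x, axis=1)
assignment = np.argmin(dists)
print(dists, assignment)  # distances 1.8 vs 2.2, so the point goes to blue
```

The assignment depends only on the two distances; nothing about each cluster's spread enters the computation, which is exactly the failure mode on these figures.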
And so what we'd like is an approach that can account for the shape of the cluster and say, well, instead of just looking at the cluster centers, I want to take into consideration that this green cluster has this spread along this dimension, and maybe it's more likely that the observation is in the green cluster than in the blue cluster, which is this nice, tight, spherical cluster. So these limitations motivate a probabilistic approach using something called a mixture model. And what a mixture model can do is provide what are called soft assignments of data points to clusters, and these soft assignments are going to account for uncertainty in the cluster assignments. So, for example, maybe we have a document, and the soft assignment says there's a 54% chance that the document is about world news, a 45% chance that it's about science, a 1% chance that it's about sports, and a 0% chance that it's about entertainment. And mixture models can also account for cluster shapes, not just the cluster centers. And finally, another really important thing is that we're going to end up learning the parameters of these mixture models, and in doing so we can actually learn the weightings on the different dimensions. So, for example, we can learn, per cluster, the relative importance of the different words in the vocabulary. [MUSIC]
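A minimal sketch of what a soft assignment looks like under a Gaussian mixture model: the means, variances, and mixing weights below are made-up illustrative numbers (how to actually learn them is the subject of the coming lectures). The point sits closer to the first mean, yet the second component's larger spread gives it the higher responsibility.

```python
import numpy as np

# Hypothetical two-component 1-D Gaussian mixture (parameters are assumed,
# not learned): a tight component at 0 and a spread-out component at 4.
means = np.array([0.0, 4.0])
variances = np.array([1.0, 4.0])   # second component has more spread
pis = np.array([0.5, 0.5])         # mixture (mixing) weights

def responsibilities(x):
    """Posterior probability of each component given x: a soft assignment."""
    densities = np.exp(-(x - means) ** 2 / (2 * variances)) \
                / np.sqrt(2 * np.pi * variances)
    unnorm = pis * densities
    return unnorm / unnorm.sum()

r = responsibilities(1.8)
print(r)  # soft assignment vector; entries sum to 1
```

Unlike the hard k-means assignment, the output is a probability vector over clusters, so downstream steps (like updating a user's topic preferences) can weigh feedback by these probabilities instead of crediting a single cluster.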