The goal of cluster analysis is to group, or cluster, observations into subsets based on the similarity of their responses on multiple variables. Observations that have similar response patterns are grouped together to form clusters. That is, the goal is to partition the observations in a data set into a smaller set of clusters, where each observation belongs to only one cluster. Cluster analysis is an unsupervised learning method, meaning there is no specific response variable included in the analysis.

Cluster analysis is often used in marketing to develop targeted advertising campaigns. For example, cluster analysis using data on the types of groceries people buy can group people together based on their buying patterns. The results can be used to develop individual purchase profiles and to target specific advertisements and incentives to people depending on their buying patterns. This is often referred to as market segmentation. The same approach can be used in any number of fields. For example, health researchers might use cluster analysis to identify individuals at greatest risk for health problems, and to develop targeted health messages based on patterns of health behavior.

With cluster analysis, what we want is to obtain clusters that have less variance within clusters and more variance between clusters. That is, we want observations within a cluster to be more similar to each other than they are to observations in other clusters. Less variance within clusters means that the observations within a particular cluster are similar to each other in their pattern of responses on the clustering variables. More variance between clusters means the clusters are distinct. That is, there is little to no overlap between the clusters.

Cluster analysis can also be used as a data reduction technique that allows you to take many variables and reduce them down to a single categorical variable that has as many categories as the number of clusters identified in the data set.
This categorical variable can then be used in other analyses to predict some response variable of interest.

There are many types of clustering algorithms. In this course we are going to focus on K-means cluster analysis, which is one of the most commonly used clustering algorithms. K-means cluster analysis is conducted by creating a space that has as many dimensions as the number of input variables. The number of input variables is denoted p, so a p-dimensional space is formed. The distance between observations in this space is used to determine how the data are partitioned. Cluster analysis measures the distance between points in the p-dimensional space, and groups together those observations that are close to each other.

There are multiple ways to calculate the distance between observations. The most commonly used distance measure in K-means cluster analysis is called Euclidean distance. The Euclidean distance measure determines how close observations are to each other by drawing a straight line between pairs of observations, and calculating the distance between them based on the length of this line.

To demonstrate how cluster analysis works, let's look at a simple example. Say we have p=2 variables and we want to create K=2 clusters. We can plot these observations in a single scatter plot representing the two-dimensional space, where each of these points represents an observation in the data set. The first step in the K-means cluster analysis is to randomly choose two points in the two-dimensional space. These points will start as the center, or centroid, for each of the two clusters. Then the distance between each point and each cluster centroid is calculated. For example, we can take this point and measure the distance between it and one centroid, and then between it and the second centroid. Then the point is assigned to the cluster that has the smallest distance between it and the cluster centroid.
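The distance calculation and assignment rule just described can be sketched in a few lines of Python. This is only an illustration, not the software used in the course: the function names `euclidean_distance` and `nearest_centroid`, and the coordinates of the point and the two centroids, are made up for the example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points in p-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest distance to the point."""
    distances = [euclidean_distance(point, c) for c in centroids]
    return distances.index(min(distances))

# Made-up values for illustration: p = 2 variables, K = 2 centroids
centroids = [(1.0, 1.0), (5.0, 4.0)]
point = (2.0, 1.5)

# The point is assigned to whichever centroid is closest
cluster = nearest_centroid(point, centroids)
```

The same two functions work unchanged for any number of variables, because the distance is summed over however many coordinates each point has.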
So in this case, this point is closest to the blue centroid, so it is assigned to the blue cluster. This is done for every point, or observation.

After the initial formation of the clusters, the centroid for each cluster is recalculated based on the location of the points that were assigned to it. Specifically, it is relocated to the mean of the points assigned to it, which is the place at which the sum of the squared distances between those points and the centroid is smallest. Then the process starts all over again by calculating the distances between the points and the new locations of both centroids, reassigning points to the closest centroid, and then relocating each centroid to the place where the sum of the squared distances for the points assigned to the cluster is at the minimum. This process is repeated over multiple iterations until the location of the centroids doesn't change very much.

During the process, observations that were originally assigned to one cluster may end up in a different cluster. For example, these observations here were previously assigned to the blue cluster, but ended up being assigned to the red cluster by the time the iterations were complete. Likewise, these observations here were previously assigned to the red cluster, but ended up being assigned to the blue cluster in the end.

So let's run a cluster analysis to see how it works. For this example, we want to use a wide array of variables that represent characteristics of adolescents that could have an impact on school achievement. The ultimate goal may be to develop a few interventions to improve academic achievement that are targeted to the needs of specific student population subgroups, based on the characteristics of the students in each cluster. To do this we are going to use a subset of the variables from the data set that we used for the last regression analysis example.

The variables we are going to use for clustering include two binary variables measuring whether or not the adolescent had ever used alcohol or marijuana.
Quantitative clustering variables include alcohol problems and a scale measuring engagement in deviant behaviors such as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs and skipping school. Also included are scales for violence, depression, self-esteem, parental presence, parental activities, family connectedness, and school connectedness.

We're going to validate the clusters by excluding grade point average from the cluster analysis. Grade point average is a measure of academic achievement, so we should expect to see differences between the clusters in grade point average. If we do find differences, then we have some evidence that the clusters are valid in terms of identifying subgroups of adolescents with specific patterns of academic achievement related to their characteristics on the clustering variables, and that taking these patterns into consideration might lead to targeted interventions that are more successful in improving academic achievement.
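To tie the pieces together, here is a minimal sketch of the iterative procedure described earlier, followed by the validation idea of comparing a held-out variable across clusters. Several things here are assumptions and not the course's actual code or data: the function name `k_means`, the choice to start from K randomly chosen observations, the rule of stopping once assignments no longer change (a stricter version of "the centroids don't change very much"), and the synthetic data with made-up variable names `risk`, `connectedness`, and `gpa`.

```python
import random
import statistics

def k_means(points, k, max_iter=100, seed=0):
    """Assign each point to one of k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen observations
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid
        # (squared Euclidean distance gives the same ordering as the distance)
        new_labels = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(statistics.fmean(dim) for dim in zip(*members))
    return labels, centroids

# Synthetic, made-up data: two simulated groups measured on (risk, connectedness)
rng = random.Random(42)
group_a = [(rng.gauss(1, 0.3), rng.gauss(5, 0.3)) for _ in range(20)]
group_b = [(rng.gauss(5, 0.3), rng.gauss(1, 0.3)) for _ in range(20)]
points = group_a + group_b

# Held-out validation variable: NOT used for clustering
gpa = [rng.gauss(3.4, 0.2) for _ in range(20)] + [rng.gauss(2.4, 0.2) for _ in range(20)]

labels, centroids = k_means(points, k=2)

# Validation: compare mean GPA across the clusters that were found
mean_gpa = {
    j: statistics.fmean([g for g, lab in zip(gpa, labels) if lab == j])
    for j in sorted(set(labels))
}
```

If the clusters recover the two simulated groups, the entries of `mean_gpa` should differ noticeably, which is the kind of evidence for cluster validity described above. In a real analysis you would also want to standardize the clustering variables first, so that no single variable dominates the distance calculation.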