We left off in the previous lesson in the midst of clarifying definitions that are essential in health care data analytics, so that you can communicate clearly with others in this field. We will resume where we left off with important data mining and predictive modeling definitions. This lesson is part two of this topic.

Now, let's consider classification. Classification seeks to predict which of a set of classes an individual or observation belongs to. The classes are usually mutually exclusive: for example, a hospital readmission or no readmission. Additional examples would be a legitimate email versus a spam email, or a group of patients classified as having cancer or not having cancer. Classification also includes class probability estimation, which makes it possible to assign a probability that an individual is associated with a particular class. As we'll see in our future lessons about risk stratification, we can use a data set that has a column with preexisting categories of readmission or no readmission. We can then use variables, or attributes so to speak, to predict this target or label. If we have a new data set without the labels and we want to predict which patients are likely to have a readmission, we can use our previous model to score the new patients for the likelihood of this outcome. A small code sketch of this idea appears a bit later in this lesson.

Regression is another type of predictive tool, but the objective is not to assign individuals to classes. The outcome or target is a numeric quantity, and the objective is to predict how much of something will occur. For example, for a given patient, how many days will they likely spend in the hospital? Statisticians often use the term regression, but this is sometimes confusing. The term is not related to the common definition of regressing back to some earlier state. Instead, regression is about predicting quantities, not transitions in the everyday sense of the word.

Clustering is an approach that groups data into groups, or clusters. The clusters are formed by similarity, but unlike similarity matching, there is not a specific purpose for the groups; the data themselves drive the groupings. The approach assumes the data elements within a cluster are more similar to each other than to data elements in other clusters. In addition, there is usually no clear outcome, target, or dependent variable. With cluster analysis, it is necessary to apply labels and meaning to the clusters, and of course, this is not always an easy task.

Now, allow me to define co-occurrence grouping. There are many names for this general approach, including frequent itemset mining, association rules mining, and market basket analysis. The term I usually use for this approach is association rules mining. However, the term market basket analysis might be the most concrete and easy to conceptualize. In market basket analysis, the business problem might be to identify which types of food or products people tend to buy together. For example, when people shop, what items tend to end up in their basket? Overall, these methods allow us to see which items or objects co-occur or are correlated. We can use the market basket approach to see which types of diseases or drug prescriptions co-occur.
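To make the classification and class probability estimation ideas more concrete, here is a minimal sketch in Python. It assumes the pandas and scikit-learn libraries are available, and the column names and values (age, prior admissions, length of stay, readmission) are made up purely for illustration rather than drawn from a real data set.

    # Minimal sketch of classification with class probability estimation.
    # Assumes pandas and scikit-learn are installed; all column names and
    # values are hypothetical and used only for illustration.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Labeled historical data: "readmission" is the target (1 = readmitted, 0 = not).
    history = pd.DataFrame({
        "age":          [72, 55, 64, 80, 47, 69],
        "prior_admits": [3, 0, 1, 4, 0, 2],
        "los_days":     [8, 2, 5, 10, 1, 6],
        "readmission":  [1, 0, 0, 1, 0, 1],
    })

    X = history[["age", "prior_admits", "los_days"]]
    y = history["readmission"]

    # Fit a simple classifier on the labeled data.
    model = LogisticRegression()
    model.fit(X, y)

    # Score new, unlabeled patients for the likelihood of readmission.
    new_patients = pd.DataFrame({
        "age":          [60, 78],
        "prior_admits": [1, 3],
        "los_days":     [4, 9],
    })
    readmit_probability = model.predict_proba(new_patients)[:, 1]  # class probability estimates
    readmit_class = model.predict(new_patients)                    # predicted class, 0 or 1
    print(readmit_probability, readmit_class)

The probabilities are the class probability estimates described above, and the last few lines show how a previously fitted model can score new patients who do not yet have labels.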
Let me change course a little bit and discuss temporal data analysis. The categorization and definitions in this section are borrowed from a nice data mining book titled Data Mining: Concepts and Techniques, by the authors Han and Kamber. This is a complex and growing area of data mining and statistics, and I suspect that other books have slightly different categories and definitions. There are three main types of temporal data mining: stream, time series, and sequence. Let's consider each in turn.

First, stream data, or data streams, are created by real-time surveillance systems. These include communication networks, Internet traffic, online transactions in the financial markets or retail industry, electric power grids, industrial production processes, and scientific and engineering experiments. For instance, in health care, we have the electrocardiogram, abbreviated EKG or ECG. These are tests that measure the electrical activity of the heartbeat. Stream data often flow in and out of computer systems continuously, often with variable update rates.

Second, time series data are sequences of events that are usually repeated measures over time. The measures are usually taken at equal time intervals, such as minutes, hours, days, or years. Some good examples of time series data include economic forecasting, stock market evaluations, weather forecasting, and quality control measures in industry. In health care, an example of time series data would be a heart monitor that keeps track of a patient's electrical signals. Another example might be blood pressure readings that are taken every 15 minutes in a hospital setting; a small code sketch of this kind of data appears at the end of this lesson.

Finally, sequence data are similar to time series in that this data type also has events ordered in time. However, unlike time series, these data may or may not have specific units of time, such as hours or days. An example is the order in which people navigate web pages: the sequence of clicks that brings people to specific landing pages may matter. In health care, it is often important to know when treatments and prescriptions occur in relation to other outcomes or adverse events.

Now, let's shift again and do a quick review of visual analytics. The human eye has an amazing capacity to process visual information. If data are presented in a clear and dynamic manner, patterns in the data may appear. Visual analytics is also a powerful tool for data cleaning. For example, researchers have found that visualization can vastly improve the first steps of the analytical process associated with data cleaning and transformation efforts. When data are visualized, users can rapidly identify patterns that suggest data entry errors or duplicate records. Visualization tools provide powerful, interactive visual interfaces that allow users to review the quality of the data, specify patterns of interest to explore, and then interact with the data visually. With new visualization software and dashboard technology, there is often an opportunity to create visualizations. But just as with the data modeling algorithms, the raw data often need to be extracted and transformed before they can easily be processed into clear visualizations.

With that quick reminder about what data scientists often focus on, we are now prepared to think about how to extract data from health care environments. These extracts can then be used as valid inputs to these various models and techniques. Okay. We'll see you in the next lesson.
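As a small supplementary note before the next lesson, here is a minimal sketch of the time series idea in Python, assuming the pandas library is available. The blood pressure readings are made up for illustration; they simply mimic values recorded every 5 minutes that are then averaged into 15-minute intervals, as in the hospital example above.

    # Minimal sketch of time series data; assumes pandas is installed.
    # The systolic readings are made-up values, one every 5 minutes.
    import pandas as pd

    timestamps = pd.date_range("2024-01-01 08:00", periods=9, freq="5min")
    readings = pd.DataFrame({"systolic": [128, 131, 127, 135, 140, 138, 125, 122, 124]},
                            index=timestamps)

    # Resample the equally spaced readings into 15-minute averages.
    quarter_hour_means = readings.resample("15min").mean()
    print(quarter_hour_means)

Because these readings arrive at equal time intervals, resampling and similar time series tools apply directly; sequence data without fixed units of time would need a different treatment.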