Welcome back, everyone. The lessons in this module have shown the importance of starting with well-defined business or research questions and then formulating a clear analytical plan. This sometimes involves using agreed-upon steps such as the SEMMA methodology. The last lesson illustrated how simply extracting high-quality data can be a complex and labor-intensive task. This lesson continues with the theme that it is not only the E, or extract, aspect of ETL that is time consuming. Sometimes the T, or transformation, step can also take a lot of effort. For example, statisticians or data scientists sometimes do not want to perform this labor-intensive task themselves. It is also possible that they do not have the SQL skills or the knowledge about the complex claims or EMR data to do this work efficiently. At the end of this lesson, you will be able to describe the types of ETL processes that are common among healthcare analytical projects. Let's get started.

Before going into more detail about creating analytical tables, it is worth mentioning that many data projects have an end goal of extracting data for manual review. For example, I have helped with healthcare research projects in which I write complex algorithms to extract a cohort of patients. But in the end, the researchers only want a one-column file that simply has a list of patient ID numbers. With that list, the researchers can go into the front-end GUI, or graphical user interface, of the electronic health record system and study various attributes of the patients in detail. After these in-depth reviews and assessments, they can then decide whether they want to recruit a patient into a study. I have worked on other projects in which business analysts want to review samples of claims or clinical data so that they can learn about a process or a behavior. However, at least at that stage, they do not care very much about the structure or the quality of the data that has been extracted. Thus, for them it is not a problem to have some errors, incomplete data, or duplicate rows. The analysts can ignore these and use their eyes to see what patterns or data they want to see. In sum, sometimes the final product is not an analytical table that needs to be analyzed by a statistician or a data scientist. All that said, it can still be helpful to let business analysts know that with more detailed transformation efforts by the programmer, the file can become more readable or accessible to humans and computers. It is a trade-off about who has more time and what value might be gained from performing more programming.

This course assumes that most students have a background in computer programming, or at least are familiar with how programmers transform data to meet various analytical needs. If you have done this work, you know that it is not very difficult to write the code that makes these transformations. What can be difficult is the time it takes to get the work done. Moreover, non-programmers often do not appreciate how much work it takes to extract and transform data to make curated, or analytic-ready, files. This can lead to some frustration among team members.

Let me now mention a few common transformation tasks. In the next sections, I will say a little more about some of these areas, and I will include small SQL sketches along the way. First, it is often useful to aggregate numeric data. As mentioned earlier, healthcare analysts often face the challenge of having many records per unit of analysis. For example, if a patient-level file is required, it might be necessary to aggregate, or collapse, numerous lab values into one number. This can be done by taking the average or some other measure of central tendency. Or consider claims data in which we need one row per beneficiary. An analyst could summarize their monthly contributions as the total amount contributed over a few years.
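To make this concrete, here is a minimal sketch of that kind of aggregation. The table name, columns, and the lab-code filter are illustrative assumptions on my part, not from any particular source system.

```sql
-- A minimal sketch: collapse many lab rows into one row per patient.
-- lab_results, its columns, and the HBA1C filter are hypothetical names.
SELECT
    patient_id,
    AVG(result_value) AS mean_lab_value,   -- one summary number per patient
    COUNT(*)          AS n_results         -- how many values were collapsed
FROM lab_results
WHERE lab_code = 'HBA1C'                   -- restrict to one test at a time
GROUP BY patient_id;
```

The same GROUP BY pattern, with SUM in place of AVG, would turn a beneficiary's monthly amounts into a multi-year total.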
Second, let's discuss selecting specific records. Like aggregation, when dealing with categorical data across many rows, sometimes an analyst needs to pick one data row. A subject matter expert interested in comorbidities among patients might ask that an analyst pick the three most common diseases in the claims or EHR data files. Then new columns could be added to show the presence of specific conditions; a sketch of this appears below. Of course, the problem with collapsing categorical data is that sometimes the number of categories can grow very large. Then the team must choose which categories are the most important.

Finally, recoding data values is often necessary. Data values found within fields might differ from what the analytical plan requires. For example, a business problem might look at race and ethnicity but only need a few different categories. Assuming the data source has many categories for race and ethnicity, such as Pacific Islander or Hispanic American Indian, it might be necessary to write some if-then statements to recode the data into the proper mappings; a sketch of this also appears below.

Also consider transposing. In healthcare databases, tables often capture information by adding additional rows, so there can be many rows per event. But sometimes analysts prefer, or algorithms require, that the data be transposed into a wide format. Analysts often must shift the structure of the data to satisfy these requirements, a task that is not difficult but often takes some time. I will come back to transposing with an example near the end of this lesson.

Now allow me to go back to the topic of groupers. We learned earlier how powerful grouper models can be in healthcare analytics. Open-source and commercial options allow analysts to categorize or group both cell values and rows of data into broader categories. These more general categories help when performing predictive modeling and other analytical tasks. Yet to be useful, the groupers need to be applied to the new raw data. With groupers that categorize codes, such as ICD codes, it is often relatively easy to add the grouper file into the analytical environment. Then a programmer can use a programming language such as SQL or SAS, or any language for that matter, to join the data based on the various codes of interest. The additional grouper columns are then available for use. This approach applies to the Clinical Classifications Software, the BETOS procedure groupers, and other tools that categorize codes. It can be much more challenging to apply groupers that look at multiple data fields and aggregate data across multiple records. For example, the Symmetry Episode Treatment Groups can summarize huge numbers of claims per beneficiary and create much more usable episodes. However, the analyst needs to take great care to pre-process the input data so that the algorithms can correctly create the intended output. This means an analyst needs to carefully review the documentation about how to prepare the raw data for transformation by the grouper system.
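Here is the record-selection sketch I promised. It finds each patient's three most frequent diagnosis codes using a window function; the diagnosis_claims table and its columns are hypothetical names I am using for illustration.

```sql
-- A sketch: pick each patient's three most frequent diagnosis codes.
-- diagnosis_claims, patient_id, and dx_code are hypothetical names.
WITH dx_counts AS (
    SELECT
        patient_id,
        dx_code,
        COUNT(*) AS n_claims,
        ROW_NUMBER() OVER (
            PARTITION BY patient_id
            ORDER BY COUNT(*) DESC      -- most frequent code gets rank 1
        ) AS dx_rank
    FROM diagnosis_claims
    GROUP BY patient_id, dx_code
)
SELECT patient_id, dx_code, n_claims
FROM dx_counts
WHERE dx_rank <= 3;                     -- keep only the top three per patient
```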
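And here is a recoding sketch. The CASE expression plays the role of the if-then statements I mentioned; the source values and target labels are examples only, since the real mapping must come from the analytical plan.

```sql
-- A sketch: recode a detailed race/ethnicity field into a few broad
-- analysis categories. The values and labels here are examples only.
SELECT
    patient_id,
    CASE
        WHEN race_ethnicity IN ('Pacific Islander', 'Native Hawaiian')
            THEN 'NHPI'
        WHEN race_ethnicity IN ('Hispanic', 'Hispanic American Indian')
            THEN 'Hispanic'
        ELSE 'Other/Unknown'             -- catch-all for remaining values
    END AS race_eth_recoded
FROM patients;
```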
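For the simpler, code-based groupers, the join itself can be short once the lookup file is loaded. This sketch assumes a hypothetical ccs_lookup table that maps ICD codes to broader categories; a real grouper distribution documents its own file layout, so always check its documentation.

```sql
-- A sketch: attach grouper categories by joining a loaded lookup file
-- onto raw claims. ccs_lookup and its columns are hypothetical names.
SELECT
    c.claim_id,
    c.patient_id,
    c.dx_code,
    g.ccs_category,                      -- the broader grouper category
    g.ccs_description
FROM claims AS c
LEFT JOIN ccs_lookup AS g
       ON c.dx_code = g.icd_code;        -- match on the raw diagnosis code
```

A LEFT JOIN keeps claims whose codes did not match, which is handy for spotting gaps in the lookup file.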
Analyzing sequences is an important task, but it often takes additional effort to transform the data into the proper structure. Temporal data are often stored in relational tables. Thus an analyst might need to use SQL to retrieve events from different tables and then associate the timestamps from those tables. For example, a list of prescriptions might be one type of event, and the researcher may want to include the timestamp for the day the prescription was created or filled.

The business analysts or researchers might also require an analytical table in which specific events have been chosen based on their sequence. For example, they may want to know, among cancer patients who died, when each patient took their first oral chemotherapy drug. Then they may want to look at other clinical events separated in time by a specific number of days before and after the chemotherapy drug was taken. Finally, they may ask that this information be put into one row per patient so that they can analyze it with any statistical software. Okay, even though it is relatively easy for a good programmer to retrieve all the raw data and timestamps, it is much harder to analyze the data and create actionable information from it. The code might be complex to write, and the programmer might get lost in the code without knowing what they should be looking for. I am an experienced programmer, and my greatest challenges often relate to processing temporal data. Even when the requirements are clear enough, I have found that data quality problems and data formatting issues add to the challenge. Overall, the relational database model makes temporal queries difficult, and since these are often at the core of healthcare analyses, additional tools and methods are critical. Software extensions such as T-SQL and other temporal database query languages solve some of the problems, but these are still complex. It is especially challenging to analyze intervals between events. I will show a small sketch of a temporal query after this section.

And finally, let's get back to transposing data. It is very common for extracted data to not be in the structure that the analyst requires. Sometimes this is driven by a preference for how the analyst likes to look at the data, but often there are constraints imposed by various data mining algorithms. For example, a regression or predictive modeling algorithm in software such as R may expect a format that I sometimes refer to as tall rather than wide. A tall file is one in which there are multiple rows, and then a field or fields with specific data values that indicate different attributes. An example of this would be a field that indicates various outcomes for an analysis: each row would indicate whether the outcome is event A, B, C, or D, for example. Alternatively, a wide data set would have separate fields to capture the outcomes A, B, C, and D. Then each row would have a flag, maybe 0 or 1, to indicate the presence or absence of these variables. If the algorithm is expecting a specific format, a programmer will have to transpose the data to fit this requirement. In addition to constraints driven by algorithms, the analysts or subject matter experts often have preferences about the structure of the data. I have found that people with less programming experience often prefer one row per patient. In this way, they can use Excel or other spreadsheets to understand the data with their eyes, or by using simple counts and methods of sorting and filtering the data.
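Here is the temporal sketch. It finds each patient's first oral chemotherapy fill and then pulls the clinical events within 30 days on either side of it. The table names, columns, and drug-class filter are hypothetical, and the INTERVAL date arithmetic is ANSI SQL whose exact syntax varies by database (SQL Server, for instance, uses DATEADD).

```sql
-- A sketch: find each patient's first oral chemotherapy fill, then pull
-- clinical events within 30 days before or after it. All names here are
-- hypothetical; date arithmetic syntax varies by database.
WITH first_chemo AS (
    SELECT patient_id, MIN(fill_date) AS first_fill_date
    FROM prescriptions
    WHERE drug_class = 'ORAL_CHEMO'
    GROUP BY patient_id
)
SELECT
    f.patient_id,
    f.first_fill_date,
    e.event_type,
    e.event_date
FROM first_chemo AS f
JOIN clinical_events AS e
  ON e.patient_id = f.patient_id
 AND e.event_date BETWEEN f.first_fill_date - INTERVAL '30' DAY
                      AND f.first_fill_date + INTERVAL '30' DAY;
```

Collapsing that output down to one row per patient would then use the transpose pattern shown next.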
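And here is a sketch of transposing from tall to wide using conditional aggregation, which works in plain SQL without a vendor-specific PIVOT feature. The outcome_events table and the outcome codes A through D are illustrative.

```sql
-- A sketch: transpose a tall outcomes table (one row per patient per
-- outcome) into a wide table with one 0/1 flag per outcome.
SELECT
    patient_id,
    MAX(CASE WHEN outcome = 'A' THEN 1 ELSE 0 END) AS outcome_a,
    MAX(CASE WHEN outcome = 'B' THEN 1 ELSE 0 END) AS outcome_b,
    MAX(CASE WHEN outcome = 'C' THEN 1 ELSE 0 END) AS outcome_c,
    MAX(CASE WHEN outcome = 'D' THEN 1 ELSE 0 END) AS outcome_d
FROM outcome_events
GROUP BY patient_id;                     -- one row per patient
```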
Okay, excellent. This wraps up this lesson. You should now be able to describe the types of extract, transform, and load, or ETL, processes that are common among healthcare analytical projects. You should also better appreciate the attention and effort required to perform ETL, and the importance of doing so in healthcare. Excellent, we have completed this module. I hope that you have a greater appreciation for the importance of analytical data plans and the challenges associated with extracting and transforming data for analytical healthcare projects. Thank you very much.