This week we look at the role of data science in exposome research. Exposome research brings together different sources of data... which can vary in size and complexity. This requires novel data science solutions. Data science is the process of formulating a quantitative question that can be answered with data... collecting and cleaning the data... analyzing the data... and communicating the answer to a relevant audience.

So, what makes exposome data science special? In exposome research, we are not thinking about just a single exposure and a single outcome... but always about multiple variables that relate to each other, as well as a number of different outcomes. For example, you've seen that different chemical compounds will produce a host of internal reactions... which interact with each other, and that can produce highly varying health effects... and, ultimately, diseases, ranging from cancer to cardiovascular disease.

Exposome research is also special in that it always combines very different kinds of information... both conceptually and practically. For example, the spectral signature of various compounds can be measured in the air, using sensors... as well as in the body: metabolomics. At the same time, genetics plays an important role... perhaps leading to data about people’s genetic mutation patterns... gene expression, and methylation. On top of that, we need to account for what a person is doing in their life. Are they a smoker? What do they eat? How much exercise are they getting? And do they work in a chemical plant, for instance?

And of course, we also need an outcome... which may be closer in the underlying causal process to these factors, or more distant. For example, you might want to look at a person’s metabolism... or the expression of genes we know are related to some disease. Or you might want to look at whether this person develops some type of cancer later in life.
Even more likely, you will want to look at all of these factors together, as a complex system. In this lecture, I will discuss two special challenges posed by exposome data science.

So, here is the first special challenge: the enormous diversity of the variables that exposome science needs you to analyze. Almost always, there are many more variables than there are study subjects. More columns than rows. P is larger than N.

Let’s take an oversimplified situation to understand what is really going on when this happens. First, consider a single explanatory variable x and a single outcome, let's call it y. Maybe x is a single chemical exposure and y is a disease. In an idealized situation, you could think of explanatory variables as... buttons you might press to see what happens to the outcome. Let’s only consider the situation where we see a difference in outcomes. Then we might observe this. Or this. If we see this, barring any coincidence... I can predict that whenever you have x, then you’ll also have y. Vice versa, if I got this, then I’ll predict that not having x gives me y. This is the simplest possible prediction model you can think of. No uncertainty, just one x and two rows.

But now consider what happens when I have two explanatory variables, but still only two rows. Now I might get this. Or this. Either way, I might guess that x2 = 1 leads to y... but it could just as well have been x1 that caused this result. So, the problem of figuring out a causal factor is fundamentally underdetermined. This is a situation known in statistics as an underdetermined, or ill-posed, problem. In the context of regression analysis, you may have heard the term multicollinearity. High-dimensional problems are always collinear, always underdetermined… unless you use some special ideas from statistics. The collection of these tricks is called regularization. It may seem like dubious magic... that a fundamentally underdetermined problem can be solved by some “trick”.
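To make the two-row, two-predictor situation concrete, here is a minimal sketch in Python with NumPy. The data and variable names are my own construction, not from the lecture; the point is only that two rival models, one blaming x1 and one blaming x2, fit the same data equally well:

```python
import numpy as np

# Two subjects (rows), two candidate exposures (columns).
# x1 and x2 always co-occur, so the data cannot tell them apart.
X = np.array([[1.0, 1.0],
              [0.0, 0.0]])
y = np.array([1.0, 0.0])

# Two different coefficient vectors that both fit the data exactly:
beta_a = np.array([1.0, 0.0])   # "x1 causes y"
beta_b = np.array([0.0, 1.0])   # "x2 causes y"

print(np.allclose(X @ beta_a, y))  # True
print(np.allclose(X @ beta_b, y))  # True
```

Since both models reproduce the outcome perfectly, no amount of staring at these two rows can decide between them: that is the underdetermination in miniature.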
The basis of high-dimensional data analysis is always that, implicitly or explicitly... we bring in some additional information we think we have about the problem. In the example, you could have some prior knowledge from biology... about which chemical, which x, was the more likely cause, for instance. Or you could presuppose that there are at most a certain number of causal factors that go into this outcome. Let's say, just one.

Here are some of the main approaches to high-dimensional data analysis. The first one is setting an upper bound on the overall complexity of the problem. You could do that through penalization... that is, limiting the maximum explanatory power of any one predictor. That's the approach used in, for example, ridge and lasso regression. You could also use feature sampling, which is used in random forests. And you could use other techniques... such as dropout, early stopping, or data augmentation... which are popular in neural networks. You could also introduce subject-matter knowledge. For example, the chemical compound knowledge that we mentioned earlier. You could use feature selection, or you could use Bayesian priors.

Now, all of these techniques are outwardly very different... but they end up doing something mathematically very similar to the model you are using. Their implementation details depend on the model you are using, and they're too technical for this talk. But if you’re interested, please see the recommended material for this lecture. Whenever you encounter these terms, remember: there is no magic. High-dimensional data analysis is just a way to allow you to make accurate predictions... in spite of collinearity among the predictors. To really be sure about any causal relationships, a lot of additional work is needed.

This brings us to the second special challenge of exposome research. Exposome research is not just putting together a lot of different variables haphazardly.
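As an illustration of the penalization idea, here is a sketch of ridge regression in its closed form on simulated data with more predictors than subjects. The simulated model and the penalty strength `alpha` are arbitrary choices of mine, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # p > n: more columns than rows
X = rng.normal(size=(n, p))
# Only the first "exposure" truly drives the outcome:
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Ordinary least squares has no unique solution when p > n, but adding
# the ridge penalty (alpha * I) makes the linear system well-posed:
alpha = 1.0
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# The penalized fit still tracks the outcome on the training data:
mse = np.mean((X @ beta_ridge - y) ** 2)
print(beta_ridge.shape)  # (100,)
```

The lasso replaces the squared (L2) penalty with an absolute-value (L1) penalty, which has no closed form and is fitted iteratively (for example with scikit-learn's `Lasso`); its practical appeal is that it sets many coefficients exactly to zero.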
Rather, these variables’ effects are supposed to flow into each other in a biologically meaningful way. How can we examine these relationships systematically? Well, there are a few ways you could go about doing this.

The first way is by combining multiple studies coming at the problem from different angles. For example, you could examine each sequential biological step individually... and study those steps in separate studies... to build up a comprehensive picture of the disease pathway and its relation to the environment. A different approach would be to measure all of these variables at the same time... or as many of them as possible, and then model them all simultaneously. A set of techniques that tries to do this are structural equation models... or, in their modern formulation, structural causal models. In principle, there is a lot of advantage in trying to build up... such a comprehensive picture of the pathways and environmental influences in one go. But this type of approach is still being developed for high-dimensional problems... and for data with heterogeneous error structures.

I would like to conclude with the following. Whatever approach you choose... it's important to remember that there is no silver bullet for figuring out high-dimensional causality. The most important thing is that each scientific claim... based on one of the approaches mentioned in this lecture... is accompanied by a careful program of evaluation and triangulation. In the end, the best way to approach exposome data science... is therefore to build up the evidence from as many different angles as we can. Large datasets and collaborative data collection projects are indispensable to that. And data scientists can help think through the issues and analyses. But subject-matter knowledge also remains indispensable.
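To give a flavor of the structural-causal-model view, here is a toy simulation of a simple chain, exposure → metabolite → disease. The coefficients and noise levels are made up for illustration; the property it demonstrates is that, in a linear chain, the total effect of the exposure on the disease is the product of the step effects:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# A toy structural causal model: exposure -> metabolite -> disease.
exposure = rng.normal(size=n)
metabolite = 0.8 * exposure + rng.normal(scale=0.5, size=n)   # step 1
disease = 0.6 * metabolite + rng.normal(scale=0.5, size=n)    # step 2

# In a linear chain, the total exposure -> disease effect is the
# product of the step coefficients: 0.8 * 0.6 = 0.48.
slope = np.polyfit(exposure, disease, 1)[0]
print(round(slope, 2))
```

Real structural equation modelling goes much further, fitting all of the step coefficients jointly from data and testing whether the assumed pathway structure is consistent with the observed covariances; this sketch only shows the underlying "effects flow into each other" idea.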