Next, let's look at a paper on generating discrete patient records using GANs, published at the Machine Learning for Healthcare Conference in 2017. It tries to address a very important problem in applying machine learning in the healthcare domain: the data are quite challenging to access. Massive amounts of data are being collected by healthcare organizations such as hospitals, but because of privacy concerns it is very difficult to access that data.

How do we deal with this privacy concern? Because the data are sensitive, one approach is to de-identify the data with some perturbation approach, for example adding noise and hiding identifiers so that you cannot re-identify who the patients are. But that does not always work: a lot of privacy research has demonstrated that de-identified data can still be re-identified with minimal effort. The other approach, which is what we are talking about here, is to generate synthetic data. The challenge there is: can you generate data realistic enough to support downstream applications such as machine learning?

That is what we want to achieve with the medGAN model: generate realistic EHR data using a GAN. We want the generated examples to be high quality. Concretely, if the synthetic data can support training real machine learning models, and those models perform similarly to models trained on real data, then we consider the quality of the synthetic data to be high. We also have a privacy-preservation criterion: at a high level, if we can guarantee that it is practically impossible to gain knowledge about real patients from the synthetic dataset, then we are doing a good job of preserving the privacy of the original patient records. In this talk, we focus on the first part, demonstrating that the synthetic records are of high quality. The paper also presents experiments demonstrating the privacy-preservation aspect.

Let's look at the medGAN architecture. It is pretty much a GAN, with the small twist of adding an autoencoder at the beginning. In the architecture illustration, x indicates the real data and z indicates the random noise that is the input to the generator. Then we have the encoder and decoder, which make up the autoencoder; G indicates the generator and D indicates the discriminator.

Let's talk about the different components in more detail, starting with the encoder. The data we want to generate are discrete, indicating whether each specific diagnosis is present or not in the patient record, and that is actually quite difficult for a GAN to handle. So this work first uses an autoencoder to reduce the dimensionality of the original discrete data; through this autoencoding, the embedding, that is, the output of the encoder, becomes continuous and lower dimensional, and that is what the GAN learns to generate. The autoencoder is pre-trained on the real data using a cross-entropy reconstruction loss, where x_i indicates real patient record i, m is the number of patient records in the training dataset, and x'_i indicates the reconstruction from the autoencoder, that is, the decoder applied to the encoder's output for that record. We want the reconstruction to be close to the original input patient record.
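Below is a minimal sketch of this pre-training step in PyTorch, just to make the idea concrete. The single-hidden-layer encoder and decoder, the embedding size of 128, and the Adam settings are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

num_codes, emb_dim = 700, 128          # ~700 diagnosis categories; embedding size is an assumption

encoder = nn.Sequential(nn.Linear(num_codes, emb_dim), nn.Tanh())
decoder = nn.Sequential(nn.Linear(emb_dim, num_codes), nn.Sigmoid())
recon_loss = nn.BCELoss()              # cross-entropy between x_i and its reconstruction x'_i
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def pretrain_step(x):
    """One pre-training step on a batch x of binary patient records, shape (batch, num_codes)."""
    x_prime = decoder(encoder(x))          # reconstruction x'_i
    loss = recon_loss(x_prime, x.float())  # averaged over the records in the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After pre-training, the decoder is reused in the GAN loop to map the generator's continuous embeddings back to the discrete code space.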
Then we have the generator. The generator takes random noise and passes it through a neural network in which the noise goes through a fully connected layer with batch normalization and a ReLU, plus a skip connection. Stacking multiple such layers gives a deep neural network, and its output is the embedding of a patient record. That embedding is decoded with the decoder from the pre-trained autoencoder in order to get something closer to discrete, and that becomes the input to the discriminator.

Then we have the discriminator. The discriminator is actually pretty straightforward: it takes the decoded result and treats it as a fake sample, takes samples directly from the original training data x as real examples, and does binary classification between the two. A rough sketch of how these pieces fit together is shown below.
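Here is a minimal PyTorch sketch of that wiring, from noise to embedding to decoded record to discriminator score. The block structure follows the description above (linear layer, batch normalization, ReLU, skip connection), but the embedding size, the number of blocks, and the discriminator shape are illustrative assumptions, and the decoder below stands in for the pre-trained one from the autoencoder sketch.

```python
import torch
import torch.nn as nn

num_codes, emb_dim = 700, 128

class GeneratorBlock(nn.Module):
    """One generator layer: linear -> batch norm -> ReLU, plus a skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim, bias=False)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, z):
        return z + torch.relu(self.bn(self.fc(z)))   # skip connection adds the input back

generator = nn.Sequential(GeneratorBlock(emb_dim), GeneratorBlock(emb_dim))

discriminator = nn.Sequential(                       # binary classifier over the code space
    nn.Linear(num_codes, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid())

decoder = nn.Sequential(nn.Linear(emb_dim, num_codes), nn.Sigmoid())  # pre-trained in practice

# One forward pass: noise -> embedding -> decoded record -> real/fake score.
z = torch.randn(32, emb_dim)                         # batch of random noise
fake_records = decoder(generator(z))                 # decode embeddings into per-code probabilities
d_fake = discriminator(fake_records)                 # discriminator scores for the fake batch
```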
Let's look at the experiments. They were conducted on real datasets, including data from Sutter Health, and the goal is to show that the synthetic data are realistic. They do this in two ways. The first is dimension-wise probability: show that the marginal distribution of each dimension of the synthetic data matches the marginal distribution of the training data. The second is dimension-wise prediction: take each dimension, or feature, as a target and try to build a classifier for it; if a classifier trained on the synthetic data performs well, that suggests the synthetic data are good. Let's look at these two tasks in more detail.

The dimension-wise probability check, the marginal distribution check, is just a scatter plot of the marginal distribution for every dimension. There are over 700 dimensions in the dataset, each corresponding to a diagnosis category. The x-axis is the probability in the real data and the y-axis is the probability in the synthetic data. Points on the diagonal, as with medGAN, indicate that the marginal distributions match really well. Then we have a bunch of other methods; VAE is another one we will talk about in this lecture, and there are other models as well. I want to point out the first one, adding random noise: even with that perturbation, the small-probability events still come out wrong a lot of the time. But overall, since this is the simpler task, many of the models do pretty well, which is expected. The top row shows the different variations of GAN that were tried: at the end you get very good performance from medGAN, but a traditional GAN without the autoencoder component we introduced can actually be pretty arbitrary.

That is one task. The other is dimension-wise prediction, which is a much harder task. The setting is this: we pick one dimension, or one feature, from the dataset as our target, then use the remaining dimensions as features for a logistic regression classifier. On one side, we take the real training data, train a model M1, and apply it to a held-out test set of real patient data to compute its performance. We can also do a medGAN pass: take the real training data, train medGAN, have medGAN generate a set of fake examples, use only the fake examples to train a second model, M2, then apply M2 to the same real held-out test set and compare the performance of M1 and M2. If they perform similarly, the fake examples capture pretty much the same level of information as the real training data, at least for a machine learning task, and that is what we want. We do this dimension by dimension: pick one dimension as the target, use the remaining dimensions as features, and build a predictive model, iterating over all dimensions. With 900 dimensions, we build 900 models. (A rough sketch of this protocol is included at the end of this section.)

Here is the performance of this dimension-wise prediction. Each dot again indicates a particular target, a disease in this case; the x-axis is the performance of the model trained on real data, and the y-axis is the performance of the model trained on the synthetic, algorithm-generated data. medGAN performs similarly in most cases, slightly worse than the model trained on real data but not by much; that is why its scatter lies along the 45-degree line. The other baseline methods perform much worse; even VAE can perform considerably worse on this task. Anything below the 45-degree line means the model trained on the synthetic data performs worse than the model trained on the real data. I want to point out that adding noise, the standard perturbation approach that just introduces more variance, also stays around the 45-degree line, but it is not much better than what we have; actually, not much better at all. Another commonly used way to generate synthetic data is independent sampling, dimension by dimension, which keeps the per-dimension statistics the same but does not capture the correlations across dimensions, and that is why that model performs pretty badly. So the conclusion is that the data generated by medGAN actually work great for machine learning tasks like classification.

We also did a qualitative evaluation. The idea is to take 50 random fake examples generated by medGAN, mix them with 50 real examples, shuffle them, and present them one by one to a real doctor, asking the doctor to score each record on a scale of 1 to 10, with 1 being the most unrealistic patient record and 10 being the most realistic. From the resulting box plot you can see that medGAN's synthetic records and the real records are actually very difficult to distinguish, even for a human; the averages are very close, with only a few exceptions. There are some outliers that are very easy to spot as fake, but in most cases the records are quite realistic to a human doctor.

To conclude, medGAN is an extension, or variant, of the GAN model for generating high-quality synthetic patient records that can support machine learning tasks, demonstrated with the dimension-wise prediction task and also a qualitative evaluation by a human doctor. They also have separate experiments on privacy preservation confirming that it is very difficult for an attacker to gain additional information about real patients; the details are in the paper. That's GAN.
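As referenced above, here is a minimal sketch of the dimension-wise prediction check, using scikit-learn. The array names, the `max_iter` setting, and the use of F1 as the score are illustrative assumptions; the idea is just to train one classifier per code on real data and one on synthetic data and score both on the same real held-out set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def dimension_wise_prediction(real_train, synth_train, real_test):
    """For each code d, train M1 on real data and M2 on synthetic data
    (remaining codes as features, code d as target), then score both on
    the real held-out set. Returns an array of (M1 score, M2 score) pairs."""
    num_dims = real_train.shape[1]
    scores = []
    for d in range(num_dims):
        cols = [c for c in range(num_dims) if c != d]      # remaining dimensions as features
        if len(np.unique(real_train[:, d])) < 2 or len(np.unique(synth_train[:, d])) < 2:
            continue                                       # skip codes that never vary in a training set
        m1 = LogisticRegression(max_iter=1000).fit(real_train[:, cols], real_train[:, d])
        m2 = LogisticRegression(max_iter=1000).fit(synth_train[:, cols], synth_train[:, d])
        y_true = real_test[:, d]
        scores.append((f1_score(y_true, m1.predict(real_test[:, cols])),
                       f1_score(y_true, m2.predict(real_test[:, cols]))))
    return np.array(scores)   # pairs near the diagonal mean the synthetic data trained comparable models
```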