Okay, so that's a VAE, the Variational Autoencoder. Next, we'll talk about an application of VAEs to drug discovery. First, a little background on drug discovery, which is a very important domain. Drug discovery and development is the process of identifying new drugs that are safe and effective for treating a certain disease. This process is long and expensive: it takes more than 10 years and over 2 billion US dollars to develop a single drug. The process starts with drug discovery, which identifies the target to treat, usually a protein, and then finds molecules that interact with this target; those molecules are called hits. After that we get into the drug development phase, where the overall goal is to determine whether some of the hits are safe and effective for treating the disease. It starts with in vitro lab tests, followed by in vivo tests with animals to determine the properties of the drug candidates. If those succeed, the candidates move on to human trials, which proceed in phases. Phase one tests the safety of the drug, phase two tests its efficacy, and phase three tests its effectiveness compared to the current standard of care. There is also a phase four, conducted after the drug is approved, to monitor long-term side effects and safety. So it's a very long process. This lists the average time span of each phase and the average amount of money spent in each phase, so it's also a very expensive process. Deep learning, or more generally AI and machine learning, has good applications in all of these phases. In this context we're going to give you an example from the drug discovery phase, that is, identifying promising molecules, which is the very first step. So: molecule generation with VAEs. This is the paper that covers this particular topic.
The idea is actually quite simple. The goal is: if we have seen many, many molecules in our database and we know their properties, can we learn a generative model to generate new molecules with similar properties? The molecules are stored in a format called SMILES, which we'll cover on the next slide. Their idea is to use a Variational Autoencoder, a VAE, to map all the known molecules into some latent space Z, and then go through the decoding process to get, potentially, a new SMILES string, that is, a new molecule. That's what they have done in this work. Now a little detail about the data. The input molecules are represented as strings, using a standard called the Simplified Molecular-Input Line-Entry System, which is very common, essentially the standard, in chemistry for representing molecules. These are called SMILES strings. To construct a SMILES string, you essentially do a traversal over the molecule graph in a depth-first-search manner, following the atoms, so you turn the molecule, which is a graph structure, into a line, into a string. SMILES also has special characters indicating structures like rings, because molecules can contain rings, and it uses brackets to handle branches. That's a SMILES string. From our perspective, though, you can simply consider the input to be a sequence of characters. To use a deep learning model on this sort of data, you convert this sequence of characters into a sequence of one-hot encodings. This is a pretty standard way to encode a sequence; we have used it for text data and we're using it here as well. So a SMILES string represents a particular molecule graph, and for each of its characters we assign a one-hot encoding.
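As a concrete illustration, here is a minimal sketch of this character-level one-hot encoding. The vocabulary below is a small hypothetical subset; in practice you would build it from all characters observed in the training SMILES strings.

```python
import numpy as np

# Hypothetical character vocabulary; a real model builds this from the
# full set of SMILES characters seen in the training data.
VOCAB = ["C", "c", "N", "O", "=", "(", ")", "1", "2"]
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_encode(smiles: str) -> np.ndarray:
    """Turn a SMILES string into a (sequence_length, vocab_size) one-hot matrix."""
    encoding = np.zeros((len(smiles), len(VOCAB)), dtype=np.float32)
    for pos, ch in enumerate(smiles):
        encoding[pos, CHAR_TO_IDX[ch]] = 1.0  # the character's dimension is 1
    return encoding

# Benzene: a ring, so its SMILES uses the ring-closure digit "1".
x = one_hot_encode("c1ccccc1")
print(x.shape)        # (8, 9): sequence length 8, vocabulary size 9
print(x.sum(axis=1))  # each row sums to 1: exactly one dimension is "hot"
```

Each row of the resulting matrix has a single 1 in the dimension corresponding to that character, and 0 everywhere else, exactly as described above.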
So the corresponding dimension will be 1, and the rest will be 0. This example string has 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 characters, so you get a sequence of length 12. That is used as the input to the neural network, the VAE model, and that's how we represent the input. For the generation process, molecule generation with a VAE, the input is a SMILES string treated as a sequence. A recurrent neural network, an RNN, encodes the sequence into a fixed-size embedding. Because the input string can be arbitrarily long, it's easier if we can encode it into a fixed-length vector, and the RNN is a good model for sequences; that's what they used to convert the sequence into a fixed-length vector. That vector is then mapped into the embedding space; keep in mind that for a VAE we use a normal distribution in the latent space. Then a decoder network maps the latent code back to another SMILES string. In addition to the standard VAE, they also have a separate path from the latent space: they learn a neural network to predict properties of those molecules, which gives us more control over what type of molecule we're looking for, since we're looking for molecules with certain properties. This gives us an additional supervision signal that we can bring into the VAE model. For the experiments they used two open datasets: one is called QM9, with over 108,000 molecules, and the other is the ZINC database, with over 250,000 drug-like molecules. The idea is to use these two datasets separately to train two different models and see how those models work in terms of generating new molecules. For the generated molecules they can check the properties, and hopefully these are similar to the ones in the database. So here are the results. This table covers the two different datasets.
Each row is a method or the original data, and they compare three different properties: logP, the synthetic accessibility score, and the drug-likeness score, QED. For the original data you can see the corresponding scores and variances, and you can see that the molecules generated with the VAE achieve similar scores to the original data; on both datasets, including QM9, you get similar scores. The other algorithm, GA, stands for genetic algorithm, which was the previous state of the art that chemists used to generate molecules. Genetic algorithms are also quite powerful, but it turns out that in this particular case the VAE model can generate molecules that are closer to what the original data looks like, whereas for the genetic algorithm, at least on the QM9 dataset, you can see that the difference can be quite big on some properties. This shows that the VAE model seems to capture the properties of the original training data better and is able to generate molecules that are more realistic compared to the input data.
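To make the architecture described earlier concrete, here is a minimal sketch of the encoder side: a vanilla RNN folds a one-hot sequence of any length into a fixed-size vector, which is mapped to the mean and log-variance of the latent distribution; a sample z is drawn via the reparameterization trick, and a separate head predicts a molecular property from z. All sizes and weights below are hypothetical stand-ins for trained parameters, and the decoder RNN, which would unroll from z to emit a SMILES string character by character, is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's actual hyperparameters differ.
VOCAB_SIZE, HIDDEN, LATENT = 35, 64, 16

# Randomly initialised weights stand in for trained parameters.
W_xh = rng.normal(0, 0.1, (VOCAB_SIZE, HIDDEN))  # input -> hidden
W_hh = rng.normal(0, 0.1, (HIDDEN, HIDDEN))      # hidden -> hidden
W_mu = rng.normal(0, 0.1, (HIDDEN, LATENT))      # hidden -> mean of q(z|x)
W_lv = rng.normal(0, 0.1, (HIDDEN, LATENT))      # hidden -> log-variance
W_prop = rng.normal(0, 0.1, (LATENT, 1))         # latent -> predicted property

def encode(one_hot_seq):
    """Fold an arbitrary-length one-hot sequence into a fixed-size hidden
    state with a vanilla RNN, then map it to (mu, logvar) of q(z|x)."""
    h = np.zeros(HIDDEN)
    for x_t in one_hot_seq:                # works for any sequence length
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h @ W_mu, h @ W_lv

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) with the reparameterization trick."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

# Fake 12-character one-hot input, mirroring the length-12 example above.
seq = np.eye(VOCAB_SIZE)[rng.integers(0, VOCAB_SIZE, size=12)]
mu, logvar = encode(seq)
z = reparameterize(mu, logvar)
prop = z @ W_prop  # the extra supervised head: property predicted from z
print(z.shape, prop.shape)  # (16,) (1,)
```

The property head is what gives the extra supervision signal: during training its prediction error is added to the usual VAE reconstruction and KL terms, which encourages the latent space to organize molecules by the properties we care about.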