Let's talk about another generative model called the variational autoencoder, or VAE. Like a GAN, a VAE is a generative model for creating realistic data samples, but a VAE is not only a deep-learning model; it also has a very strong statistical foundation, so it sits at the intersection of deep learning and probabilistic graphical models. VAEs have many healthcare applications, including molecule generation and medical imaging analysis. To understand the VAE, let's first review a more primitive version called the autoencoder, which we talked about in earlier lectures. An autoencoder is a neural network that performs dimensionality reduction: an encoder network maps the input vector x into a lower-dimensional latent code, the embedding h, and a decoder network maps h back to an output r, which is a reconstruction of the original input x. This is unsupervised in the sense that the final loss is some version of reconstruction error: we measure the difference between the reconstruction r and the input x, using, for example, a squared Euclidean loss or a cross-entropy loss. That's the autoencoder we have talked about. The problem with autoencoders is that they often overfit: they map the input to some arbitrary point in the latent space, and as long as the decoder can reconstruct the input from that point, the training objective is satisfied. As a result, you will not see good structure in the latent space. What I mean by that is if you perturb the latent code and apply the same decoder, you will not get something similar to the original data. In an autoencoder, the latent code h is very sensitive: it is a good compression of the input, but not a generalizable latent space, so it is not good for generating realistic samples. The VAE is designed to fix that.
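The encode-compress-reconstruct loop just described can be sketched in a few lines. This is a minimal NumPy sketch with a single linear layer on each side; the dimensions and weight initializations are illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 8-dimensional input x, 2-dimensional latent code h.
d_in, d_latent = 8, 2
W_enc = 0.1 * rng.normal(size=(d_latent, d_in))  # encoder weights
W_dec = 0.1 * rng.normal(size=(d_in, d_latent))  # decoder weights

def encode(x):
    return W_enc @ x            # latent code h

def decode(h):
    return W_dec @ h            # reconstruction r

x = rng.normal(size=d_in)
h = encode(x)
r = decode(h)

# Reconstruction error: squared Euclidean distance between r and x.
loss = float(np.sum((r - x) ** 2))
```

A real autoencoder would use deeper, nonlinear networks and train the weights by gradient descent on this loss; the point here is only the encode-decode-compare structure.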
The idea of the VAE is that we still use this encoder-decoder strategy in an unsupervised way. The main difference is that instead of encoding the high-dimensional input vector x as a single point, we encode it as a distribution. In the VAE's case, it is a Gaussian distribution: we map x to the distribution's parameters, the mean and variance, and then we sample from that distribution. The sample, a vector z, is the latent embedding. Once we have z, we decode that sample to generate a reconstructed version x'. Compared to the autoencoder, the additional step is this sampling process. The loss function is also different, which we'll talk about. In this deep-learning viewpoint, we have two neural networks. One is the encoder network, also called the inference network, which tries to learn the conditional distribution q_theta(z | x). The decoder, also called the generative network, tries to learn another conditional distribution, p_phi(x | z). Let's look at this in more detail. The encoder learns the parameters of a Gaussian distribution, the mean mu_x and variance Sigma_x; these parameters are functions of the input vector x. That's the encoder network q_theta. We also have a prior distribution p_phi(z) over the latent embedding z, which we assume to be a Gaussian with zero mean and unit variance. The loss function for a given data point x has two terms. The first term is the negative log-likelihood of p_phi(x | z); this first term is like the reconstruction error in the autoencoder setting.
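As a sketch, the forward pass just described (encode x to Gaussian parameters, sample z, decode) might look like the following. The linear layers and sizes are illustrative stand-ins for real encoder and decoder networks, and the encoder predicts a log-variance so that the variance stays positive.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_latent = 8, 2
# Illustrative linear "networks"; a real VAE would use deeper nonlinear ones.
W_mu = 0.1 * rng.normal(size=(d_latent, d_in))
W_logvar = 0.1 * rng.normal(size=(d_latent, d_in))
W_dec = 0.1 * rng.normal(size=(d_in, d_latent))

def encoder(x):
    """Inference network q_theta: map x to the parameters of a Gaussian over z."""
    mu = W_mu @ x
    log_var = W_logvar @ x      # predicting log-variance keeps sigma^2 positive
    return mu, log_var

def decoder(z):
    """Generative network p_phi: map a latent sample z to a reconstruction x'."""
    return W_dec @ z

x = rng.normal(size=d_in)
mu, log_var = encoder(x)
sigma = np.exp(0.5 * log_var)
z = rng.normal(loc=mu, scale=sigma)   # sample z ~ N(mu, sigma^2)
x_prime = decoder(z)
```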
The loss has a second term, which is the KL divergence between q_theta(z | x) and the prior p_phi(z). This second term serves as a regularization term. In the standard autoencoder, you don't have this second term; you just minimize the negative log-likelihood, but that can overfit. With this regularization term, we want not only the reconstruction error to be small (equivalently, the log-likelihood to be large), but also the encoder's distribution to be close to the standard normal distribution. That's the VAE from the deep-learning perspective. There is also an important trick called the reparameterization trick. It is an implementation trick, but it is essential here. If you think about this as a neural network, with an encoder network followed by a decoder network, a naive implementation would go like this: given an input x, pass it through the encoder network to get the distribution parameters mu_x and Sigma_x; then sample from the Gaussian with that mean and variance to get the vector z; then pass z through the decoder network to get the reconstruction. But there's a problem: if you do this sampling during the training phase, the gradient cannot pass back through the sampling step. It is no longer straightforward backpropagation because of the sampling step you introduced into the VAE model. Luckily, the reparameterization trick addresses this issue by doing something different but equivalent. Instead of having just the input x, you have the input x and another input epsilon, where epsilon is random noise drawn from a normal distribution with zero mean and unit variance. Both are passed as inputs to your network.
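For a Gaussian encoder and a standard normal prior, the KL regularization term has a closed form, 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2). The lecture does not derive that formula, but it is the standard result for KL(N(mu, sigma^2) || N(0, I)), and a sketch of the two-term loss using it might look like:

```python
import numpy as np

def vae_loss(x, x_prime, mu, log_var):
    """Per-example VAE loss: reconstruction error plus KL regularization."""
    # Term 1: reconstruction error (squared error stands in for the
    # negative log-likelihood of a Gaussian decoder).
    recon = np.sum((x_prime - x) ** 2)
    # Term 2: KL( N(mu, sigma^2) || N(0, I) ) in closed form.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return float(recon + kl)

# When the encoder already matches the prior (mu = 0, sigma = 1),
# the KL term vanishes and only the reconstruction error remains.
x = np.array([1.0, 2.0])
loss = vae_loss(x, x_prime=x, mu=np.zeros(3), log_var=np.zeros(3))
```

With a perfect reconstruction and an encoder equal to the prior, both terms are zero; any mismatch in either makes the loss positive.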
Then you go through the encoder network to get mu_x and Sigma_x. Instead of sampling, you calculate z with the equation z = mu_x + Sigma_x^(1/2) * epsilon, where Sigma_x^(1/2) is just the standard deviation: mean plus standard deviation times epsilon. This is exactly equivalent to sampling from a Gaussian with parameters mu_x and Sigma_x: as long as you have a sample from a normal distribution with zero mean and unit variance, you can construct a sample with a different mean and variance very easily. This equivalence is crucial. Now you have z without doing any sampling inside the network, because you are literally just calculating it with this equation; then you pass z through the decoder network, which outputs the reconstructed version x'. This way you can still do backpropagation, because the equation is an exact, deterministic equality; there is no sampling step anymore. All you need to do is introduce an extra input alongside your original input, drawn with zero mean and unit variance, and you can still do backpropagation with this reparameterization trick. That's the introduction to the VAE from the deep-learning view. You may wonder, "Why are all these conditional probabilities and normal distributions there?" It turns out the VAE has a very strong theoretical foundation. We will cover that in the next few slides to give you a flavor of why the VAE is a solid model. For any generative model, you are trying to learn the joint distribution, and an important step is figuring out how to calculate the posterior probability, which in this particular case is p(z | x). If you are familiar with Bayes' rule, you will see that this posterior probability is p(z | x) = p(x | z) p(z) / p(x).
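We can check numerically that z = mu + sigma * epsilon with epsilon ~ N(0, 1) really does have mean mu and standard deviation sigma. The particular values mu = 3, sigma = 2 are just for illustration (here sigma denotes the standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 3.0, 2.0
eps = rng.standard_normal(100_000)   # epsilon ~ N(0, 1), the extra input
z = mu + sigma * eps                 # deterministic function of (mu, sigma, eps)

# The sample statistics of z match N(mu, sigma^2).
assert abs(z.mean() - mu) < 0.05
assert abs(z.std() - sigma) < 0.05
```

Because z is now a deterministic function of mu, sigma, and the external noise eps, gradients can flow back through mu and sigma during training; the randomness has been moved into an input that needs no gradient.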
The tricky part of this calculation is the denominator p(x), because p(x) is the integral of the joint distribution p(x, z) over z. This integral is expensive because you have to integrate over all values of z, so it is difficult to compute, and therefore the posterior probability for an arbitrary model is difficult to calculate. Instead, maybe we can approximate p_phi(z | x) with another distribution q_theta(z | x), where q_theta is a simpler distribution. The true posterior p_phi(z | x) is hard to compute, but we can force q_theta(z | x) to be something simpler, in this particular case a normal distribution, such that the KL divergence between the two distributions, KL(q_theta(z | x) || p_phi(z | x)), is small. We try to minimize this divergence by finding the right q_theta(z | x); that is our objective. The way to get there is to recognize the following equality, which you can verify offline. The log-likelihood log p_phi(x) can be treated as a constant because x is the input data; it is the log-likelihood of the data, the evidence we see. It equals the sum of two terms. The first term is the crazy-looking expectation, under q_theta, of log(p_phi(x, z) / q_theta(z | x)). The second term is the KL divergence we are looking for, KL(q_theta(z | x) || p_phi(z | x)). The first term has a name: the evidence lower bound, or ELBO. Because the sum of these two terms equals a constant, minimizing the second term, which makes the two distributions close, is equivalent to maximizing the first term, the ELBO. That is what we do in this computation, because the ELBO turns out to be simpler.
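Written out, the equality just described is:

```latex
\log p_\phi(x)
  = \underbrace{\mathbb{E}_{q_\theta(z\mid x)}\!\left[\log \frac{p_\phi(x,z)}{q_\theta(z\mid x)}\right]}_{\text{ELBO}}
  \;+\; \underbrace{\mathrm{KL}\!\left(q_\theta(z\mid x)\,\middle\|\,p_\phi(z\mid x)\right)}_{\,\ge\, 0}
```

Since the left-hand side is fixed by the data and the KL divergence is never negative, maximizing the ELBO over q_theta is the same as minimizing the KL term, and the ELBO is a lower bound on the log evidence.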
If you carry out some derivation, you will find that this ELBO term is actually the sum of two terms: the expectation of log p_phi(x | z), minus the KL divergence between q_theta(z | x) and the prior distribution p_phi(z). You will recognize that the negative of this ELBO is exactly the per-data-point loss we showed earlier in the deep-learning view: the first term becomes the negative log-likelihood, which gives the reconstruction error, and the second term is the regularization we have been talking about. That's why they are equivalent. This shows the theoretical foundation of the VAE: why this particular KL divergence shows up in the loss function, and why we model the VAE this particular way.
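The derivation alluded to above is short: split the joint as p_phi(x, z) = p_phi(x | z) p_phi(z) inside the ELBO:

```latex
\mathbb{E}_{q_\theta(z\mid x)}\!\left[\log \frac{p_\phi(x,z)}{q_\theta(z\mid x)}\right]
  = \mathbb{E}_{q_\theta(z\mid x)}\!\left[\log p_\phi(x\mid z)\right]
    + \mathbb{E}_{q_\theta(z\mid x)}\!\left[\log \frac{p_\phi(z)}{q_\theta(z\mid x)}\right]
  = \mathbb{E}_{q_\theta(z\mid x)}\!\left[\log p_\phi(x\mid z)\right]
    - \mathrm{KL}\!\left(q_\theta(z\mid x)\,\middle\|\,p_\phi(z)\right)
```

Negating both sides turns the expected log-likelihood into the reconstruction error and leaves the KL term as the regularizer, matching the loss from the deep-learning view.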