Variational Autoencoders 1: Overview
Variational Autoencoders 2: Maths
Variational Autoencoders 3: Training, Inference and comparison with other models
Recalling that the backbone of VAEs is the following equation:
$$\log p(x) \;\ge\; \mathbb{E}_{z \sim q(z|x)}\big[\log p(x|z)\big] \;-\; D_{KL}\big(q(z|x)\,\|\,p(z)\big)$$
In order to run gradient descent on the right-hand side, we need a tractable way to compute each term:
- The first part, $\mathbb{E}_{z \sim q(z|x)}[\log p(x|z)]$, is tricky because a good approximation of the expectation requires passing multiple samples of $z$ through $p(x|z)$, and this is expensive. However, we can just take one sample of $z$, pass it through $p(x|z)$, and use it as an estimate of the expectation. We are doing stochastic gradient descent over different samples $x$ in the training set anyway, so the noise averages out over steps.
- The second part, $D_{KL}\big(q(z|x)\,\|\,p(z)\big)$, is even more tricky. By design, we fix $p(z)$ to be the standard normal distribution $\mathcal{N}(0, I)$ (read part 1 to know why). Therefore, we need a way to parameterize $q(z|x)$ so that the KL divergence is tractable.
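The single-sample estimate from the first bullet can be illustrated with a toy expectation. This is only a sketch: `f` is a stand-in for $\log p(x|z)$, and the point is that one sample per step is a noisy but unbiased estimator that converges once averaged over many SGD steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for E_{z~q}[ log p(x|z) ]: estimate E[f(z)] with z ~ N(0, 1).
f = lambda z: z ** 2  # the true expectation E[z^2] is 1.0

# One sample per "training step" is a noisy estimate of the expectation...
one_sample = f(rng.standard_normal())

# ...but averaged over many steps (as SGD effectively does), it converges.
many_steps = np.mean(f(rng.standard_normal(100_000)))
```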
Here comes perhaps the most important approximation of VAEs. Since $p(z)$ is standard Gaussian, it is convenient to have $q(z|x)$ also be Gaussian. One popular way to parameterize $q(z|x)$ is to make it a Gaussian with mean $\mu(x)$ and diagonal covariance $\sigma^2(x) I$, i.e. $q(z|x) = \mathcal{N}\big(z;\, \mu(x),\, \sigma^2(x) I\big)$, where $\mu(x)$ and $\sigma(x)$ are two vectors computed by a neural network. This is the original formulation of VAEs, given in section 3 of the paper that introduced them (Kingma and Welling, "Auto-Encoding Variational Bayes").
This parameterization is preferred because the KL divergence now has a closed form:
$$D_{KL}\big(q(z|x)\,\|\,p(z)\big) \;=\; \frac{1}{2} \sum_{j} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)$$
Although this looks like magic, it is quite natural once you apply the definition of KL divergence to two normal distributions. Working it out yourself is a nice calculus exercise.
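As a sanity check, the closed-form expression is easy to code up. A minimal numpy sketch, where `log_var` parameterizes $\log \sigma^2$ (a common choice for numerical stability; the function name is my own):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # D_KL( N(mu, diag(sigma^2)) || N(0, I) )
    #   = 1/2 * sum_j ( mu_j^2 + sigma_j^2 - log(sigma_j^2) - 1 )
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

# When q(z|x) already equals the prior (mu = 0, sigma = 1), the KL is zero.
print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))  # 0.0
```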
So we have all the ingredients. We use a feedforward net to predict $\mu(x)$ and $\sigma(x)$ given an input sample $x$ drawn from the training set. With those vectors, we can compute the KL divergence and $\mathbb{E}_{z \sim q(z|x)}[\log p(x|z)]$, which, in terms of optimization, translates into a reconstruction term similar to $\|x - \hat{x}\|^2$.
It is worth pausing here for a moment to see what we just did. Basically we used a constrained Gaussian (with a diagonal covariance matrix) to parameterize $q(z|x)$. Moreover, by using $\|x - \hat{x}\|^2$ as one of the training criteria, we implicitly assume $p(x|z)$ to be Gaussian as well. So although the maths that lead to VAEs are generic and beautiful, at the end of the day, to make things tractable, we ended up making those severe approximations. Whether they are good enough depends entirely on the practical application.
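Putting the two criteria together, one training step minimizes the reconstruction error plus the KL term. A minimal numpy sketch of the loss only (in practice $\hat{x}$, $\mu$ and $\log \sigma^2$ come out of the decoder and encoder networks; the function name is hypothetical):

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    # Squared-error reconstruction (the implicit Gaussian p(x|z))...
    recon = np.sum((x - x_hat) ** 2)
    # ...plus the closed-form KL between N(mu, sigma^2 I) and N(0, I).
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)
    return recon + kl

x = np.array([1.0, 0.0])
# Perfect reconstruction with q(z|x) matching the prior gives zero loss.
print(vae_loss(x, x_hat=x, mu=np.zeros(2), log_var=np.zeros(2)))  # 0.0
```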
There is an important detail though. Once we have $\mu(x)$ and $\sigma(x)$ from the encoder, we need to sample $z$ from the Gaussian distribution parameterized by those vectors. $z$ is needed for the decoder to reconstruct $\hat{x}$, which is then optimized to be as close to $x$ as possible via gradient descent. Unfortunately, the "sample" step is not differentiable, so we need a trick called reparameterization: we don't sample $z$ directly from $\mathcal{N}\big(\mu(x), \sigma^2(x) I\big)$, but first sample $\epsilon$ from $\mathcal{N}(0, I)$ and then compute $z = \mu(x) + \sigma(x) \odot \epsilon$. This makes the whole computation differentiable, and we can apply gradient descent as usual.
The cool thing is that during inference, you won't need the encoder to compute $\mu(x)$ and $\sigma(x)$ at all! Remember that during training, we pull $q(z|x)$ to be close to $p(z)$ (which is standard normal), so during inference, we can just sample $z \sim \mathcal{N}(0, I)$, inject it directly into the decoder, and get a sample of $x$. This is how we leverage the power of "generation" from VAEs.
There are various extensions to VAEs, like Conditional VAEs and so on, but once you understand the basics, everything else is just nuts and bolts.
To sum up the series, this is the conceptual graph of VAEs during training, compared to some other models. Of course there are many details in those graphs that are left out, but you should get a rough idea about how they work.
[Figure: conceptual graphs of VAEs and other models during training.]
In the case of VAEs, I added the additional cost term in blue to highlight it. The cost terms for the other models, except GANs, are the usual L2 norm $\|x - \hat{x}\|^2$.
GSN is an extension of the Denoising Autoencoder with explicit hidden variables; however, it requires forming a fairly complicated Markov chain. We may have another post on it.
With this diagram, hopefully you will see how lame GANs are: even simpler than the humble RBM. However, the simplicity of GANs is exactly what makes them so powerful, while the complexity of VAEs makes them quite an effort just to understand. Moreover, VAEs make quite a few severe approximations, which might explain why samples generated from VAEs are far less realistic than those from GANs.
That’s quite enough for now. Next time we will switch to another topic I’ve been looking into recently.
