Variational Autoencoders 2: Maths
Last time we saw the probability distribution of $X$ with a latent variable $z$ as follows:

$$P(X) = \int P(X \vert z) P(z) \, dz \tag{1}$$

and we said the key idea behind VAEs is to not sample $z$ from the whole distribution $P(z)$, but actually from a simpler distribution $Q(z \vert X)$. The reason is that most $z$ sampled from $P(z)$ will likely give $P(X \vert z)$ close to zero, and therefore make little contribution to the estimation of $P(X)$. Now if we sample $z \sim Q(z \vert X)$, those values of $z$ will be more likely to generate $X$ in the training set. Moreover, we hope that $Q(z \vert X)$ has fewer modes than $P(z)$, and is therefore easier to sample from. The intuition is that the locations of the modes of $Q(z \vert X)$ depend on $X$, and this flexibility compensates for the fact that $Q(z \vert X)$ is simpler than $P(z)$.
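To get a feel for how wasteful sampling from $P(z)$ can be, here is a small, made-up NumPy toy (the model, the numbers and the hand-picked $Q$ below are purely illustrative and not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up 1-D toy model, just for illustration:
#   prior    P(z)     = N(0, 1)
#   decoder  P(X | z) = N(X; 10 * z, 0.1^2)
# and a single observed data point X = 25, so only z near 2.5 matters.
sigma_x = 0.1
X = 25.0

def likelihood(z):
    """P(X | z) under the toy Gaussian decoder."""
    return np.exp(-0.5 * ((X - 10.0 * z) / sigma_x) ** 2) / (sigma_x * np.sqrt(2 * np.pi))

# Sampling z from the prior P(z): almost every sample gives P(X | z) ~ 0.
z_prior = rng.normal(0.0, 1.0, size=100_000)
print("useful samples from P(z):   ", np.mean(likelihood(z_prior) > 1e-6))  # roughly 0.2%

# Sampling z from an X-dependent Q(z | X) = N(2.5, 0.01^2), chosen by hand here:
# nearly every sample now gives a non-negligible P(X | z).
z_q = rng.normal(2.5, 0.01, size=100_000)
print("useful samples from Q(z|X): ", np.mean(likelihood(z_q) > 1e-6))      # roughly 100%
```

The point is only that an $X$-dependent $Q(z \vert X)$ concentrates its samples where $P(X \vert z)$ is non-negligible; how to obtain such a $Q$ in practice is what the rest of the maths is about.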
But how can $Q(z \vert X)$ help with modelling $P(X)$? If $z$ is sampled from $Q(z \vert X)$, then using $P(X \vert z)$ we will get $E_{z \sim Q}\left[P(X \vert z)\right]$. We will then need to show the relationship between this quantity and $P(X)$, which is the actual quantity we want to estimate. The relationship between $E_{z \sim Q}\left[P(X \vert z)\right]$ and $P(X)$ is the backbone of VAEs.
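To see why such a relationship is needed at all: the marginal likelihood is an expectation under the prior, so simply swapping the sampling distribution changes the quantity being estimated,

$$P(X) = \int P(X \vert z) P(z) \, dz = E_{z \sim P(z)}\left[P(X \vert z)\right] \;\neq\; E_{z \sim Q(z \vert X)}\left[P(X \vert z)\right] \text{ in general.}$$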
We start with the KL divergence of $Q(z \vert X)$ and $P(z \vert X)$:

$$\mathcal{D}\left[Q(z \vert X) \,\|\, P(z \vert X)\right] = E_{z \sim Q}\left[\log Q(z \vert X) - \log P(z \vert X)\right]$$

The unknown quantity in this equation is $P(z \vert X)$, but at least we can use Bayes rule for it:

$$\log P(z \vert X) = \log P(X \vert z) + \log P(z) - \log P(X)$$

Rearranging things a bit, and applying the definition of the KL divergence between $Q(z \vert X)$ and $P(z)$, we have:

$$\log P(X) - \mathcal{D}\left[Q(z \vert X) \,\|\, P(z \vert X)\right] = E_{z \sim Q}\left[\log P(X \vert z)\right] - \mathcal{D}\left[Q(z \vert X) \,\|\, P(z)\right] \tag{2}$$
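For completeness, here is the "rearranging" step spelled out; the only facts used are Bayes rule above and that $\log P(X)$ does not depend on $z$, so it moves out of the expectation:

$$\begin{aligned}
\mathcal{D}\left[Q(z \vert X) \,\|\, P(z \vert X)\right]
  &= E_{z \sim Q}\left[\log Q(z \vert X) - \log P(X \vert z) - \log P(z)\right] + \log P(X) \\
\log P(X) - \mathcal{D}\left[Q(z \vert X) \,\|\, P(z \vert X)\right]
  &= E_{z \sim Q}\left[\log P(X \vert z)\right] - E_{z \sim Q}\left[\log Q(z \vert X) - \log P(z)\right] \\
  &= E_{z \sim Q}\left[\log P(X \vert z)\right] - \mathcal{D}\left[Q(z \vert X) \,\|\, P(z)\right]
\end{aligned}$$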
If you forget everything, this formula is the thing you should remember. It is therefore important to understand what it means:
- The left-hand side is exactly what we want to optimize, $\log P(X)$, plus an error term $-\mathcal{D}\left[Q(z \vert X) \,\|\, P(z \vert X)\right]$. The smaller this error term is, the better we are at maximizing $\log P(X)$. In other words, the left-hand side is a lower bound of what we want to optimize (the KL divergence is never negative), hence the name variational (Bayesian).
- If $Q(z \vert X)$ happens to be a differentiable function, the right-hand side is something we can optimize with gradient descent (we will see how to do it later; a small numerical sketch of this right-hand side is given right after this list). Note that the right-hand side happens to take the form of an encoder and a decoder, where $Q(z \vert X)$ encodes $X$ into $z$, and then $P(X \vert z)$ decodes $z$ to reconstruct $X$, hence the name “Autoencoder”. However, VAEs don’t really belong to the family of Denoising and Sparse Autoencoders, although there are indeed some connections.
- Note that $P(z \vert X)$ on the left-hand side is intractable. However, by maximizing the left-hand side, we simultaneously minimize $\mathcal{D}\left[Q(z \vert X) \,\|\, P(z \vert X)\right]$, and therefore pull $Q(z \vert X)$ closer to $P(z \vert X)$. If we use a flexible model for $Q(z \vert X)$, then we can use $Q(z \vert X)$ as an approximation for $P(z \vert X)$. This is a nice side effect of the whole framework.
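To make the encoder/decoder reading of the right-hand side concrete, here is a minimal NumPy sketch of the right-hand side of (2). It assumes (an assumption of the sketch, not something derived above) a Gaussian $Q(z \vert X) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, a standard normal prior $P(z)$, and a Gaussian decoder with fixed noise $\sigma_x$; the linear "encoder" and "decoder" are placeholders just to make it self-contained, not trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, sigma):
    """Closed-form D[ N(mu, diag(sigma^2)) || N(0, I) ], summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0)

def elbo(X, encoder, decoder, sigma_x=0.1, n_samples=64):
    """Monte Carlo estimate of the right-hand side of (2):
    E_{z ~ Q}[log P(X | z)] - D[Q(z | X) || P(z)]."""
    mu, sigma = encoder(X)                          # Q(z | X) = N(mu, diag(sigma^2))
    z = mu + sigma * rng.standard_normal((n_samples, mu.size))
    X_hat = decoder(z)                              # mean of P(X | z), one row per z sample
    log_p_x_given_z = (-0.5 * np.sum(((X - X_hat) / sigma_x) ** 2, axis=1)
                       - X.size * np.log(sigma_x * np.sqrt(2.0 * np.pi)))
    return log_p_x_given_z.mean() - gaussian_kl(mu, sigma)

# Hypothetical linear "encoder" and "decoder", just to make the sketch runnable:
W = rng.standard_normal((2, 5))                     # 2-D latent, 5-D data
encoder = lambda X: (W @ X, np.full(2, 0.5))        # mu = W X, fixed sigma = 0.5
decoder = lambda z: z @ W                           # X_hat for each row of z

X = rng.standard_normal(5)
print("estimate of the right-hand side of (2):", elbo(X, encoder, decoder))
```

With real networks in place of the placeholders, this is the quantity one maximizes with gradient descent, which is exactly what we will set up next time.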
Actually the above maths existed way before VAEs. However, the trick was to use a feedforward network for $Q(z \vert X)$, which gave rise to VAEs several years ago.
Next time, we will see how to do that, and hopefully conclude this series. Then we can move on to something more interesting.
