Overview of generative models

For more detail, refer to generative models.

Definition of generative models

Given training data, generate new samples from the same distribution:
\begin{aligned}
p_{\text{model}}(x) \approx p_{\text{data}}(x)
\end{aligned}

Taxonomy of Generative models

'taxonomy of generative models'

PixelRNN and PixelCNN

  • these methods belong to the family of explicit density models.

  • use the chain rule to decompose the likelihood of an image $x$ into a product of 1-d conditional distributions (see the sampling sketch after this list):

\begin{aligned}
p_{\theta}(x)=\prod_{i=1}^{n} p_{\theta}\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)
\end{aligned}

where $x_i$ represents the $i$-th pixel and $p_{\theta}(x)$ denotes the likelihood of image $x$.

  • it is important to define an ordering of the pixels (e.g. raster-scan order).

  • the complex distribution over pixel values can be modeled with neural networks.

  • pros: the likelihood $p(x)$ can be computed explicitly.

  • cons: sequential generation is slow.
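
To make the chain-rule factorization concrete, the following is a minimal sketch of sequential (autoregressive) sampling and explicit likelihood evaluation. It assumes a toy logistic conditional model over binary pixels; the weights `W`, `b` and the 28x28 image size are placeholders, not a trained PixelRNN/PixelCNN.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 28 * 28                       # number of pixels in a flattened binary image
W = rng.normal(0, 0.01, (n, n))   # placeholder parameters of p(x_i | x_<i)
b = np.zeros(n)

def conditional(x, i):
    """Toy logistic model for p(x_i = 1 | x_1, ..., x_{i-1})."""
    logit = x[:i] @ W[i, :i] + b[i]
    return 1.0 / (1.0 + np.exp(-logit))

# Sequential generation: pixels are sampled one at a time in a fixed ordering,
# which is why sampling from autoregressive pixel models is slow.
x = np.zeros(n)
for i in range(n):
    x[i] = rng.random() < conditional(x, i)

# The explicit likelihood p(x) is the product of the 1-d conditionals,
# accumulated here in log space.
log_p = sum(np.log(conditional(x, i)) if x[i] else np.log(1.0 - conditional(x, i))
            for i in range(n))
```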

PixelRNN

  • generate image pixels starting from the corner, using an RNN (LSTM) to model the dependency on previous pixels.

  • the drawback is that sequential generation is slow.

PixelCNN

  • generate image pixels starting from the corner, using a CNN over a context region of already-generated pixels; training is faster than PixelRNN because the convolutions over training images can be parallelized (see the masked-convolution sketch below).

  • generation still proceeds sequentially, so it remains slow.
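
One common way to realize the "context region" idea is a masked convolution: the kernel is zeroed at the current pixel and at every pixel that comes later in the raster-scan order. Below is a minimal sketch assuming PyTorch; the layer sizes are illustrative, and this is not the published PixelCNN architecture (which additionally distinguishes a stricter first-layer mask from the masks used in later layers).

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so that output pixel (i, j) never
    depends on input pixel (i, j) itself or on any later pixel in raster order."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kh // 2, kw // 2:] = 0   # the current pixel and pixels to its right
        mask[:, :, kh // 2 + 1:, :] = 0     # all rows below the current pixel
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask       # enforce the autoregressive context region
        return super().forward(x)

# Stacking masked convolutions lets training run in parallel over all pixels of
# a training image, but sampling still has to fill in pixels one at a time.
net = nn.Sequential(
    MaskedConv2d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
    MaskedConv2d(16, 1, kernel_size=7, padding=3),
)
logits = net(torch.zeros(1, 1, 28, 28))     # per-pixel logits for binary pixels
```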

Variational Autoencoders

\begin{aligned}
p_{\theta}(x)=\int p_{\theta}(z) p_{\theta}(x \mid z) \, dz
\end{aligned}

Autoencoders

  • map an image $x$ to features $z$ with a deep neural network; $z$ is expected to capture meaningful factors of variation in the data.
  • learn the features $z$ from the reconstruction error: $x \rightarrow z \rightarrow \hat{x}$ with an encoder and a decoder, as sketched below.
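
A minimal sketch of the $x \rightarrow z \rightarrow \hat{x}$ pipeline, assuming PyTorch; the layer sizes, the 32-dimensional feature $z$, and the random input batch are illustrative placeholders.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.rand(64, 784)          # a batch of flattened images (placeholder data)
z = encoder(x)                   # features z meant to capture factors of variation
x_hat = decoder(z)               # reconstruction of the input

# The learning signal is the reconstruction error between x and x_hat;
# no labels are needed, so the features are learned in an unsupervised way.
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()
```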

Variational Autoencoders for the generation problem

  • assume each training sample $x^{(i)}$ is generated from an unobserved latent representation $z$: first sample $z^{(i)}$ from the true prior $p_{\theta}(z)$, then sample $x^{(i)}$ from the true conditional $p_{\theta}(x \mid z^{(i)})$ (see the sampling sketch after this list).

  • the aim is to estimate the true parameters $\theta$ by maximizing the likelihood of the training data:

\begin{equation}
p_{\theta}(x)=\int p_{\theta}(z) p_{\theta}(x \mid z) \, dz
\end{equation}

where $p_{\theta}(z)$ is a Gaussian prior and $p_{\theta}(x \mid z)$ is the decoder network.

  • this is an intractable optimization problem, because it is impossible to compute $p_{\theta}(x \mid z)$ for every $z$, so the integral cannot be evaluated directly.

  • the posterior density $p_{\theta}(z \mid x)=p_{\theta}(x \mid z) p_{\theta}(z) / p_{\theta}(x)$ is also intractable, since the data likelihood $p_{\theta}(x)$ in the denominator is intractable.
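
The assumed generative process itself is easy to write down: sample $z$ from the prior, then sample $x$ from the conditional. The sketch below assumes PyTorch; the untrained decoder is only a placeholder standing in for $p_{\theta}(x \mid z)$, and the difficulty described above lies in fitting $\theta$, not in sampling.

```python
import torch
import torch.nn as nn

latent_dim, image_dim = 32, 784
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, image_dim), nn.Sigmoid())  # placeholder p(x|z)

z = torch.randn(16, latent_dim)   # sample z from the Gaussian prior p_theta(z)
x_mean = decoder(z)               # per-pixel Bernoulli parameters of p_theta(x | z)
x = torch.bernoulli(x_mean)       # sample images from the conditional
```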

lower bound of VAE

  • why: the exact likelihood is intractable, so the optimization is transferred to a tractable lower bound.
  • how: in addition to the decoder network modeling $p_{\theta}(x \mid z)$, define an encoder network $q_{\phi}(z \mid x)$ that approximates the intractable posterior $p_{\theta}(z \mid x)$.
  • final optimization objective:
    \begin{aligned}
    \log p_{\theta}\left(x^{(i)}\right) &=\mathbf{E}_{z \sim q_{\phi}\left(z \mid x^{(i)}\right)}\left[\log p_{\theta}\left(x^{(i)}\right)\right] \\
    &=\mathbf{E}_{z}\left[\log \frac{p_{\theta}\left(x^{(i)} \mid z\right) p_{\theta}(z)}{p_{\theta}\left(z \mid x^{(i)}\right)}\right] \\
    &=\mathbf{E}_{z}\left[\log \frac{p_{\theta}\left(x^{(i)} \mid z\right) p_{\theta}(z)}{p_{\theta}\left(z \mid x^{(i)}\right)} \frac{q_{\phi}\left(z \mid x^{(i)}\right)}{q_{\phi}\left(z \mid x^{(i)}\right)}\right] \\
    &=\mathbf{E}_{z}\left[\log p_{\theta}\left(x^{(i)} \mid z\right)\right]-\mathbf{E}_{z}\left[\log \frac{q_{\phi}\left(z \mid x^{(i)}\right)}{p_{\theta}(z)}\right]+\mathbf{E}_{z}\left[\log \frac{q_{\phi}\left(z \mid x^{(i)}\right)}{p_{\theta}\left(z \mid x^{(i)}\right)}\right] \\
    &=\mathbf{E}_{z}\left[\log p_{\theta}\left(x^{(i)} \mid z\right)\right]-D_{KL}\left(q_{\phi}\left(z \mid x^{(i)}\right) \,\|\, p_{\theta}(z)\right)+D_{KL}\left(q_{\phi}\left(z \mid x^{(i)}\right) \,\|\, p_{\theta}\left(z \mid x^{(i)}\right)\right)
    \end{aligned}
    where $\mathbf{E}_{z}\left[\log p_{\theta}\left(x^{(i)} \mid z\right)\right]$ is the reconstruction term for the input data, and $D_{KL}\left(q_{\phi}\left(z \mid x^{(i)}\right) \,\|\, p_{\theta}(z)\right)$ pushes the approximate posterior towards the prior.

  • lower bound: the derivation splits the intractable log-likelihood into tractable and intractable parts by introducing the encoder $q_{\phi}$, which produces two KL terms. The last term, $D_{KL}\left(q_{\phi}\left(z \mid x^{(i)}\right) \,\|\, p_{\theta}\left(z \mid x^{(i)}\right)\right)$, is intractable but always $\geq 0$, so the first two terms form a tractable lower bound on $\log p_{\theta}\left(x^{(i)}\right)$. Training maximizes this lower bound jointly over $\theta$ and $\phi$ (see the training sketch after this list).
    'lower bound'

  • the details of the optimization procedure are as follows:
    'procedure of optimization'
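
Putting the two tractable terms together gives the usual VAE training objective: maximize the reconstruction term and minimize the KL between the approximate posterior and the prior. The sketch below assumes PyTorch, a diagonal-Gaussian encoder with a standard normal prior (so the KL term has a closed form), and Bernoulli pixel likelihoods; the architectures and the random batch are illustrative placeholders.

```python
import torch
import torch.nn as nn

latent_dim = 32
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())

x = torch.rand(64, 784)                         # batch of flattened images (placeholder)
mu, log_var = encoder(x).chunk(2, dim=1)        # q_phi(z | x) = N(mu, diag(exp(log_var)))

# Reparameterization trick: sample z ~ q_phi(z | x) in a differentiable way.
z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)

# First lower-bound term: E_z[log p_theta(x | z)], the reconstruction term.
x_hat = decoder(z)
recon = -nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")

# Second term: D_KL(q_phi(z | x) || p_theta(z)) against a standard normal prior,
# which has this closed form for diagonal Gaussians.
kl = 0.5 * torch.sum(mu ** 2 + log_var.exp() - 1.0 - log_var)

elbo = recon - kl                               # tractable lower bound on log p_theta(x)
loss = -elbo / x.size(0)                        # maximize the ELBO = minimize its negative
loss.backward()
```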

GAN

\begin{equation}
\min_{G} \max_{D} V(D, G)=\mathbb{E}_{\boldsymbol{x} \sim p_{\text{data}}(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log (1-D(G(\boldsymbol{z})))]
\end{equation}

where $D(x)$ represents the probability that $x$ comes from the real data distribution, and $G(z)$ maps noise $z$ to a generated sample.
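
The minimax game can be sketched as two losses computed on a single batch. The sketch assumes PyTorch; the generator/discriminator architectures and the data batch are illustrative placeholders, and in practice the two networks are updated alternately (with the generator usually maximizing $\log D(G(z))$ instead of minimizing $\log(1-D(G(z)))$ to avoid saturating gradients).

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

x_real = torch.rand(32, 784)                 # batch of real images (placeholder data)
z = torch.randn(32, 64)                      # noise sampled from p_z(z)
x_fake = G(z)

# Discriminator ascends V(D, G): it wants D(x) -> 1 on real data and
# D(G(z)) -> 0 on generated data, so we minimize the negative of its objective.
d_loss = -(torch.log(D(x_real)).mean() + torch.log(1 - D(x_fake.detach())).mean())

# Generator descends V(D, G): it wants D(G(z)) -> 1, i.e. to fool D.
g_loss = torch.log(1 - D(x_fake)).mean()

# In a real training loop, gradients would be zeroed and each network
# updated with its own optimizer in alternating steps.
d_loss.backward()
g_loss.backward()
```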