Overview of generative models

For more detail, refer to generative models.

Definition of generative models

Given training data, generate new samples from the same distribution:
\begin{aligned}
p_{\text{model}}(x) \approx p_{\text{data}}(x)
\end{aligned}

Taxonomy of Generative models

'taxonomy of generative models'

PixelRNN and PixelCNN

  • these methods belong to the family of explicit density models.

  • use the chain rule to decompose the likelihood of an image $x$ into a product of 1-d conditional distributions (see the sampling sketch after this list):

\begin{aligned}
p_{\theta}(x)=\prod_{i=1}^{n} p_{\theta}\left(x_{i} \mid x_{1}, \ldots, x_{i-1}\right)
\end{aligned}

where $x_i$ represents the $i$-th pixel and $p_{\theta}(x)$ denotes the likelihood of image $x$.

  • it is important to define an ordering of the pixels (e.g. raster-scan order).

  • the complex distribution over pixel values can be modeled with neural networks.

  • pros: the likelihood $p(x)$ can be computed explicitly.

  • cons: sequential generation is slow.
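
To make the chain-rule factorization concrete, the following is a minimal sketch of sequential (autoregressive) sampling and explicit likelihood evaluation. It assumes a toy logistic conditional model over binary pixels; the weights `W`, `b` and the 28x28 image size are placeholders, not a trained PixelRNN/PixelCNN.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 28 * 28                       # number of pixels in a flattened binary image
W = rng.normal(0, 0.01, (n, n))   # placeholder parameters of p(x_i | x_<i)
b = np.zeros(n)

def conditional(x, i):
    """Toy logistic model for p(x_i = 1 | x_1, ..., x_{i-1})."""
    logit = x[:i] @ W[i, :i] + b[i]
    return 1.0 / (1.0 + np.exp(-logit))

# Sequential generation: pixels are sampled one at a time in a fixed ordering,
# which is why sampling from autoregressive pixel models is slow.
x = np.zeros(n)
for i in range(n):
    x[i] = rng.random() < conditional(x, i)

# The explicit likelihood p(x) is the product of the 1-d conditionals,
# accumulated here in log space.
log_p = sum(np.log(conditional(x, i)) if x[i] else np.log(1.0 - conditional(x, i))
            for i in range(n))
```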

PixelRNN

  • generate image pixels starting from the corner, using an RNN (LSTM) to model the dependency on previous pixels.

  • the drawback is that sequential generation is slow.

PixelCNN

  • generate image pixels starting from the corner, using a CNN over a context region of already-generated pixels; training is faster than PixelRNN because the convolutions over training images can be parallelized (see the masked-convolution sketch below).

  • generation still proceeds sequentially, so it remains slow.
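
One common way to realize the "context region" idea is a masked convolution: the kernel is zeroed at the current pixel and at every pixel that comes later in the raster-scan order. Below is a minimal sketch assuming PyTorch; the layer sizes are illustrative, and this is not the published PixelCNN architecture (which additionally distinguishes a stricter first-layer mask from the masks used in later layers).

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so that output pixel (i, j) never
    depends on input pixel (i, j) itself or on any later pixel in raster order."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kh // 2, kw // 2:] = 0   # the current pixel and pixels to its right
        mask[:, :, kh // 2 + 1:, :] = 0     # all rows below the current pixel
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask       # enforce the autoregressive context region
        return super().forward(x)

# Stacking masked convolutions lets training run in parallel over all pixels of
# a training image, but sampling still has to fill in pixels one at a time.
net = nn.Sequential(
    MaskedConv2d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
    MaskedConv2d(16, 1, kernel_size=7, padding=3),
)
logits = net(torch.zeros(1, 1, 28, 28))     # per-pixel logits for binary pixels
```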

Variational Autoencoders

\begin{aligned}
p_{\theta}(x)=\int p_{\theta}(z) p_{\theta}(x \mid z) \, dz
\end{aligned}

Autoencoders

  • map an image $x$ to features $z$ with a deep neural network; $z$ is expected to capture meaningful factors of variation in the data.
  • learn the features $z$ from the reconstruction error: $x \rightarrow z \rightarrow \hat{x}$ with an encoder and a decoder, as sketched below.
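
A minimal sketch of the $x \rightarrow z \rightarrow \hat{x}$ pipeline, assuming PyTorch; the layer sizes, the 32-dimensional feature $z$, and the random input batch are illustrative placeholders.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.rand(64, 784)          # a batch of flattened images (placeholder data)
z = encoder(x)                   # features z meant to capture factors of variation
x_hat = decoder(z)               # reconstruction of the input

# The learning signal is the reconstruction error between x and x_hat;
# no labels are needed, so the features are learned in an unsupervised way.
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()
```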

Variational Autoencoders for the generation problem

  • assume each training sample $x^{(i)}$ is generated from an unobserved latent representation $z$: first sample $z^{(i)}$ from the true prior $p_{\theta}(z)$, then sample $x^{(i)}$ from the true conditional $p_{\theta}(x \mid z^{(i)})$ (see the sampling sketch after this list).

  • the aim is to estimate the true parameters $\theta$ by maximizing the likelihood of the training data:

\begin{equation}
p_{\theta}(x)=\int p_{\theta}(z) p_{\theta}(x \mid z) \, dz
\end{equation}

where $p_{\theta}(z)$ is a Gaussian prior and $p_{\theta}(x \mid z)$ is the decoder network.

  • this is an intractable optimization problem, because it is impossible to compute $p_{\theta}(x \mid z)$ for every $z$, so the integral cannot be evaluated directly.

  • the posterior density $p_{\theta}(z \mid x)=p_{\theta}(x \mid z) p_{\theta}(z) / p_{\theta}(x)$ is also intractable, since the data likelihood $p_{\theta}(x)$ in the denominator is intractable.
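
The assumed generative process itself is easy to write down: sample $z$ from the prior, then sample $x$ from the conditional. The sketch below assumes PyTorch; the untrained decoder is only a placeholder standing in for $p_{\theta}(x \mid z)$, and the difficulty described above lies in fitting $\theta$, not in sampling.

```python
import torch
import torch.nn as nn

latent_dim, image_dim = 32, 784
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, image_dim), nn.Sigmoid())  # placeholder p(x|z)

z = torch.randn(16, latent_dim)   # sample z from the Gaussian prior p_theta(z)
x_mean = decoder(z)               # per-pixel Bernoulli parameters of p_theta(x | z)
x = torch.bernoulli(x_mean)       # sample images from the conditional
```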

lower bound of VAE

  • why: the exact likelihood is intractable, so the optimization is transferred to a tractable lower bound.
  • how: in addition to the decoder network modeling $p_{\theta}(x \mid z)$, define an encoder network $q_{\phi}(z \mid x)$ that approximates the intractable posterior $p_{\theta}(z \mid x)$.
  • final optimization objective:
    \begin{aligned}
    \log p_{\theta}\left(x^{(i)}\right) &=\mathbf{E}_{z \sim q_{\phi}\left(z \mid x^{(i)}\right)}\left[\log p_{\theta}\left(x^{(i)}\right)\right] \\
    &=\mathbf{E}_{z}\left[\log \frac{p_{\theta}\left(x^{(i)} \mid z\right) p_{\theta}(z)}{p_{\theta}\left(z \mid x^{(i)}\right)}\right] \\
    &=\mathbf{E}_{z}\left[\log \frac{p_{\theta}\left(x^{(i)} \mid z\right) p_{\theta}(z)}{p_{\theta}\left(z \mid x^{(i)}\right)} \frac{q_{\phi}\left(z \mid x^{(i)}\right)}{q_{\phi}\left(z \mid x^{(i)}\right)}\right] \\
    &=\mathbf{E}_{z}\left[\log p_{\theta}\left(x^{(i)} \mid z\right)\right]-\mathbf{E}_{z}\left[\log \frac{q_{\phi}\left(z \mid x^{(i)}\right)}{p_{\theta}(z)}\right]+\mathbf{E}_{z}\left[\log \frac{q_{\phi}\left(z \mid x^{(i)}\right)}{p_{\theta}\left(z \mid x^{(i)}\right)}\right] \\
    &=\mathbf{E}_{z}\left[\log p_{\theta}\left(x^{(i)} \mid z\right)\right]-D_{KL}\left(q_{\phi}\left(z \mid x^{(i)}\right) \,\|\, p_{\theta}(z)\right)+D_{KL}\left(q_{\phi}\left(z \mid x^{(i)}\right) \,\|\, p_{\theta}\left(z \mid x^{(i)}\right)\right)
    \end{aligned}
    where $\mathbf{E}_{z}\left[\log p_{\theta}\left(x^{(i)} \mid z\right)\right]$ is the reconstruction term for the input data, and $D_{KL}\left(q_{\phi}\left(z \mid x^{(i)}\right) \,\|\, p_{\theta}(z)\right)$ pushes the approximate posterior towards the prior.

  • lower bound: the derivation splits the intractable log-likelihood into tractable and intractable parts by introducing the encoder $q_{\phi}$, which produces two KL terms. The last term, $D_{KL}\left(q_{\phi}\left(z \mid x^{(i)}\right) \,\|\, p_{\theta}\left(z \mid x^{(i)}\right)\right)$, is intractable but always $\geq 0$, so the first two terms form a tractable lower bound on $\log p_{\theta}\left(x^{(i)}\right)$. Training maximizes this lower bound jointly over $\theta$ and $\phi$ (see the training sketch after this list).
    'lower bound'

  • the details of the optimization procedure are as follows:
    'procedure of optimization'
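
Putting the two tractable terms together gives the usual VAE training objective: maximize the reconstruction term and minimize the KL between the approximate posterior and the prior. The sketch below assumes PyTorch, a diagonal-Gaussian encoder with a standard normal prior (so the KL term has a closed form), and Bernoulli pixel likelihoods; the architectures and the random batch are illustrative placeholders.

```python
import torch
import torch.nn as nn

latent_dim = 32
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())

x = torch.rand(64, 784)                         # batch of flattened images (placeholder)
mu, log_var = encoder(x).chunk(2, dim=1)        # q_phi(z | x) = N(mu, diag(exp(log_var)))

# Reparameterization trick: sample z ~ q_phi(z | x) in a differentiable way.
z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)

# First lower-bound term: E_z[log p_theta(x | z)], the reconstruction term.
x_hat = decoder(z)
recon = -nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")

# Second term: D_KL(q_phi(z | x) || p_theta(z)) against a standard normal prior,
# which has this closed form for diagonal Gaussians.
kl = 0.5 * torch.sum(mu ** 2 + log_var.exp() - 1.0 - log_var)

elbo = recon - kl                               # tractable lower bound on log p_theta(x)
loss = -elbo / x.size(0)                        # maximize the ELBO = minimize its negative
loss.backward()
```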

GAN

\begin{equation}
\min_{G} \max_{D} V(D, G)=\mathbb{E}_{\boldsymbol{x} \sim p_{\text{data}}(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z} \sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log (1-D(G(\boldsymbol{z})))]
\end{equation}

where $D(x)$ represents the probability that $x$ comes from the real data distribution, and $G(z)$ maps noise $z$ to a generated sample.
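
The minimax game can be sketched as two losses computed on a single batch. The sketch assumes PyTorch; the generator/discriminator architectures and the data batch are illustrative placeholders, and in practice the two networks are updated alternately (with the generator usually maximizing $\log D(G(z))$ instead of minimizing $\log(1-D(G(z)))$ to avoid saturating gradients).

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

x_real = torch.rand(32, 784)                 # batch of real images (placeholder data)
z = torch.randn(32, 64)                      # noise sampled from p_z(z)
x_fake = G(z)

# Discriminator ascends V(D, G): it wants D(x) -> 1 on real data and
# D(G(z)) -> 0 on generated data, so we minimize the negative of its objective.
d_loss = -(torch.log(D(x_real)).mean() + torch.log(1 - D(x_fake.detach())).mean())

# Generator descends V(D, G): it wants D(G(z)) -> 1, i.e. to fool D.
g_loss = torch.log(1 - D(x_fake)).mean()

# In a real training loop, gradients would be zeroed and each network
# updated with its own optimizer in alternating steps.
d_loss.backward()
g_loss.backward()
```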