expectation maximization from jensen’s inequality

Abstract

In this post we introduce a common view on Expectation Maximization using Jensen’s inequality.

Reference

EM and maximum likelihood estimation

The goal of EM is to maximize the likelihood of observed variable X explained by hidden variable Z under a model with
parameters $theta$ . Noted that when Z is known (usually as labels), this problem becomes a supervised learning problem and
it can be solved directly using maximum likelihood estimaation instead of EM. For a training dataset with size $n$ , the expected log likelihood of
a dataset with instances $x_1$ , $x_2$ , … $xn$ drawn from distribution $p{true}(x)$ is given by:

$begin{align} mathbb{E}_x (log p_{theta}(x)) approx frac{1}{n} sum_{i=1}^{n} log p_{theta}(x_i), quad x_i sim p_{true}(x) tag{0} end{align}$

Our objective is to maximize (0) by adjusting $theta$

$% <![CDATA[ begin{align} operatorname*{argmax}_{theta} mathbb{E}_x (log p_{theta}(x_i)) &= operatorname*{argmax}_{theta} n cdot mathbb{E}_x (log p_{theta}(x_i)) \ &= operatorname*{argmax}_{theta} sum_{i=1}^{n} log p_{theta}(x_i) \ &= operatorname*{argmax}_{theta} sum_{i=1}^{n} log sum_{z_i} p_{theta}(x_i , z_i) \ &= operatorname*{argmax}_{theta} l(theta) tag{1} end{align} %]]&gt;$

Derivation from Jensen’s inequality

Starting from (1), derivation using Jensen’s inequality adds a “helper” distribution $q$ such that

$% <![CDATA[ begin{align} l(theta) &= sum_{i=1}^{n} log sum_{z_i} p_{theta}(x_i , z_i) \ &= sum_{i=1}^{n} log sum_{z_i} frac{q(z_i) p_{theta}(x_i , z_i)}{q(z_i)} tag{2} \ & geq sum_{i=1}^{n} sum_{z_i} q(z_i) log frac{p_{theta}(x_i , z_i)}{q(z_i)} tag{3} end{align} %]]&gt;$

(2) is greater than or equal to (3) because Jensen’s inequaitliy states that $f(E[Y]) geq E[f(Y)]$
if $f$ is a concave function (for more details, look at Andrew’s note or wikipedia).
In our case, $f$ is $log$ which is a concave function and $Y$ is $frac{p(x_i , z_i; theta)}{q(z_i)}$ .
Using (3) as a lower bound of (2),
we want to raise (3) as close as possible to (2) such that (3) $=$ (2) to maximize the likelihood.
For this, Jensen’s inequality also states that when $Y$ is constant, $f(E[Y]) = E[f(Y)]$ .
Therefore, we can construct $q$ such that Y is equal to some constant c as follow

$Y = frac{p_{theta}(x_i , z_i)}{q(z_i)} = c tag{4}$

If we assume each instance has the same corresponding constant c, then we have

$frac{sum_{z_i} p_{theta}(x_i , z_i)}{sum_{z_i} q(z_i)} = c tag{5}$

Given $q(zi)$ is a probability distribution and $sum{z_i} q(z_i) = 1$ , we have

$c = sum_{z_i} p_{theta}(x_i , z_i) = p_{theta}(x_i) tag{6}$

Therefore, $q(z_i)$ can be written as follow

$q(z_i) = frac{p_{theta}(x_i , z_i)}{p_{theta}(x_i)} = p_{theta}(z_i | x_i) tag{7}$

Expectation Maximization

(7) is our E step

$q(z_i) = p_{theta}(z_i | x_i) tag{E-step}$

Using (7) to replace $q(zi)$ in (3), we obtain our M step assuming $p{theta}(x_i vert z_i)$ is constant

$% <![CDATA[ begin{align} theta_{new} &= operatorname*{argmax}_{theta} sum_{i=1}^{n} sum_{z_i} p_{theta}(z_i | x_i) log frac{p_{theta}(x_i , z_i)}{p_{theta}(z_i | x_i)} \ &= operatorname*{argmax}_{theta} sum_{i=1}^{n} sum_{z_i} p_{theta}(z_i | x_i) log p_{theta}(x_i , z_i) - sum_{z_i} p_{theta}(z_i | x_i) log p_{theta}(z_i | x_i) \ &= operatorname*{argmax}_{theta} sum_{i=1}^{n} sum_{z_i} p_{theta}(z_i | x_i) log p_{theta}(x_i , z_i) - C \ &= operatorname*{argmax}_{theta} sum_{i=1}^{n} sum_{z_i} p_{theta}(z_i | x_i) log p_{theta}(x_i , z_i) tag{M-step} end{align} %]]&gt;$

Notes

For (5), it seems this assumption is necessary but ignored from Andrew’s note and some other sources.
How does such assumption affect the construction of $q(z_i)$ ?

expectation maximization from jensen’s inequality

Abstract

Reference

EM and maximum likelihood estimation

Derivation from Jensen’s inequality

Expectation Maximization

Notes

近期文章

近期评论

标签

热门

文章归档

分类目录

功能