Mixture Model Exploration

1. What Is a Mixture Model?

Mixture model: a model in which the latent variable takes on one of a finite, unordered set of values. This latent variable represents unobserved subpopulations present in the sample.[^cs.princeton]

The quick formula for a mixture model:

$$p(x \mid \theta, \phi) = \sum_{i=1}^{K} \phi_i \, p(x \mid \theta_i)$$

Variables' quick reference:

| Variable | Meaning | Restriction |
| --- | --- | --- |
| $K$ | number of components | constant |
| $z$ | latent variable: one-hot vector $[0, \ldots, 0, 1, 0, \ldots, 0]$ describing each sample | $z \sim \mathrm{Multinomial}(\phi)$; $z_j \sim \mathrm{Multinomial}(w_j)$ |
| $\phi$ | prior probability of each mixture component | $\sum_{i=1}^{K} \phi_i = 1$ |
| $i$ / $k$ | the $i^{th}$ component | |
| $j$ | the $j^{th}$ sample | |
| $d$ | the $d^{th}$ dimension of a multivariate distribution | |
| $D$ | total number of dimensions for a multivariate distribution | constant |
| $n$ / $N$ | total number of data points or samples | constant |
| $\theta_i$ | set of parameters for the $i^{th}$ component distribution | |
| $x_j$ / $x$ | each sample point | |
| $w_j^i$ / $\gamma(z_j^i)$ | weight/responsibility of sample $x_j$ assigned to cluster $i$ | expectation of the assignment variable $z$: $w_j^i = P(z_j^i = 1 \mid x_j, \hat\theta) = E(z_j^i \mid x_j, \hat\theta)$ |
| $M$ | total number of categories for a multinomial distribution | |
| $m$ | the $m^{th}$ category in a multinomial mixture model | |
  • Mixture proportion and mixture component
    For a single sample $x$, the likelihood of it coming from the mixture model parameterized by $\theta$ equals the sum, over components, of the likelihood of $x$ coming from each distribution (mixture component) times the mixture proportion $\phi_i$, as in the formula above.


  • Clustering:
    If you get a new data sample $x^*$, decide which component it should be assigned to: write out the posterior and apply Bayes' rule (see the sketch after this list).

  • Likelihood
    The likelihood ($L$) and the log likelihood ($\ell$) of a set of data $N = \{x_1, \ldots, x_n\}$ are as follows (see the sketch after this list):

(Leave out $z_j$, as it doesn't make a difference in the calculation and is easier to understand.)

The parameters to estimate are $\theta$ and $\phi$. Solve by maximizing the log likelihood.
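A sketch of the two formulas referenced in the bullets above, in the notation of the variable table (my reconstruction; the original images are missing). First the posterior used for clustering a new sample $x^*$, then the likelihood and log likelihood of the data:

$$P(z^i = 1 \mid x^*) = \frac{\phi_i \, p(x^* \mid \theta_i)}{\sum_{k=1}^{K} \phi_k \, p(x^* \mid \theta_k)}$$

$$L(\theta, \phi) = \prod_{j=1}^{n} \sum_{i=1}^{K} \phi_i \, f(x_j; \theta_i), \qquad
\ell(\theta, \phi) = \sum_{j=1}^{n} \ln \sum_{i=1}^{K} \phi_i \, f(x_j; \theta_i)$$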


2. How to Solve a Mixture Model?

Try taking the derivative of the log-likelihood with respect to the parameters of one component, $\theta_m$ (the $m^{th}$ component). Note: $f(x_j; \theta_m) = P(x_j \mid \theta_m)$.
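A sketch of that derivative step (my reconstruction, using the $f$ and $w$ notation from the table above):

$$\frac{\partial \ell}{\partial \theta_m}
= \sum_{j=1}^{n} \frac{\phi_m \, \partial f(x_j; \theta_m) / \partial \theta_m}{\sum_{i=1}^{K} \phi_i \, f(x_j; \theta_i)}
= \sum_{j=1}^{n} w_j^m \, \frac{\partial \ln f(x_j; \theta_m)}{\partial \theta_m},
\qquad
w_j^m = \frac{\phi_m \, f(x_j; \theta_m)}{\sum_{i=1}^{K} \phi_i \, f(x_j; \theta_i)}$$

Setting this to zero is a weighted maximum-likelihood condition, which leads directly to the circularity described below.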




We are doing weighted maximum likelihood, with weights given by the posterior cluster probabilities. These, to repeat, depend on the parameters ($\theta$) we are trying to estimate: a vicious circle. [^stats.cmu]

Introduction to the EM Algorithm

K-Means: alternate between assigning points to clusters and finding the cluster centers.
EM: find out which component each point is most likely to have come from; re-estimate the components using only the points assigned to it.

Algorithm steps:

| Step | K-Means | EM |
| --- | --- | --- |
| 1 | Set $K$; choose initial centroids | Set $K$; initialize $\theta_1, \ldots, \theta_K$ and $\phi_1, \ldots, \phi_K$ |
| 2 | Assign each data point $x_j$ to a cluster $i$ ($z_j^i = 1$) by a distance function | E: using the current parameter guesses, calculate the weights (soft assignment variable $w = \gamma(z)$: the posterior probability of $x_j$ belonging to each cluster) |
| 3 | Recalculate the centroid positions: $\mu_i = \frac{\sum_{j=1}^{n} z_j^i x_j}{\sum_{j=1}^{n} z_j^i}$ | M: using the current weights, maximize the weighted likelihood to get new parameter estimates. For each component, take the samples in proportion to the weight assigned to it, then estimate the parameters as for the basic distribution; then recalculate $\phi$. $\color{Red}{\text{why is } \phi \text{ calculated this way?}}$ |
| 4 | Iterate 2 and 3 until the assignments $z$ no longer change | Iterate 2 and 3 until the change in the log-likelihood falls below some threshold. Return the final parameter estimates (including the mixing proportions $\phi$) and the cluster probabilities $w$ for each sample |

Characteristics:

| K-Means | EM |
| --- | --- |
| largely dependent on the initial assignment; no guarantee | guaranteed to converge to a local optimum |

Calculation of the EM algorithm (exemplified with a Gaussian mixture)

  1. Initialize $\theta_k = (\mu_k, \Sigma_k)$ and $\phi_k$, and evaluate the initial value of the log likelihood.
  2. E step: evaluate the responsibilities/weights using the current parameter values (the update equations are sketched below).

    $\color{Red}{\text{connection to naive Bayes?}}$
    $\color{Red}{\text{Yes! There are mixtures of naive Bayes}}$
  3. M step: re-estimate the parameters using the current responsibilities, where
    $N_k = \sum_{j=1}^{n} \gamma(z_j^k)$
  4. Evaluate the log likelihood.

And check for convergence of either the parameters or the log likelihood.
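The missing update formulas in steps 2 and 3 are the standard Gaussian-mixture EM updates (as in Bishop's PRML, linked in the resources below); I am restating them here since the original images did not survive:

$$\text{E step:}\quad \gamma(z_j^k) = \frac{\phi_k \, \mathcal{N}(x_j \mid \mu_k, \Sigma_k)}{\sum_{i=1}^{K} \phi_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i)}$$

$$\text{M step:}\quad \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{j=1}^{n} \gamma(z_j^k)\, x_j, \qquad
\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{j=1}^{n} \gamma(z_j^k)\, (x_j - \mu_k^{\text{new}})(x_j - \mu_k^{\text{new}})^T, \qquad
\phi_k^{\text{new}} = \frac{N_k}{n}$$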

Bonus: Proof and Details about the EM Algorithm

Read section: more about EM algorithm


3. (Multivariate) Bernoulli / Multinomial Mixture Models

[mixture model summary](https://users.ics.aalto.fi/jhollmen/Publications/courseware.pdf)

Bernoulli Mixture model:

  • pattern recognition: hand-written letter recognition
    http://users.dsic.upv.es/~ajuan/research/2004/Juan04_08b.pdf
  • finding motifs in sequence data (e.g. DNA)

Consider a multivariate Bernoulli distribution:
a set of D binary variables $x_d$, where $d = 1, \ldots, D$, each of which is governed by a Bernoulli distribution with parameter $\mu_d$, so that

$$p(x_1, \ldots, x_D \mid \mu_1, \ldots, \mu_D) = \prod_{d=1}^{D} \mu_d^{x_d} (1 - \mu_d)^{1 - x_d}$$

$E[x] = \mu$
$\operatorname{cov}[x] = \operatorname{diag}\{\mu_d (1 - \mu_d)\}$
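A quick numeric check of this formula (my own toy numbers, not from the source): with $D = 2$ and $\mu = (0.9, 0.2)$, the probability of observing $x = (1, 0)$ is

$$p(1, 0 \mid \mu) = 0.9^{1}(1 - 0.9)^{0} \cdot 0.2^{0}(1 - 0.2)^{1} = 0.9 \times 0.8 = 0.72$$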

---

  • Be careful: for a multivariate distribution, the mean is a D-dimensional vector and the covariance is a D by D matrix.

<font color=blue> A problem: the covariance here can't really capture the correlation between different dimensions. (Consider each $x_d$ to represent a pixel in a picture; the chance of all pixels taking on values exactly the same as they are in this picture is then a multivariate Bernoulli distribution.)
</font>
Solution: to capture the inter-pixel correlation, consider a finite mixture of these distributions, given by $p(x \mid \mu, \phi) = \sum_{k=1}^{K} \phi_k \, p(x \mid \mu_k)$

The mean and covariance of this mixture distribution are given by:
$E[x] = \sum_{k=1}^{K} \phi_k \mu_k$
$\operatorname{cov}[x] = \sum_{k=1}^{K} \phi_k \{\Sigma_k + \mu_k \mu_k^T\} - E[x] E[x]^T$, where $\Sigma_k = \operatorname{diag}\{\mu_{kd} (1 - \mu_{kd})\}$

$\color{Red}{\text{Task}}$
[Task: Review and check the covariance matrix calculations for general, Bernoulli, and multinomial distributions]

Problem solved: the covariance matrix is no longer diagonal, so the mixture distribution can capture correlations between variables, unlike a single Bernoulli distribution.

Bernoulli Mixture EM Calculation:

Derive the EM algorithm from the log likelihood below for a mixture of Bernoulli distributions.

The complete data likelihood:

TRICK: $\ln\left(\sum_{k=1}^{K} z_j^k \, (\cdot)_k\right) = \sum_{k=1}^{K} z_j^k \ln (\cdot)_k$, since exactly one $z_j^k = 1$ and the rest are 0

The complete data log likelihood:

Take the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables $\mathbf{z}$:

where $\gamma(z_{jk}) = E[z_{jk}]$

E step:


Considering the sum over $j$ in the $E_z$ equation, the responsibilities enter only through two terms:


M step: maximize the expected complete-data log likelihood with respect to $\mu_k$ and $\phi$ by setting the derivative w.r.t. $\mu_k$ to zero. *That's why the lengthy $E_z[\ln p(\mathbf{x}, \mathbf{z} \mid \mu, \phi)]$ needs to be calculated.*

Obtain

w.r.t. $\phi$, use a Lagrange multiplier to enforce the constraint $\sum_k \phi_k = 1$

Obtain
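Since the original derivation images are missing, here is a sketch of the standard results these steps arrive at for the Bernoulli mixture (following Bishop, PRML §9.3.3):

$$\text{E step:}\quad \gamma(z_{jk}) = \frac{\phi_k \, p(x_j \mid \mu_k)}{\sum_{l=1}^{K} \phi_l \, p(x_j \mid \mu_l)},
\qquad p(x_j \mid \mu_k) = \prod_{d=1}^{D} \mu_{kd}^{x_{jd}} (1 - \mu_{kd})^{1 - x_{jd}}$$

$$\text{M step:}\quad \mu_k = \frac{1}{N_k} \sum_{j=1}^{n} \gamma(z_{jk})\, x_j, \qquad
\phi_k = \frac{N_k}{n}, \qquad N_k = \sum_{j=1}^{n} \gamma(z_{jk})$$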

Extend to Multinomial Mixture Model:

Consider a D-dimensional variable $\mathbf{x}$, each of whose components $d$ is itself a multinomial variable of degree M, so that $\mathbf{x}$ is a binary vector with components $x_{dm}$, where $d = 1, \ldots, D$ and $m = 1, \ldots, M$, subject to the constraint that $\sum_m x_{dm} = 1$ for all $d$. Suppose that the distribution of these variables is described by a mixture of discrete multinomial distributions, so that

$$p(\mathbf{x} \mid \mu, \phi) = \sum_{k=1}^{K} \phi_k \, p(\mathbf{x} \mid \mu_k)$$

where

$$p(\mathbf{x} \mid \mu_k) = \prod_{d=1}^{D} \prod_{m=1}^{M} \mu_{kdm}^{x_{dm}}$$

Note:
$\mathbf{x} = [\mathbf{x}_1, \ldots, \mathbf{x}_D]$ and $\mathbf{x}_d = [x_{d1}, \ldots, x_{dM}]^T$
$\mu_{kdm} = p(x_{dm} = 1 \mid \mu_k)$
$\mu_k = [\mu_{k1}, \ldots, \mu_{kD}]$ and $\mu_{kd} = [\mu_{kd1}, \ldots, \mu_{kdM}]$, constrained by
$0 \le \mu_{kdm} \le 1$ and $\sum_m \mu_{kdm} = 1$ (to ensure a multinomial distribution for each component and dimension)

Derive the mixing coefficients $\phi_k$ and the component parameters $\mu_{kdm}$:

  1. Introduce a latent variable $z_n$ corresponding to each observation.
    $\color{Red}{\text{TODO: derivation not understood}}$

E step
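The E step formula is missing from my notes here; a sketch of the standard update in the notation above (analogous to the Bernoulli case):

$$\gamma(z_{nk}) = \frac{\phi_k \prod_{d=1}^{D} \prod_{m=1}^{M} \mu_{kdm}^{x_{ndm}}}{\sum_{l=1}^{K} \phi_l \prod_{d=1}^{D} \prod_{m=1}^{M} \mu_{ldm}^{x_{ndm}}}$$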


M step
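The M step formulas are also missing; a sketch of the standard updates obtained by weighted maximum likelihood with a Lagrange multiplier on $\sum_m \mu_{kdm} = 1$:

$$\mu_{kdm} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_{ndm}}{N_k}, \qquad
\phi_k = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$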



For clustering tasks, $\gamma(z_n) = [\gamma(z_{n1}), \ldots, \gamma(z_{nK})]$ is used.


Clustering with Both Categorical and Numeric Values

The difference in EM lies in $\theta$ and $p(x_n \mid \theta_k)$.

General Form:
E-step
Find $\gamma(z_{nk}) = \frac{\phi_k \, p(x_n \mid \theta_k)}{\sum_{l=1}^{K} \phi_l \, p(x_n \mid \theta_l)}$
M-step
Estimate the parameters $\theta$ using maximum likelihood estimation

Details
E-step
General:

Assume there is no dependency between numeric columns and categorical columns.
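A sketch of what that assumption gives for the component likelihood (my own notation: $x_n^{\text{num}}$ for the numeric columns and $x_{ndm}$ for the one-hot categorical columns, neither of which is named in the original):

$$p(x_n \mid \theta_k) = \mathcal{N}\!\left(x_n^{\text{num}} \mid \mu_k, \Sigma_k\right) \prod_{d \in \text{cat}} \prod_{m=1}^{M_d} \mu_{kdm}^{x_{ndm}}$$

The M step then applies the Gaussian updates to the numeric part and the multinomial updates to the categorical part, both weighted by the same responsibilities $\gamma(z_{nk})$.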

For Gaussian distributions fit independently in each dimension:

If the numeric dimensions are modeled by a multivariate Gaussian distribution:

M-step

Gaussian:


Multinomial:



Look for variance and covariance http://wcms.inf.ed.ac.uk/ipab/rlsc/lecture-notes/RLSC-Lec3.pdf
Image: perhaps https://pdfs.semanticscholar.org/6b8e/8ffc6d9ef96fe61bcc92152e1f63ed4c0d59.pdf


4. How to Code Mixture Models?

Python: pomegranate library
R: mixtools (mixture of multinomials)
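Since neither library is shown in use here, below is a minimal, self-contained numpy sketch of EM for a two-component univariate Gaussian mixture (it does not use pomegranate or mixtools; all names and toy data are my own), just to make the E/M loop from section 2 concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two 1-D Gaussian clusters
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])
n, K = x.shape[0], 2

# Initialization: mixing proportions, means, variances
phi = np.full(K, 1.0 / K)
mu = rng.choice(x, size=K, replace=False)
var = np.full(K, x.var())

def log_gauss(x, mu, var):
    # Log density of N(mu, var) evaluated pointwise
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

prev_ll = -np.inf
for it in range(200):
    # E step: responsibilities w[j, k] = P(z_j = k | x_j, current params)
    log_p = log_gauss(x[:, None], mu[None, :], var[None, :]) + np.log(phi)[None, :]
    log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
    w = np.exp(log_p - log_norm)

    # M step: weighted maximum likelihood updates
    Nk = w.sum(axis=0)
    phi = Nk / n
    mu = (w * x[:, None]).sum(axis=0) / Nk
    var = (w * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk

    # Convergence check on the log likelihood
    ll = log_norm.sum()
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print("mixing proportions:", phi)
print("means:", mu, "variances:", var)
```

The same loop structure carries over to Bernoulli or multinomial components: only `log_gauss` and the M step parameter updates change.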

5. Extensions of Mixture Models

Conjugate Prior: a way to get a reasonable prior $p(\theta)$

Introducing a prior distribution is similar to getting more valid observations of $x$: maximize the prior probability w.r.t. $\theta$,
then maximize the posterior probability (likelihood) of $\theta$.
$p(\theta \mid x)$ is in the same distribution family as $p(\theta)$.

| Distribution | Conjugate Prior | Process Involved |
| --- | --- | --- |
| Bernoulli | Beta | |
| Multinomial | Dirichlet | MCMC |
| Gaussian | | |
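As a quick illustration of the Beta-Bernoulli row (a standard result, not worked out in the original): with a $\mathrm{Beta}(a, b)$ prior on the Bernoulli parameter $\theta$ and $s$ successes in $n$ observations, the posterior stays in the Beta family:

$$p(\theta) = \mathrm{Beta}(\theta \mid a, b), \qquad
p(\theta \mid x_{1:n}) \propto \theta^{s}(1 - \theta)^{n - s} \, \theta^{a - 1}(1 - \theta)^{b - 1} \propto \mathrm{Beta}(\theta \mid a + s, \; b + n - s)$$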

Resources:

  1. http://home.iitk.ac.in/~apanda/cs300/5.pdf
  2. https://www.mrc-bsu.cam.ac.uk/wp-content/uploads/EM_slides.pdf
  3. https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html
  4. file:///C:/Users/dell/Desktop/10.1007%252F978-0-387-35768-3.pdf
  5. http://www.rmki.kfki.hu/~banmi/elte/bishop_em.pdf