1. What is a Mixture Model?
Mixture model: a model in which the latent variable takes on one of a finite, unordered set of values. This latent variable represents unobserved subpopulations present in the sample.[^cs.princeton]
Variables’ quick reference:

Variable | Meaning | Restriction
---|---|---
$K$ | number of mixture components | constant
$z$ | latent variable: $[0, \ldots, 0, 1, 0, \ldots, 0]$ | one $z$ per sample: a priori $z_j \sim \text{Multinomial}(\phi)$; a posteriori $z_j \sim \text{Multinomial}(w_j)$
$\phi$ | prior probability of each mixture component | $\sum_{i=1}^{K} \phi_i = 1$
$i$ / $k$ | the $i^{th}$ component |
$j$ | the $j^{th}$ sample |
$d$ | the $d^{th}$ dimension of a multivariate distribution |
$D$ | total number of dimensions of a multivariate distribution | constant
$n$ / $N$ | total number of data points (samples) | constant
$\theta_i$ | set of parameters of the $i^{th}$ component distribution |
$x_j$ / $x$ | each sample point |
$w_j^i$ / $\gamma(z_j^i)$ | weight/responsibility of sample $x_j$ assigned to cluster $i$ | expectation of the assignment variable $z$: $w_j^i = P(z_j^i = 1 \mid x_j, \hat{\theta}) = E(z_j^i \mid x_j, \hat{\theta})$
$M$ | total number of categories in the multinomial distribution |
$m$ | the $m^{th}$ category in the multinomial mixture model |
- Mixture proportion and mixture component
For a single sample $x$, the likelihood of it coming from the mixture model parameterized by $\theta$ is a sum over the components: (likelihood of $x$ under each component distribution) $\times$ (its mixture proportion $\phi$).
- Clustering:
If you get a new data sample $x^*$ and must decide which component it should go to: write out the posterior and apply Bayes' rule.
- Likelihood
The likelihood ($L$) and the log likelihood ($\ell$) of a data set $\{x_1, \ldots, x_n\}$ are written out in the sketch below.
(We leave out $z_j$, as it doesn't make a difference in the calculation and is easier to understand.)
The parameters to estimate are $\theta$ and $\phi$. Solve by maximizing the log likelihood.
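In the notation of the table above, a sketch of these standard forms (assuming each component $i$ has density $p(x \mid \theta_i)$; the exact expressions may be written slightly differently elsewhere in these notes):

$$p(x \mid \theta, \phi) = \sum_{i=1}^{K} \phi_i \, p(x \mid \theta_i)$$

$$P(z^i = 1 \mid x^*, \theta, \phi) = \frac{\phi_i \, p(x^* \mid \theta_i)}{\sum_{k=1}^{K} \phi_k \, p(x^* \mid \theta_k)} \quad \text{(clustering a new sample } x^* \text{ by Bayes' rule)}$$

$$L(\theta, \phi) = \prod_{j=1}^{n} \sum_{i=1}^{K} \phi_i \, p(x_j \mid \theta_i), \qquad \ell(\theta, \phi) = \sum_{j=1}^{n} \ln \sum_{i=1}^{K} \phi_i \, p(x_j \mid \theta_i)$$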
2. How to Solve a Mixture Model?
Try to take the derivative of the log-likelihood with respect to the parameters of one component, $\theta_m$ (the $m^{th}$ component). Note: $f(x_j; \theta_m) = p(x_j \mid \theta_m)$.
We are doing weighted maximum likelihood, with weights given by the posterior cluster probabilities. These, to repeat, depend on the parameters ($\theta$) we are trying to estimate: a vicious circle. [^stats.cmu]
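Setting this derivative to zero makes the circularity explicit (a sketch in the notation above; the bracketed factor is exactly the posterior weight $w_j^m$):

$$\frac{\partial \ell}{\partial \theta_m} = \sum_{j=1}^{n} \underbrace{\frac{\phi_m \, f(x_j; \theta_m)}{\sum_{k=1}^{K} \phi_k \, f(x_j; \theta_k)}}_{w_j^m} \, \frac{\partial \ln f(x_j; \theta_m)}{\partial \theta_m} = 0$$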
Introduction to EM Algorithm
K-Means: alternate between assigning points to clusters and finding the cluster centers.
EM: find out which component each point is most likely to have come from, then re-estimate each component using only the points (softly) assigned to it.
Algorithm steps:

Steps | K-Means | EM
---|---|---
1 | set $K$; choose initial centroids | set $K$; initialize $\theta_1 \ldots \theta_K$, $\phi_1 \ldots \phi_K$
2 | assign each data point $x_j$ to a cluster $i$ ($z_j^i = 1$) using a distance function | E: using the current parameter guesses, calculate the weights (soft assignment variables $w = \gamma(z)$: the posterior probability of $x_j$ belonging to each cluster)
3 | recalculate the positions of the centroids: $\mu_i = \frac{\sum_{j=1}^{n} z_j^i x_j}{\sum_{j=1}^{n} z_j^i}$ | M: using the current weights, maximize the weighted likelihood to get new parameter estimates. For each component, take the samples in proportion to the weights assigned to it, then do parameter estimation as in the basic distributions; then recalculate $\phi$. $\color{Red}{\text{why is } \phi \text{ calculated this way?}}$
4 | iterate 2 and 3 until the assignments $z$ no longer change | iterate 2 and 3 until the change in the log-likelihood falls below some threshold. Return the final parameter estimates (including the mixing proportions $\phi$) and the cluster probabilities $w$ for each sample
Characteristics:

K-Means | EM
---|---
result depends heavily on the initial assignment; no guarantee of a good solution | guaranteed to converge to a local optimum
Calculation of the EM algorithm (exemplified with a Gaussian mixture; the standard update equations are sketched after this list):
- Initialize the $\theta_k = (\mu_k, \Sigma_k)$ and $\phi_k$, and evaluate the initial value of the log likelihood.
- E step: evaluate the responsibilities/weights using the current parameter values.
  $\color{Red}{\text{connection to naive Bayes?}}$ $\color{Red}{\text{Yes! There are mixtures of naive Bayes.}}$
- M step: re-estimate the parameters using the current responsibilities, with $N_k = \sum_{j=1}^{n} \gamma(z_j^k)$.
- Evaluate the log likelihood and check for convergence of either the parameters or the log likelihood.
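A sketch of the standard Gaussian-mixture EM updates in this notation ($\mathcal{N}(\cdot \mid \mu, \Sigma)$ denotes the multivariate Gaussian density):

E step:
$$\gamma(z_j^k) = \frac{\phi_k \, \mathcal{N}(x_j \mid \mu_k, \Sigma_k)}{\sum_{i=1}^{K} \phi_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i)}$$

M step, with $N_k = \sum_{j=1}^{n} \gamma(z_j^k)$:
$$\mu_k^{new} = \frac{1}{N_k} \sum_{j=1}^{n} \gamma(z_j^k)\, x_j, \qquad \Sigma_k^{new} = \frac{1}{N_k} \sum_{j=1}^{n} \gamma(z_j^k)\,(x_j - \mu_k^{new})(x_j - \mu_k^{new})^T, \qquad \phi_k^{new} = \frac{N_k}{n}$$

Log likelihood (checked for convergence):
$$\ell = \sum_{j=1}^{n} \ln \sum_{k=1}^{K} \phi_k \, \mathcal{N}(x_j \mid \mu_k, \Sigma_k)$$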
Bonus: Proof and Details about the EM Algorithm
Read section: more about EM algorithm
3. (Multivariate) Bernoulli / Multinomial Mixture Models
https://users.ics.aalto.fi/jhollmen/Publications/courseware.pdf
mixture model summary
Bernoulli Mixture model:
- pattern recognition: hand-written letter recognition
  http://users.dsic.upv.es/~ajuan/research/2004/Juan04_08b.pdf
- finding motifs in sequence data (e.g. DNA)
Consider a multivariate Bernoulli distribution:
a set of $D$ binary variables $x_d$, where $d = 1, \ldots, D$, each of which is governed by a Bernoulli distribution with parameter $\mu_d$, so that
$p(x_1, \ldots, x_D \mid \mu_1, \ldots, \mu_D) = \prod_{d=1}^{D} \mu_d^{x_d} (1 - \mu_d)^{1 - x_d}$
$E[x] = \mu$
$cov[x] = \text{diag}\{\mu_d (1 - \mu_d)\}$
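For example, with $D = 3$ and $\mu = (0.9, 0.2, 0.5)$ (values chosen purely for illustration), the probability of observing $x = (1, 0, 1)$ is
$$p(x = (1, 0, 1) \mid \mu) = 0.9 \times (1 - 0.2) \times 0.5 = 0.36$$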
$\color{DarkTurquoise}{——}$
- Be careful: for a multivariate distribution, the mean is a $D$-dimensional vector and the covariance is a $D \times D$ matrix.
<font color=blue> A problem: the covariance here cannot capture correlations between different dimensions. (If each $x_d$ represents a pixel in a picture, then the probability of all pixels taking on exactly the values they have in this picture is a multivariate Bernoulli distribution.)
</font>
Solve: to capture the inter-pixel correlations, consider a finite mixture of these distributions, given by $p(x \mid \mu, \phi) = \sum_{k=1}^{K} \phi_k \, p(x \mid \mu_k)$
The mean and covariance of this mixture distribution are given by:
$E[x] = \sum_{k=1}^{K} \phi_k \mu_k$
$cov[x] = \sum_{k=1}^{K} \phi_k \{\Sigma_k + \mu_k \mu_k^T\} - E[x]E[x]^T$, where $\Sigma_k = \text{diag}\{\mu_{kd} (1 - \mu_{kd})\}$
$\color{Red}{\text{Task}}$
[Task: Review and check the covariance matrix calculations for the general, Bernoulli, and multinomial distributions]
Problem solved: the covariance matrix is no longer diagonal, so the mixture distribution can capture correlations between the variables, unlike a single multivariate Bernoulli distribution.
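A minimal numpy sketch (with made-up component parameters) showing that the mixture covariance picks up non-zero off-diagonal terms:

```python
import numpy as np

# Two hypothetical mixture components over D = 3 binary pixels
phi = np.array([0.6, 0.4])                      # mixing proportions
mu = np.array([[0.9, 0.8, 0.1],                 # component 1 means
               [0.2, 0.1, 0.7]])                # component 2 means

# Mean of the mixture: E[x] = sum_k phi_k mu_k
mean = phi @ mu

# Covariance: sum_k phi_k (Sigma_k + mu_k mu_k^T) - E[x] E[x]^T,
# where Sigma_k = diag{mu_kd (1 - mu_kd)}
cov = sum(
    p * (np.diag(m * (1 - m)) + np.outer(m, m)) for p, m in zip(phi, mu)
) - np.outer(mean, mean)

print(mean)
print(cov)   # off-diagonal entries are non-zero: dimensions are correlated
```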
Bernoulli Mixture Calculation (EM):
Derive the EM algorithm from the log likelihood below for a mixture of Bernoulli distributions.
The complete-data likelihood:
TRICK: $\ln\big(\sum_{k=1}^{K} z_j^k (\cdot)\big) = \sum_{k=1}^{K} z_j^k \ln(\cdot)$, since only one $z^k = 1$ and the rest are 0
The complete-data log likelihood:
Take the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables $\mathbf{z}$:
where $\gamma(z_{jk}) = E[z_{jk}]$
E step:
Consider the sum over $j$ in the $E_z$ equation; the responsibilities enter only through two terms:
M step: maximize the expected complete-data log likelihood with respect to $\mu_k$ and $\phi$ by setting the derivative w.r.t. $\mu_k$ to zero. *That's why the lengthy $E_z[\ln p(x, z \mid \mu, \phi)]$ needs to be calculated.*
Obtain:
w.r.t. $\phi$: use a Lagrange multiplier to enforce the constraint $\sum_k \phi_k = 1$.
Obtain:
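A sketch of the standard results for the Bernoulli mixture (the weighted sample mean $\bar{x}_k$ is a name introduced here for convenience):

E step (responsibilities):
$$\gamma(z_{jk}) = \frac{\phi_k \prod_{d=1}^{D} \mu_{kd}^{x_{jd}} (1 - \mu_{kd})^{1 - x_{jd}}}{\sum_{i=1}^{K} \phi_i \prod_{d=1}^{D} \mu_{id}^{x_{jd}} (1 - \mu_{id})^{1 - x_{jd}}}$$

M step, with $N_k = \sum_{j=1}^{n} \gamma(z_{jk})$:
$$\mu_k = \bar{x}_k = \frac{1}{N_k} \sum_{j=1}^{n} \gamma(z_{jk}) \, x_j, \qquad \phi_k = \frac{N_k}{n}$$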
Extend to Multinomial Mixture Model:
Consider a D-dimensional variable $x$, each of whose components $d$ is itself a multinomial variable of degree $M$, so that $x$ is a binary vector with components $x_{dm}$, where $d = 1, \ldots, D$ and $m = 1, \ldots, M$, subject to the constraint that $\sum_m x_{dm} = 1$ for all $d$. Suppose that the distribution of these variables is described by a mixture of such discrete multinomial distributions, so that
where
Note:
$x = [\mathbf{x_1}, \ldots, \mathbf{x_D}]$ and $\mathbf{x_d} = [x_{d1}, \ldots, x_{dM}]^T$
$\mu_{kdm} = p(x_{dm} = 1 \mid \mu_k)$
$\mu_k = [\mu_{k1}, \ldots, \mu_{kD}]$ and $\mu_{kd} = [\mu_{kd1}, \ldots, \mu_{kdM}]$, constrained by
$0 \le \mu_{kdm} \le 1$ and $\sum_m \mu_{kdm} = 1$ (to ensure a multinomial distribution for each component and dimension)
Derive the mixing coefficients $\phi_k$ and the component parameters $\mu_{kdm}$:
- Introduce a latent variable $z_n$ corresponding to each observation.
$\color{Red}{\text{TODO: derivation not understood}}$
E step
M step
For clustering tasks, the vector $\gamma(z_n) = [\gamma(z_{n1}), \ldots, \gamma(z_{nK})]$ is used.
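A sketch of the standard updates in this notation (following the same pattern as the Bernoulli case above):

E step:
$$\gamma(z_{nk}) = \frac{\phi_k \prod_{d=1}^{D} \prod_{m=1}^{M} \mu_{kdm}^{x_{ndm}}}{\sum_{i=1}^{K} \phi_i \prod_{d=1}^{D} \prod_{m=1}^{M} \mu_{idm}^{x_{ndm}}}$$

M step, with $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$:
$$\mu_{kdm} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_{ndm}, \qquad \phi_k = \frac{N_k}{N}$$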
Clustering with Both Categorical and Numeric Values
The difference in EM lies in $\theta$ and $p(x_n \mid \theta_k)$.
General Form:
E-step
Find $\gamma(z_{nk}) = \frac{\phi_k \, p(x_n \mid \theta_k)}{\sum_{i=1}^{K} \phi_i \, p(x_n \mid \theta_i)}$
M-step
Estimate the parameters $\theta$ using maximum likelihood estimation
Details
E-step
General:
Assume there is no dependency between the numeric columns and the categorical columns.
For Gaussian distributions treated independently in each dimension:
If the numeric dimensions are modeled by multivariate Gaussian Distribution:
M-step
Gaussian:
Multinomial:
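Under the no-dependency assumption above, a sketch of the component density when $x_n$ has both numeric and categorical columns (the split into a numeric part $x_n^{num}$ and one-hot categorical parts $x_{ndm}^{cat}$ is notation introduced here for illustration):

$$p(x_n \mid \theta_k) = \mathcal{N}\!\left(x_n^{num} \mid \mu_k, \Sigma_k\right) \prod_{d=1}^{D} \prod_{m=1}^{M} \mu_{kdm}^{x_{ndm}^{cat}}$$

In the M-step, the Gaussian parameters $(\mu_k, \Sigma_k)$ and the multinomial parameters $\mu_{kdm}$ are each re-estimated from the weighted samples exactly as in the pure Gaussian and multinomial mixtures above, using the same responsibilities $\gamma(z_{nk})$.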
For variance and covariance, see http://wcms.inf.ed.ac.uk/ipab/rlsc/lecture-notes/RLSC-Lec3.pdf
Image source (perhaps): https://pdfs.semanticscholar.org/6b8e/8ffc6d9ef96fe61bcc92152e1f63ed4c0d59.pdf
4. How to Code Mixture Models?
Python: the pomegranate library
R: mixtools (mixture of multinomials)
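The libraries above are the ones named in these notes; as a library-agnostic alternative, here is a minimal sketch using scikit-learn's GaussianMixture (scikit-learn is not mentioned above, so treat this as just one option):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian clusters in 2 dimensions (made-up parameters)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.8, size=(200, 2)),
])

# Fit a K = 2 component Gaussian mixture with EM
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

print(gmm.weights_)            # mixing proportions phi_k
print(gmm.means_)              # component means mu_k
resp = gmm.predict_proba(X)    # responsibilities gamma(z_j^k) for each sample
labels = gmm.predict(X)        # hard cluster assignments
```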
5. Extensions of Mixture Models
Conjugate Prior: to get a reasonable prior $p(\theta)$.
Introducing a prior distribution is similar to getting more valid observations of $x$; maximize the prior probability w.r.t. $\theta$.
Then maximize the posterior probability of $\theta$.
$p(\theta \mid x)$ is in the same distribution family as $p(\theta)$ (a Beta-Bernoulli example is sketched after the table below).

Distribution | Conjugate Prior | Process Involved
---|---|---
Bernoulli | Beta |
Multinomial | Dirichlet | MCMC
Gaussian | Gaussian (for the mean); Normal-inverse-Wishart (mean and covariance) |
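A concrete instance of the Bernoulli row (standard Beta-Bernoulli conjugacy; $a, b$ are the prior hyperparameters):

$$p(\theta) = \text{Beta}(\theta \mid a, b) \propto \theta^{a-1} (1 - \theta)^{b-1}$$

$$p(\theta \mid x_1, \ldots, x_n) \propto \left(\prod_{j=1}^{n} \theta^{x_j} (1 - \theta)^{1 - x_j}\right) \theta^{a-1} (1 - \theta)^{b-1} \;\Rightarrow\; \theta \mid x \sim \text{Beta}\!\left(a + \sum_j x_j, \; b + n - \sum_j x_j\right)$$

The posterior stays in the Beta family, which is exactly the meaning of conjugacy above.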
Resources:
- http://home.iitk.ac.in/~apanda/cs300/5.pdf
- https://www.mrc-bsu.cam.ac.uk/wp-content/uploads/EM_slides.pdf
- https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html
- https://doi.org/10.1007/978-0-387-35768-3
- http://www.rmki.kfki.hu/~banmi/elte/bishop_em.pdf