Principal component analysis

Abstract

In this post,
we introduce the mathematical foundation behind principal component analysis,
a simple technique for dimensionality reduction.

References

Yang’s post (written in Chinese)

Diagonalizable matrix

The efficiency of features

The high-level idea of principal component analysis is to reduce the dimensionality of a given dataset
by transforming it into a new dataset in which all variables (features) are uncorrelated with each other.
We denote this transformation as follows:

$$
Y = XP
$$

where $X \in \mathbb{R}^{m \times n}$ is the original dataset with $m$ instances and $n$ variables,
$P \in \mathbb{R}^{n \times k}$ is the reduction matrix,
and $Y \in \mathbb{R}^{m \times k}$ is the transformed dataset with $k$ variables ($k < n$).
To better understand how information is stored in our dataset matrix $X$,
we can use the concepts of variance and covariance of the feature variables in the dataset matrix $X$.
Let’s denote $a$ as a variable in the variable vector $x = (x_1, x_2, \ldots, x_n)^{\mathsf{T}}$. The variance of $a$ is:

$$
\mathrm{Var}(a) = \frac{1}{m} \sum_{i=1}^{m} (a_i - \mu_a)^2 \tag{1}
$$

where $a_i$ is the value of $a$ in the $i$-th instance and $\mu_a$ is the mean of $a$.
The magnitude of the variance can be used to indicate the volume of information embedded in this variable.
A variable that contains no information is a constant with zero variance.
Covariance, as another measure, indicates the level of dependency between two variables:

$$
\mathrm{Cov}(a, b) = \frac{1}{m} \sum_{i=1}^{m} (a_i - \mu_a)(b_i - \mu_b) \tag{2}
$$
To represent both variance and covariance in matrix form, we can use the concept of a covariance matrix.
Let’s denote the covariance matrix of the random variable vector $x = (x_1, x_2, \ldots, x_n)^{\mathsf{T}}$ as $C$:

$$
C = \begin{pmatrix}
\mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_n) \\
\mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) & \cdots & \mathrm{Cov}(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(x_n, x_1) & \mathrm{Cov}(x_n, x_2) & \cdots & \mathrm{Var}(x_n)
\end{pmatrix} \tag{3}
$$

where the diagonal elements of $C$ are the variances of the variables,
and the off-diagonal elements are the covariances between pairs of variables.
Note that if we normalize $X$ by centering every variable so that all $\mu_i = 0$, the centered $X$ still contains the same information;
therefore, our covariance matrix can be simplified as follows:

$$
C = \frac{1}{m} X^{\mathsf{T}} X \tag{4}
$$
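As a quick illustration, here is a minimal NumPy sketch (the names `X`, `X_centered`, `C`, and the random toy data are choices made for this example, not from the original post) that centers a dataset and computes the covariance matrix as in (4):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3                      # m instances, n variables

# Toy dataset with rows as instances and columns as variables.
X = rng.standard_normal((m, n))

# Center each variable so that its mean is zero, as assumed by (4).
X_centered = X - X.mean(axis=0)

# Covariance matrix C = (1/m) * X^T X.
C = X_centered.T @ X_centered / m

# The diagonal of C holds the variances; the off-diagonal entries hold the covariances.
print(np.allclose(np.diag(C), X_centered.var(axis=0)))  # True
```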
The objective of principal component analysis

From (3), we know that $C$ is the covariance matrix associated with the original dataset $X$.
For the transformed dataset $Y$,
we can also obtain a covariance matrix $D$ using the same procedure as (4).
Interestingly, $D$ can be rewritten using only $P$ and $C$:

$$
D = \frac{1}{m} Y^{\mathsf{T}} Y = \frac{1}{m} (XP)^{\mathsf{T}} (XP) = P^{\mathsf{T}} \left( \frac{1}{m} X^{\mathsf{T}} X \right) P = P^{\mathsf{T}} C P \tag{5}
$$
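As a sanity check on (5), here is a small self-contained sketch (again with illustrative names and random data) verifying that the covariance matrix of $Y = XP$ equals $P^{\mathsf{T}} C P$ for an arbitrary reduction matrix $P$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 5, 2

# Random centered dataset and an arbitrary n-by-k reduction matrix P.
X = rng.standard_normal((m, n))
X = X - X.mean(axis=0)
P = rng.standard_normal((n, k))

C = X.T @ X / m        # covariance matrix of X, as in (4)
Y = X @ P              # transformed dataset (still centered)
D = Y.T @ Y / m        # covariance matrix of Y

# Identity (5): D = P^T C P.
print(np.allclose(D, P.T @ C @ P))  # True
```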
Since $D$ is a covariance matrix that contains the variances and covariances of the new feature variable vector,
our objective is to find a $P$ such that any two random variables of this new feature variable vector have zero covariance.
In addition, because $Y$ has a smaller dimension than $X$,
we also want this new feature variable vector to retain as much information (i.e., variance) as possible from the original dataset.
Given real values $\lambda_1, \lambda_2, \ldots, \lambda_k$ such that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$,
we want $D$ to be a diagonal covariance matrix:

$$
D = P^{\mathsf{T}} C P = \begin{pmatrix}
\lambda_1 & & & \\
& \lambda_2 & & \\
& & \ddots & \\
& & & \lambda_k
\end{pmatrix} \tag{6}
$$
This process of finding a $P$ that turns $D$ into a diagonal matrix is called diagonalization in linear algebra (see Diagonalizable matrix).
It involves the eigendecomposition of the matrix $C$, which happens to be a real symmetric matrix in this case.
So if we apply eigendecomposition on $C$, we get:

$$
C = E \Lambda E^{\mathsf{T}} \tag{7}
$$

where $E$ is an orthogonal matrix whose columns are the eigenvectors of $C$,
and $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ holds the corresponding eigenvalues in descending order.
Taking $P$ to be the first $k$ columns of $E$ (the eigenvectors associated with the $k$ largest eigenvalues) then gives
$D = P^{\mathsf{T}} C P = \operatorname{diag}(\lambda_1, \ldots, \lambda_k)$, which is exactly the diagonal form in (6).
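Below is a minimal NumPy sketch of this eigendecomposition route (the function name `pca_eig` and all variable names are illustrative, not from the original post):

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix, following (4)-(7)."""
    m, n = X.shape
    X = X - X.mean(axis=0)                 # center each variable
    C = X.T @ X / m                        # covariance matrix (4)
    eigvals, E = np.linalg.eigh(C)         # eigendecomposition of the symmetric C (7)
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues in descending order
    eigvals, E = eigvals[order], E[:, order]
    P = E[:, :k]                           # reduction matrix: top-k eigenvectors
    return X @ P, P, eigvals[:k]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
Y, P, lam = pca_eig(X, k=2)

# D = P^T C P is (numerically) diagonal, with the top-k eigenvalues on its diagonal.
D = Y.T @ Y / X.shape[0]
print(np.allclose(D, np.diag(lam)))  # True
```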

Relationship with SVD

It’s also common to solve the PCA problem with singular value decomposition (SVD). First, let’s apply SVD on $X$:

$$
X = U \Sigma V^{\mathsf{T}} \tag{8}
$$

For $X^{\mathsf{T}} X$, we have:

$$
X^{\mathsf{T}} X = V \Sigma^{\mathsf{T}} U^{\mathsf{T}} U \Sigma V^{\mathsf{T}} = V \left( \Sigma^{\mathsf{T}} \Sigma \right) V^{\mathsf{T}} \tag{9}
$$

Given (7) and (9), it’s not difficult to see that $E = V$ and $\Lambda = c \, \Sigma^{\mathsf{T}} \Sigma$, where $c = 1/m$ is a constant.
This means we can solve PCA by only applying SVD on $X$, without calculating $C$.
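To make this concrete, here is a short sketch (illustrative names, random data) checking numerically that the SVD route agrees with (7): the right singular vectors match the eigenvectors of $C$ up to sign, and $\lambda_i = \sigma_i^2 / m$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 6, 2

X = rng.standard_normal((m, n))
X = X - X.mean(axis=0)                     # center each variable

# Eigendecomposition route, as in (7).
C = X.T @ X / m
eigvals, E = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]

# SVD route, as in (8): no covariance matrix needed.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Singular values relate to eigenvalues by lambda_i = sigma_i^2 / m, per (9).
print(np.allclose(s**2 / m, eigvals))      # True

# Columns of V match the eigenvectors of C up to a sign flip
# (assuming the eigenvalues are distinct, which holds for random data).
print(np.allclose(np.abs(V), np.abs(E)))   # True

# Project onto the first k principal components using V directly.
Y = X @ V[:, :k]
```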