[Notes on Mathematics for ESL] Chapter 2: Overview of Supervised Learning

2.4 Statistical Decision Theory

Derivation of Equation (2.16)

The derivation starts from the expected prediction error (EPE) under squared error loss.
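
Written out (in ESL's notation, with the linear model $f(x)=x^T\beta$ plugged in and the expectation taken over the joint distribution of $X$ and $Y$):

$$\mathrm{EPE}(\beta)=\mathrm{E}\,(Y-X^T\beta)^2=\int (y-x^T\beta)^2\,\mathrm{Pr}(dx,dy).$$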

Next, we take the derivative with respect to $\beta$.
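
Differentiating under the expectation (a step I take for granted here), and treating $X$ as a column vector:

$$\frac{\partial\,\mathrm{EPE}(\beta)}{\partial\beta}=\mathrm{E}\bigl[-2X(Y-X^T\beta)\bigr]=-2\bigl(\mathrm{E}(XY)-\mathrm{E}(XX^T)\beta\bigr).$$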

To minimize the EPE, we set the derivative equal to zero, which gives Equation (2.16).
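
That is, assuming $\mathrm{E}(XX^T)$ is invertible:

$$\mathrm{E}(XY)-\mathrm{E}(XX^T)\beta=0 \quad\Longrightarrow\quad \beta=\bigl[\mathrm{E}(XX^T)\bigr]^{-1}\mathrm{E}(XY).$$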

Note: $x^T\beta$ is a scalar, and $\beta$ is a constant.

2.5 Local Methods in High Dimensions

Intuition on Equation (2.24)

There are $N$ $p$-dimensional data points $x_1,\dots,x_N$, that is, $N\times p$ dimensions in total. Let $r_i=\Vert x_i\Vert$. Without loss of generality, we assume that $A < r_1 < \dots < r_N < 1$. Let $U(A)$ be the region of all possible sampled data which meet this assumption.
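
In symbols, assuming the $x_i$ are sampled independently and uniformly in the unit ball, I read $U(A)$ as the normalized volume of that region, i.e. the probability that every point lies farther than $A$ from the origin:

$$U(A)=\mathrm{Pr}\bigl(\Vert x_1\Vert>A,\dots,\Vert x_N\Vert>A\bigr).$$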

The goal is to find $A$ such that $U(A)=\frac{1}{2}U(0)$. It turns out to be an integration problem on an $N\times p$ dimensional space.

With some mathematical techniques (which I find overwhelming), we can get $U(A)=(1-A^p)^N$. In particular, $U(0)=1$. Solving $(1-A^p)^N=1/2$, we obtain Equation (2.24).
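
Spelling out the last step (the factor $1-A^p$ is $\mathrm{Pr}(\Vert x_i\Vert>A)$ for a single point uniform in the unit ball, since $\mathrm{Pr}(\Vert x_i\Vert\le A)=A^p$, and independence across the $N$ points gives the power $N$):

$$(1-A^p)^N=\frac{1}{2}\quad\Longleftrightarrow\quad A=\Bigl(1-\bigl(\tfrac{1}{2}\bigr)^{1/N}\Bigr)^{1/p}=d(p,N),$$

the median distance from the origin to the closest data point.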

Derivation of Equations (2.27) and (2.28)

The variation is over all training sets $\mathcal{T}$, and over all values of $y_0$, while keeping $x_0$ fixed. Note that $x_0$ and $y_0$ are chosen independently of $\mathcal{T}$, and so the expectations commute:
$\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}=\mathrm{E}_{\mathcal{T}}\mathrm{E}_{y_0\vert x_0}$.
Also $\mathrm{E}_{\mathcal{T}}=\mathrm{E}_{\mathcal{X}}\mathrm{E}_{\mathcal{Y}\vert\mathcal{X}}$.
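
To be explicit about the quantity being computed (my paraphrase of ESL's setup), the expected prediction error at the test point $x_0$ is

$$\mathrm{EPE}(x_0)=\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}\bigl(y_0-\hat y_0\bigr)^2.$$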

To make the derivation more comprehensible, here are some definitions.
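
These are my own reconstruction, consistent with the linear model $y=x^T\beta+\varepsilon$, $\mathrm{E}(\varepsilon)=0$, $\mathrm{Var}(\varepsilon)=\sigma^2$, and the least squares fit $\hat\beta=(X^TX)^{-1}X^T\mathbf{y}$ (with $\mathbf{y}$ the vector of training responses) on the training set $\mathcal{T}$:

$$\hat y_0=x_0^T\hat\beta=x_0^T\beta+\sum_{i=1}^N l_i(x_0)\varepsilon_i,$$

where $l_i(x_0)$ is the $i$-th element of $X(X^TX)^{-1}x_0$, $\varepsilon_i$ is the noise in the $i$-th training response, and the test response is $y_0=x_0^T\beta+\varepsilon$ with $\varepsilon$ independent of $\mathcal{T}$.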

$y_0-\hat y_0$ can be written as the sum of three terms.
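
Specifically, with the grouping below (the names $U_1$, $U_2$, $U_3$ match the discussion that follows):

$$y_0-\hat y_0=\underbrace{(y_0-x_0^T\beta)}_{U_1}-\underbrace{(\hat y_0-\mathrm{E}_{\mathcal{T}}\hat y_0)}_{U_2}-\underbrace{(\mathrm{E}_{\mathcal{T}}\hat y_0-x_0^T\beta)}_{U_3}.$$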

Following the above definitions, we have $U_1=\varepsilon$ and $U_3=0$. In addition, clearly $\mathrm{E}_{\mathcal{T}}U_2=0$. When squaring $U_1-U_2-U_3$ and taking expectations, we can eliminate all three cross terms and the squared term $U_3^2$.
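
Putting this together (the cross terms vanish either because $U_3=0$ or because the test noise $\varepsilon$ is independent of $\mathcal{T}$ and has mean zero):

$$\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}\bigl(y_0-\hat y_0\bigr)^2=\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}U_1^2+\mathrm{E}_{\mathcal{T}}U_2^2.$$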

Following the definition of variance, we have $\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}U_1^2=\mathrm{Var}(\varepsilon)=\sigma^2$ and $\mathrm{E}_{\mathcal{T}}(\hat y_0 - \mathrm{E}_{\mathcal{T}}\hat y_0)^2=\mathrm{Var}_{\mathcal{T}}(\hat y_0)$.

Since $U_2=\sum_{i=1}^N l_i(x_0)\varepsilon_i$, we can compute $\mathrm{Var}_{\mathcal{T}}(\hat y_0)=\mathrm{E}_{\mathcal{T}}U_2^2$ in matrix form.
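
With a slight abuse of notation (as in the next line, $\varepsilon$ here denotes the training noise vector $(\varepsilon_1,\dots,\varepsilon_N)^T$, so that $U_2=x_0^T(X^TX)^{-1}X^T\varepsilon$):

$$\mathrm{E}_{\mathcal{T}}U_2^2=\mathrm{E}_{\mathcal{T}}\Bigl[x_0^T(X^TX)^{-1}X^T\,\varepsilon\varepsilon^T\,X(X^TX)^{-1}x_0\Bigr].$$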

Since $\mathrm{E}_{\mathcal{T}}\varepsilon\varepsilon^T=\sigma^2 I_N$, this is equal to $\mathrm{E}_{\mathcal{T}}\,x_0^T(X^TX)^{-1}x_0\,\sigma^2$. This completes the derivation of Equation (2.27).
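
In my reconstruction, the equation just derived reads:

$$\mathrm{EPE}(x_0)=\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}\bigl(y_0-\hat y_0\bigr)^2=\sigma^2+\mathrm{E}_{\mathcal{T}}\,x_0^T(X^TX)^{-1}x_0\,\sigma^2.$$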

Under the conditions stated by the authors ($N$ large, $\mathcal{T}$ selected at random, and $\mathrm{E}(X)=0$), $X^TX/N$ is then approximately equal to $\mathrm{Cov}(X)=\mathrm{Cov}(x_0)$. Applying $\mathrm{E}_{x_0}$ to $\mathrm{E}_{\mathcal{T}}\,x_0^T(X^TX)^{-1}x_0\,\sigma^2$, we obtain Equation (2.28), approximately.
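
Using the trace trick $\mathrm{E}_{x_0}\,x_0^T M x_0=\mathrm{trace}\bigl(M\,\mathrm{Cov}(x_0)\bigr)$, valid when $\mathrm{E}(x_0)=0$:

$$\mathrm{E}_{x_0}\mathrm{EPE}(x_0)\approx \mathrm{E}_{x_0}\,x_0^T\mathrm{Cov}(X)^{-1}x_0\,\frac{\sigma^2}{N}+\sigma^2 =\mathrm{trace}\bigl[\mathrm{Cov}(X)^{-1}\mathrm{Cov}(x_0)\bigr]\frac{\sigma^2}{N}+\sigma^2 =\sigma^2\frac{p}{N}+\sigma^2.$$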

This completes the derivation of Equation (2.28).

References