[Notes on Mathematics for ESL] Chapter 2: Overview of Supervised Learning

2.4 Statistical Decision Theory

Derivation of Equation (2.16)

The derivation starts from the expected prediction error (EPE) under squared error loss.
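
Written out (in ESL's notation, with the linear model $f(x)=x^T\beta$ plugged in and the expectation taken over the joint distribution of $X$ and $Y$):

$$\mathrm{EPE}(\beta)=\mathrm{E}\,(Y-X^T\beta)^2=\int (y-x^T\beta)^2\,\mathrm{Pr}(dx,dy).$$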

Next, we take the derivative with respect to $\beta$.
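
Differentiating under the expectation (a step I take for granted here), and treating $X$ as a column vector:

$$\frac{\partial\,\mathrm{EPE}(\beta)}{\partial\beta}=\mathrm{E}\bigl[-2X(Y-X^T\beta)\bigr]=-2\bigl(\mathrm{E}(XY)-\mathrm{E}(XX^T)\beta\bigr).$$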

To minimize the EPE, we set the derivative equal to zero, which gives Equation (2.16).
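
That is, assuming $\mathrm{E}(XX^T)$ is invertible:

$$\mathrm{E}(XY)-\mathrm{E}(XX^T)\beta=0 \quad\Longrightarrow\quad \beta=\bigl[\mathrm{E}(XX^T)\bigr]^{-1}\mathrm{E}(XY).$$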

Note: $x^T\beta$ is a scalar, and $\beta$ is a constant.

2.5 Local Methods in High Dimensions

Intuition on Equation (2.24)

There are $N$ $p$-dimensional data points $x_1,\dots,x_N$, that is, $N\times p$ dimensions in total. Let $r_i=\Vert x_i\Vert$. Without loss of generality, we assume that $A < r_1 < \dots < r_N < 1$. Let $U(A)$ be the region of all possible sampled data which meet this assumption.
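
In symbols, assuming the $x_i$ are sampled independently and uniformly in the unit ball, I read $U(A)$ as the normalized volume of that region, i.e. the probability that every point lies farther than $A$ from the origin:

$$U(A)=\mathrm{Pr}\bigl(\Vert x_1\Vert>A,\dots,\Vert x_N\Vert>A\bigr).$$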

The goal is to find $A$ such that $U(A)=\frac{1}{2}U(0)$. It turns out to be an integration problem on an $N\times p$ dimensional space.

With some mathematical techniques (which I find overwhelming), we can get $U(A)=(1-A^p)^N$. In particular, $U(0)=1$. Solving $(1-A^p)^N=1/2$, we obtain Equation (2.24).
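
Spelling out the last step (the factor $1-A^p$ is $\mathrm{Pr}(\Vert x_i\Vert>A)$ for a single point uniform in the unit ball, since $\mathrm{Pr}(\Vert x_i\Vert\le A)=A^p$, and independence across the $N$ points gives the power $N$):

$$(1-A^p)^N=\frac{1}{2}\quad\Longleftrightarrow\quad A=\Bigl(1-\bigl(\tfrac{1}{2}\bigr)^{1/N}\Bigr)^{1/p}=d(p,N),$$

the median distance from the origin to the closest data point.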

Derivation of Equations (2.27) and (2.28)

The variation is over all training sets $\mathcal{T}$, and over all values of $y_0$, while keeping $x_0$ fixed. Note that $x_0$ and $y_0$ are chosen independently of $\mathcal{T}$, and so the expectations commute:
$\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}=\mathrm{E}_{\mathcal{T}}\mathrm{E}_{y_0\vert x_0}$.
Also $\mathrm{E}_{\mathcal{T}}=\mathrm{E}_{\mathcal{X}}\mathrm{E}_{\mathcal{Y}\vert\mathcal{X}}$.
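
To be explicit about the quantity being computed (my paraphrase of ESL's setup), the expected prediction error at the test point $x_0$ is

$$\mathrm{EPE}(x_0)=\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}\bigl(y_0-\hat y_0\bigr)^2.$$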

To make the derivation more comprehensible, here are some definitions.
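
These are my own reconstruction, consistent with the linear model $y=x^T\beta+\varepsilon$, $\mathrm{E}(\varepsilon)=0$, $\mathrm{Var}(\varepsilon)=\sigma^2$, and the least squares fit $\hat\beta=(X^TX)^{-1}X^T\mathbf{y}$ (with $\mathbf{y}$ the vector of training responses) on the training set $\mathcal{T}$:

$$\hat y_0=x_0^T\hat\beta=x_0^T\beta+\sum_{i=1}^N l_i(x_0)\varepsilon_i,$$

where $l_i(x_0)$ is the $i$-th element of $X(X^TX)^{-1}x_0$, $\varepsilon_i$ is the noise in the $i$-th training response, and the test response is $y_0=x_0^T\beta+\varepsilon$ with $\varepsilon$ independent of $\mathcal{T}$.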

$y_0-\hat y_0$ can be written as the sum of three terms.
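
Specifically, with the grouping below (the names $U_1$, $U_2$, $U_3$ match the discussion that follows):

$$y_0-\hat y_0=\underbrace{(y_0-x_0^T\beta)}_{U_1}-\underbrace{(\hat y_0-\mathrm{E}_{\mathcal{T}}\hat y_0)}_{U_2}-\underbrace{(\mathrm{E}_{\mathcal{T}}\hat y_0-x_0^T\beta)}_{U_3}.$$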

Following the above definitions, we have $U_1=\varepsilon$ and $U_3=0$. In addition, clearly $\mathrm{E}_{\mathcal{T}}U_2=0$. When squaring $U_1-U_2-U_3$ and taking expectations, we can eliminate all three cross terms and the squared term $U_3^2$.
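
Putting this together (the cross terms vanish either because $U_3=0$ or because the test noise $\varepsilon$ is independent of $\mathcal{T}$ and has mean zero):

$$\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}\bigl(y_0-\hat y_0\bigr)^2=\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}U_1^2+\mathrm{E}_{\mathcal{T}}U_2^2.$$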

Following the definition of variance, we have $\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}U_1^2=\mathrm{Var}(\varepsilon)=\sigma^2$ and $\mathrm{E}_{\mathcal{T}}(\hat y_0 - \mathrm{E}_{\mathcal{T}}\hat y_0)^2=\mathrm{Var}_{\mathcal{T}}(\hat y_0)$.

Since $U_2=\sum_{i=1}^N l_i(x_0)\varepsilon_i$, we can compute $\mathrm{Var}_{\mathcal{T}}(\hat y_0)=\mathrm{E}_{\mathcal{T}}U_2^2$ in matrix form.
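
With a slight abuse of notation (as in the next line, $\varepsilon$ here denotes the training noise vector $(\varepsilon_1,\dots,\varepsilon_N)^T$, so that $U_2=x_0^T(X^TX)^{-1}X^T\varepsilon$):

$$\mathrm{E}_{\mathcal{T}}U_2^2=\mathrm{E}_{\mathcal{T}}\Bigl[x_0^T(X^TX)^{-1}X^T\,\varepsilon\varepsilon^T\,X(X^TX)^{-1}x_0\Bigr].$$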

Since $\mathrm{E}_{\mathcal{T}}\varepsilon\varepsilon^T=\sigma^2 I_N$, this is equal to $\mathrm{E}_{\mathcal{T}}\,x_0^T(X^TX)^{-1}x_0\,\sigma^2$. This completes the derivation of Equation (2.27).
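
In my reconstruction, the equation just derived reads:

$$\mathrm{EPE}(x_0)=\mathrm{E}_{y_0\vert x_0}\mathrm{E}_{\mathcal{T}}\bigl(y_0-\hat y_0\bigr)^2=\sigma^2+\mathrm{E}_{\mathcal{T}}\,x_0^T(X^TX)^{-1}x_0\,\sigma^2.$$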

Under the conditions stated by the authors ($N$ large, $\mathcal{T}$ selected at random, and $\mathrm{E}(X)=0$), $X^TX/N$ is then approximately equal to $\mathrm{Cov}(X)=\mathrm{Cov}(x_0)$. Applying $\mathrm{E}_{x_0}$ to $\mathrm{E}_{\mathcal{T}}\,x_0^T(X^TX)^{-1}x_0\,\sigma^2$, we obtain Equation (2.28), approximately.
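
Using the trace trick $\mathrm{E}_{x_0}\,x_0^T M x_0=\mathrm{trace}\bigl(M\,\mathrm{Cov}(x_0)\bigr)$, valid when $\mathrm{E}(x_0)=0$:

$$\mathrm{E}_{x_0}\mathrm{EPE}(x_0)\approx \mathrm{E}_{x_0}\,x_0^T\mathrm{Cov}(X)^{-1}x_0\,\frac{\sigma^2}{N}+\sigma^2 =\mathrm{trace}\bigl[\mathrm{Cov}(X)^{-1}\mathrm{Cov}(x_0)\bigr]\frac{\sigma^2}{N}+\sigma^2 =\sigma^2\frac{p}{N}+\sigma^2.$$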

This completes the derivation of Equation (2.28).

References