Obtaining the number of components from cross validation of principal components regression

Principal components (PC) regression is a common dimensionality reduction technique in supervised learning. The R lab for PC regression in James et al.'s Introduction to Statistical Learning is a popular introduction to performing PC regression in R: it appears on pp. 256-257 of the book (pp. 270-271 of the PDF).

As in the lab, the code below runs PC regression on the Hitters data to predict Salary:
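Along the lines of the lab (Hitters comes from the ISLR package, and pcr from the pls package):

    library(ISLR)   # for the Hitters data
    library(pls)    # for pcr(), validationplot() and RMSEP()

    Hitters <- na.omit(Hitters)   # drop rows with missing Salary, as in the lab

    set.seed(2)
    pcr.fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")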

The result of the fit can be seen by calling summary(pcr.fit):

[summary(pcr.fit) output: CV and adjCV RMSEP for 0 through 19 components, plus the percentage of variance explained; taken from ISLR Section 6.7.1]

When predicting the response on test data, we would choose the number of components based on which model gave the smallest cross validation (CV) error. What is strange in this lab is that they do not pull out the "optimal" number of components via code! Instead, they make a plot of the CV error, visually identify the number of components giving the smallest CV error (7 in this case), then predict on the test set with 7 hardcoded. Here is their code:
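Reconstructed from the lab (the train/test split and the x and y objects are set up earlier in ISLR's Chapter 6 labs):

    set.seed(1)
    train <- sample(1:nrow(Hitters), nrow(Hitters) / 2)
    test <- (-train)
    x <- model.matrix(Salary ~ ., Hitters)[, -1]
    y <- Hitters$Salary
    y.test <- y[test]

    set.seed(1)
    pcr.fit <- pcr(Salary ~ ., data = Hitters, subset = train, scale = TRUE,
                   validation = "CV")
    validationplot(pcr.fit, val.type = "MSEP")

    # 7 is hardcoded here after eyeballing the validation plot
    pcr.pred <- predict(pcr.fit, x[test, ], ncomp = 7)
    mean((pcr.pred - y.test)^2)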

This method of identifying the number of components with the smallest CV error is not ideal. Here is the output of validationplot:

[Figure: validation plot of cross-validated MSEP against the number of components]
I think it’s actually quite hard to tell by eye that 7 principal components is the best choice! In any case, we might want a programmatic way of finding the best number of components (according to CV) so that we don’t have to stop our data modeling pipeline to make a visual check.

It turns out that we can use the RMSEP function to pull out the CV and adjCV table that we saw in summary(pcr.fit):
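For instance (RMSEP is exported by the pls package, and its print method displays the table):

    RMSEP(pcr.fit)   # prints the CV and adjCV rows we saw in the summary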

To get at the individual values in the table, we need to access the val component of the returned object and subset it appropriately:
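A sketch of the subsetting (val is a 3-dimensional array indexed by estimate type, response and model size):

    cverr <- RMSEP(pcr.fit)$val[1, , ]   # first slice holds the CV estimates
    ncomp.best <- which.min(cverr) - 1   # first entry is the 0-component model
    ncomp.best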

Note that we have to subtract 1, since the model with no principal components (intercept only) is included in the output of RMSEP(pcr.fit) as well. If we wish to get the best number of PCs based on adjusted CV instead, we could amend that first line of code to RMSEP(pcr.fit)$val[2, , ], since the second slice of val holds the adjCV estimates.
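With that in hand, the hardcoded 7 in the prediction step can be replaced programmatically, reusing the x, test and y.test objects from the lab code above:

    pcr.pred <- predict(pcr.fit, x[test, ], ncomp = ncomp.best)
    mean((pcr.pred - y.test)^2)   # test-set MSE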
