The ninth assignment for Linear Regression. The assignment is written in R Markdown, a syntax supported by RStudio that makes it easy to combine formulas, plots, and runnable code.
Recommended: click here for the HTML version of the assignment, where you can see the code as well as the plots.
You can also find the PDF version of this assignment on GitHub. Or, if you can get past the firewall, just see below:
# read the data
a
fit<-lm(y~x1+x2,data=dat)
So the regression model is $Y=3995+0.0009192X_1+12.12X_2$
b
par(mfrow=c(1,2))
c
We use about the same scale in the two plots. In the first plot, the scatter of points around the least squares line does not differ much from the scatter around the horizontal line. In the second plot, however, the scatter around the regression line (which is almost vertical at this scale) is much smaller than the scatter around the horizontal line. This tells us that X1 is of little use when X2 is already in the model, while X2 still explains a lot when X1 is present.
So perhaps X1 can be discarded.
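As a rough illustration of how the two added-variable plots discussed above can be built by hand, here is a minimal sketch. The assignment's `dat` is not reproduced here, so the data frame below is simulated stand-in data (an assumption, not the real data), and the helper `av_resid` is a hypothetical name introduced only for this example.

```r
# Simulated stand-in for the assignment's `dat` (NOT the real data)
set.seed(1)
n   <- 52
dat <- data.frame(x1 = rnorm(n, 300000, 50000), x2 = rnorm(n, 7, 1))
dat$y <- 4000 + 0.001 * dat$x1 + 12 * dat$x2 + rnorm(n, sd = 150)

# Hypothetical helper: residuals of y and of `target`, each regressed on `other`
av_resid <- function(target, other) {
  list(ey = resid(lm(reformulate(other, "y"), data = dat)),
       ex = resid(lm(reformulate(other, target), data = dat)))
}
av1 <- av_resid("x1", "x2")   # e(Y|X2) against e(X1|X2)
av2 <- av_resid("x2", "x1")   # e(Y|X1) against e(X2|X1)

# Same vertical scale in both panels, so the scatters are comparable
ylim <- range(av1$ey, av2$ey)
par(mfrow = c(1, 2))
plot(av1$ex, av1$ey, ylim = ylim, xlab = "e(X1|X2)", ylab = "e(Y|X2)")
abline(lm(ey ~ ex, data = av1)); abline(h = 0, lty = 2)
plot(av2$ex, av2$ey, ylim = ylim, xlab = "e(X2|X1)", ylab = "e(Y|X1)")
abline(lm(ey ~ ex, data = av2)); abline(h = 0, lty = 2)

# The slope in each panel equals the matching coefficient of the full model
full     <- lm(y ~ x1 + x2, data = dat)
slope_x1 <- coef(lm(ey ~ ex, data = av1))[2]
slope_x2 <- coef(lm(ey ~ ex, data = av2))[2]
```

Comparing the scatter around the solid line with the scatter around the dashed horizontal line in each panel is the visual check described in part c.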
d
# The regression functions below are required
Summarizing the results above, the regression function is
$[Y-(4237.47+17.04X_2)]=0.0009192[X_1-(263272+5348X_2)]$
which is equivalent to $Y=3995+0.0009192X_1+12.12X_2$
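The equivalence can be checked by expanding the residual form and collecting terms. The arithmetic below uses the rounded coefficients quoted in this section (with the slope 0.0009192 from the fitted model), so the agreement is only up to rounding:

$$
\begin{aligned}
Y &= 4237.47 + 17.04X_2 + 0.0009192\,[X_1 - 263272 - 5348X_2] \\
  &= (4237.47 - 0.0009192 \times 263272) + 0.0009192X_1 + (17.04 - 0.0009192 \times 5348)X_2 \\
  &\approx (4237.47 - 242.00) + 0.0009192X_1 + (17.04 - 4.92)X_2 \\
  &\approx 3995 + 0.0009192X_1 + 12.12X_2
\end{aligned}
$$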
2
a
newfit<-lm(y~x1+x2+x3,data=dat)
$H_0$: the case is not an outlier; $H_1$: the case is an outlier.
We reject the null hypothesis when the absolute value of the studentized deleted residual is larger than $t(1-\alpha/2n;\,n-p-1)$.
The results are below
n=52
So no case is deemed an outlier.
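The Bonferroni outlier test described above can be sketched as follows. The data are simulated stand-ins for `dat` (not the assignment's data), and the family significance level `alpha = 0.05` is an assumption, since the level used in the assignment is not shown here.

```r
# Simulated stand-in for the assignment's `dat` (NOT the real data)
set.seed(2)
n   <- 52
dat <- data.frame(x1 = rnorm(n, 300000, 50000),
                  x2 = rnorm(n, 7, 1),
                  x3 = rep(c(0, 1), times = c(46, 6)))
dat$y  <- 4000 + 0.001 * dat$x1 + 12 * dat$x2 + 600 * dat$x3 + rnorm(n, sd = 150)
newfit <- lm(y ~ x1 + x2 + x3, data = dat)

alpha <- 0.05                      # assumed family significance level
p     <- length(coef(newfit))      # number of regression parameters (4 here)
t_del <- rstudent(newfit)          # studentized deleted residuals
crit  <- qt(1 - alpha / (2 * n), df = n - p - 1)   # Bonferroni critical value

# Cases flagged as outlying in Y; an empty result means none is flagged
outliers <- which(abs(t_del) > crit)
```

Dividing `alpha` by `2 * n` accounts for testing all `n` cases with a two-sided test, which is why the critical value is much larger than the usual single-test cutoff.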
b
hatd<-hatvalues(newfit)
According to the rule of thumb, cases 4, 16, 21, 22, 43, 44, and 48 are considered outlying with respect to their X values.
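A minimal sketch of a leverage check of this kind follows. It assumes the rule of thumb meant here is the common one that flags a case when its leverage exceeds $2p/n$; the data are simulated stand-ins for `dat`, so the flagged cases will not match the list above.

```r
# Simulated stand-in for the assignment's `dat` (NOT the real data)
set.seed(2)
n   <- 52
dat <- data.frame(x1 = rnorm(n, 300000, 50000),
                  x2 = rnorm(n, 7, 1),
                  x3 = rep(c(0, 1), times = c(46, 6)))
dat$y  <- 4000 + 0.001 * dat$x1 + 12 * dat$x2 + 600 * dat$x3 + rnorm(n, sd = 150)
newfit <- lm(y ~ x1 + x2 + x3, data = dat)

hatd   <- hatvalues(newfit)        # diagonal of the hat matrix
p      <- length(coef(newfit))
cutoff <- 2 * p / n                # assumed rule of thumb: twice the mean leverage

# Cases whose X values are far from the center of the X data
high_leverage <- which(hatd > cutoff)
```

The leverages always sum to `p`, so their mean is `p / n` and the cutoff is simply twice that mean.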
c
attach(dat)
Judging from the scatter plot, this prediction does not seem to involve extrapolation beyond the range of the data.
xnew<-c(300000,7.2,0)
Using (10.29), the conclusion is the same as stated above.
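Assuming (10.29) refers to the leverage of the new point, $h_{new,new}=X_{new}'(X'X)^{-1}X_{new}$, the check for hidden extrapolation can be sketched as below. The data are simulated stand-ins for `dat`, and `leverage_of` is a hypothetical helper. It works from the QR factor of the design matrix rather than inverting $X'X$ directly, a numerical choice made here because `x1` is on a very large scale.

```r
# Simulated stand-in for the assignment's `dat` (NOT the real data)
set.seed(2)
n   <- 52
dat <- data.frame(x1 = rnorm(n, 300000, 50000),
                  x2 = rnorm(n, 7, 1),
                  x3 = rep(c(0, 1), times = c(46, 6)))
dat$y  <- 4000 + 0.001 * dat$x1 + 12 * dat$x2 + 600 * dat$x3 + rnorm(n, sd = 150)
newfit <- lm(y ~ x1 + x2 + x3, data = dat)

X <- model.matrix(newfit)          # design matrix, including the intercept column
R <- qr.R(qr(X))                   # upper-triangular factor with R'R = X'X

# Hypothetical helper: x'(X'X)^{-1}x computed as the squared norm of z, where R'z = x
leverage_of <- function(x) sum(backsolve(R, x, transpose = TRUE)^2)

xnew  <- c(1, 300000, 7.2, 0)      # new point from the text, with a leading 1 for the intercept
h_new <- leverage_of(xnew)

# No hidden extrapolation is suggested when h_new lies within the observed leverages
within_range <- h_new <= max(hatvalues(newfit))
```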
d
test<-cbind(
It can be seen that Cook's distance is well below the 10% quantile of the corresponding F distribution, which is roughly 0.26.
Judging from DFFITS, it seems that cases 43 and 32 are influential.
Judging from DFBETAS, it seems that cases 16, 43, 10, 32, and 40 are influential.
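For reference, DFFITS and DFBETAS can be obtained in base R as sketched below. The data are simulated stand-ins for `dat`, and the cutoff of 1 in absolute value is an assumed guideline (one commonly suggested for small to medium data sets); the thresholds actually used in the assignment are not shown here.

```r
# Simulated stand-in for the assignment's `dat` (NOT the real data)
set.seed(2)
n   <- 52
dat <- data.frame(x1 = rnorm(n, 300000, 50000),
                  x2 = rnorm(n, 7, 1),
                  x3 = rep(c(0, 1), times = c(46, 6)))
dat$y  <- 4000 + 0.001 * dat$x1 + 12 * dat$x2 + 600 * dat$x3 + rnorm(n, sd = 150)
newfit <- lm(y ~ x1 + x2 + x3, data = dat)
p      <- length(coef(newfit))

dff <- dffits(newfit)              # influence of each case on its own fitted value
dfb <- dfbetas(newfit)             # influence of each case on each coefficient (n x p matrix)

# Assumed guideline: flag a case when the absolute value exceeds 1
flag_dffits  <- which(abs(dff) > 1)
flag_dfbetas <- which(apply(abs(dfb) > 1, 1, any))

# The largest values are worth inspecting even when no case crosses the cutoff
top_dffits <- head(sort(abs(dff), decreasing = TRUE), 5)
```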
f
cd<-cooks.distance(newfit)
None of the cases is deemed influential according to this criterion.
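One way to apply a Cook's distance criterion is to convert each distance into a percentile of the $F(p, n-p)$ distribution, as sketched below. The data are simulated stand-ins for `dat`, and the 50% cutoff for "influential" is an assumed guideline, not necessarily the one used in the assignment.

```r
# Simulated stand-in for the assignment's `dat` (NOT the real data)
set.seed(2)
n   <- 52
dat <- data.frame(x1 = rnorm(n, 300000, 50000),
                  x2 = rnorm(n, 7, 1),
                  x3 = rep(c(0, 1), times = c(46, 6)))
dat$y  <- 4000 + 0.001 * dat$x1 + 12 * dat$x2 + 600 * dat$x3 + rnorm(n, sd = 150)
newfit <- lm(y ~ x1 + x2 + x3, data = dat)
p      <- length(coef(newfit))

cd  <- cooks.distance(newfit)
pct <- pf(cd, df1 = p, df2 = n - p)    # percentile of each distance under F(p, n - p)

# Assumed guideline: a percentile near or above 50% suggests major influence
influential <- which(pct >= 0.5)

# The 10% quantile of F(p, n - p), for comparison with the raw distances
q10 <- qf(0.10, df1 = p, df2 = n - p)
```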
3
a
pairs(~cases+percent+holiday+labor,data=dat,
There do not seem to be any significant pairwise linear associations. X2 and X3 seem to have a relatively higher linear correlation.
b
#cases+percent+holiday
So there is no serious multicollinearity here.
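A common way to reach a conclusion like this is through variance inflation factors, $VIF_j = 1/(1-R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors. The sketch below computes them in base R on simulated stand-in data with the same column names (not the assignment's data). The guideline that a largest VIF above 10 signals serious multicollinearity is an assumption about the criterion used.

```r
# Simulated stand-in predictors (NOT the real data)
set.seed(3)
n <- 52
X <- data.frame(cases   = rnorm(n, 300000, 50000),
                percent = rnorm(n, 7, 1),
                holiday = rep(c(0, 1), times = c(46, 6)))

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: the diagonal of the inverse correlation matrix
vif_alt <- diag(solve(cor(X)))

# Assumed guideline: serious multicollinearity if the largest VIF exceeds 10
serious <- max(vif) > 10
```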