Linear Regression Empirical Analysis in R

  1. Getting to Know the Data
  2. Relationship Between Price and Mpg
  3. Linear Regression
  4. Conclusion

Today we will use R to estimate the linear relationship between the price of a car and its mpg (miles per gallon).

Getting to Know the Data

First, we import auto.dta and view the first six observations to check that the data loaded correctly.

library(haven)                 # haven provides read_dta() for Stata .dta files
auto <- read_dta("auto.dta")
head(auto)
## # A tibble: 6 x 12
##   make  price   mpg rep78 headroom trunk weight length  turn displacement
##   <chr> <dbl> <dbl> <dbl>    <dbl> <dbl>  <dbl>  <dbl> <dbl>        <dbl>
## 1 AMC ~  4099    22     3      2.5    11   2930    186    40          121
## 2 AMC ~  4749    17     3      3      11   3350    173    40          258
## 3 AMC ~  3799    22    NA      3      12   2640    168    35          121
## 4 Buic~  4816    20     3      4.5    16   3250    196    40          196
## 5 Buic~  7827    15     4      4      20   4080    222    43          350
## 6 Buic~  5788    18     3      4      21   3670    218    43          231
## # ... with 2 more variables: gear_ratio <dbl>, foreign <dbl+lbl>
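
Before looking at the relationship itself, it can help to glance at simple summary statistics for the two variables of interest. A minimal sketch (output omitted):

summary(auto[, c("price", "mpg")])   # min, quartiles, mean and max of price and mpg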

Relationship Between Price and Mpg

We calculate the correlation between price and mpg. Furthermore, we can draw a scatter plot to visualize their relationship.

cor(auto$price, auto$mpg)
## [1] -0.4685967
scatter.smooth(x=auto$mpg, y=auto$price, main="price ~ mpg", xlab="mpg",
     ylab="price")

[Figure: scatter plot of price against mpg with a smoothed fit]
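
The correlation of about -0.47 indicates a moderate negative association. If we also want a formal test of whether this correlation differs from zero, base R's cor.test() provides one; a minimal sketch (output omitted):

cor.test(auto$price, auto$mpg)   # Pearson correlation with t test and 95% confidence interval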

We draw a density plot to examine the distribution of price.

library(e1071)   # provides skewness()
plot(density(auto$price), main="Density Plot: price", ylab="Density",
     sub=paste("Skewness:", round(e1071::skewness(auto$price), 2)))
polygon(density(auto$price), col="blue")

[Figure: density plot of price, with its skewness shown in the subtitle]
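
The density plot suggests that price is right-skewed. As a quick check of whether a log transformation reduces this skewness (which is what motivates the log(price) model used below), we can compare the two skewness values; a minimal sketch (output omitted):

e1071::skewness(auto$price)        # skewness of raw price
e1071::skewness(log(auto$price))   # skewness after the log transformation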

Linear Regression

The correlation and the scatter plot show that price and mpg have a negative relationship: as mpg increases, price tends to decrease. To examine this relationship more closely, we fit a linear regression of price on mpg.

linearMod <- lm(price ~ mpg, data=auto)
summary(linearMod)
## 
## Call:
## lm(formula = price ~ mpg, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3184.2 -1886.9  -959.8  1359.7  9669.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11253.06    1170.81   9.611 1.53e-14 ***
## mpg          -238.89      53.08  -4.501 2.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2624 on 72 degrees of freedom
## Multiple R-squared:  0.2196,	Adjusted R-squared:  0.2087 
## F-statistic: 20.26 on 1 and 72 DF,  p-value: 2.546e-05
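
To read the fit more directly, we can extract the coefficients, compute their confidence intervals, and overlay the fitted line on a scatter plot. A minimal sketch (output omitted):

coef(linearMod)                    # intercept and slope estimates
confint(linearMod, level = 0.95)   # 95% confidence intervals for both coefficients
plot(price ~ mpg, data = auto)     # scatter plot of price against mpg
abline(linearMod, col = "blue")    # add the fitted regression line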

Next, we replace price with log(price) and repeat the same steps.

auto$lp <- log(auto$price)
scatter.smooth(x=auto$mpg, y=auto$lp, main="log(price) ~ mpg",
     xlab="mpg", ylab="log(price)")

[Figure: scatter plot of log(price) against mpg with a smoothed fit]

linearMod <- lm(lp ~ mpg, data=auto)
summary(linearMod)
## 
## Call:
## lm(formula = lp ~ mpg, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58486 -0.25525 -0.08595  0.22935  1.02393 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.349342   0.153489  60.912  < 2e-16 ***
## mpg         -0.033277   0.006958  -4.782 8.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.344 on 72 degrees of freedom
## Multiple R-squared:  0.2411,	Adjusted R-squared:  0.2305 
## F-statistic: 22.87 on 1 and 72 DF,  p-value: 8.93e-06
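
In this log-linear model the mpg coefficient is approximately the proportional change in price for a one-unit increase in mpg; the implied percentage change can be computed from the estimate. A minimal sketch (output omitted):

b <- coef(linearMod)["mpg"]   # estimated slope on mpg (about -0.033)
100 * (exp(b) - 1)            # implied percentage change in price per extra mpg (about -3.3%)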

Conclusion

As seen in the results above, the mpg coefficient is statistically significant in both models: the t statistics are large and the p-values are far below 0.05. However, simple linear regression is not enough to identify a causal effect; we will discuss that next time.
