R nanocourse 6: Simple Linear Regression

Introduction

In this session, you will learn the basic steps to produce your first econometric analysis with R. Two models will be explained: the simple and the multiple linear regression. This is part of the C-V-A approach; from now on (and in the next laboratories), we will explore the -A-, namely the Analysis part of the process.

Goals

At the end of the session, you should be able to:

  1. produce a simple linear regression on a dataset;
  2. visualize the relation between two variables;
  3. explore a multiple regression.

In the last section of this session, you will find additional materials to go a step further.

Regressions

Simple linear regression

First, let’s download some data. The new dataset is about housing prices in the Boston area.

# Loading libraries
library(MASS)
library(ISLR)
library(stargazer)
library(gsheet)

# Accessing data from Google Drive and inserting these data in a dataframe named "Boston"
Boston <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1mgd9xIhKuy_M_JhR_Kbv564Q_5VwmtT6CkHNWdThvRU/edit?usp=sharing")

# First rows (and variable names) of the dataset
head(Boston)

The dataset gathers 506 observations (rows) and 14 variables (columns), detailed as follows (Harrison and Rubinfeld, 1978; Belsley et al., 1980):

  • crim: per capita crime rate by town;
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.;
  • indus: proportion of non-retail business acres per town;
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise);
  • nox: nitrogen oxides concentration (parts per 10 million);
  • rm: average number of rooms per dwelling;
  • age: proportion of owner-occupied units built prior to 1940;
  • dis: weighted mean of distances to five Boston employment centres;
  • rad: index of accessibility to radial highways;
  • tax: full-value property-tax rate per $10,000;
  • ptratio: pupil-teacher ratio by town;
  • black: 1000(Bk − 0.63)² where Bk is the proportion of blacks by town;
  • lstat: lower status of the population (percent);
  • medv: median value of owner-occupied homes in $1000s.
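
Before going further, you can verify these dimensions directly:

# Number of rows (observations) and columns (variables)
dim(Boston)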

Let’s investigate the relationship between two variables, namely the median value of owner-occupied homes in thousands of dollars (medv) and the lower status of the population in percent (lstat). To do so, you need to use the lm() function.

The first argument of the function is your econometric equation:

\[medv = \alpha_0 + \alpha_1 \cdot lstat + \epsilon\]

This will be our first model, named model1. Let’s create an object with this name (model1) and assign the econometric equation to it with the lm() function. In the first argument, you specify your dependent variable (medv), a tilde (~), and then your independent variable (lstat). The second argument is the name of the dataset to be used.

# Linear regression (lm stands for linear model)
model1 <- lm(medv ~ lstat, data = Boston)

Now, by calling summary() on model1, you will see your results.

# Results of the model
summary(model1)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

Let’s interpret this model in four points:

  1. the relation between the two variables is statistically significant (as shown by the three stars at the far right of the lstat line);
  2. the sign of the estimate of lstat is negative, i.e. when lstat increases, medv decreases, and vice versa;
  3. the coefficient of lstat is about −0.95: each additional point of lstat is associated with a decrease of about $950 in the median home value;
  4. the adjusted R² of the equation is 0.5432, i.e. lstat explains about half of the variance of medv (the sketch below shows how to extract these numbers in code).
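
These numbers can also be extracted programmatically rather than read off the printed summary, for instance:

# Estimated coefficients (intercept and slope of lstat)
coef(model1)

# Adjusted R-squared, stored in the summary object
summary(model1)$adj.r.squared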

To understand the relation between these two variables, it may be useful to visualize the results in a graph. Let’s call the plot() function to first draw each observation.

# Present the relation between variables
plot(Boston$lstat,Boston$medv, pch=20)

Now, we will add the regression line to the previous graph to fully characterize the relationship between these two variables.

# Add the regression line to the previous plot
plot(Boston$lstat,Boston$medv, pch=20)
abline(model1, lwd=3, col="red")
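
To make the graph self-explanatory, you can also pass axis labels to plot(); the label wording below is a suggestion, to adapt as you see fit:

# Same scatter plot and regression line, with descriptive axis labels
plot(Boston$lstat, Boston$medv, pch = 20,
     xlab = "Lower status of the population (%)",  # label wording is a suggestion
     ylab = "Median home value ($1000s)")
abline(model1, lwd = 3, col = "red")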

Multiple linear regression

In this section, we will start from the same database but produce a second model (model2). In this model, we want to explain the same dependent variable (medv), but this time with more than one independent variable (here: lstat and age). Such models are called multiple linear regressions and are specified as follows:

\[medv = \alpha_0 + \alpha_1 \cdot lstat + \alpha_2 \cdot age + \epsilon\]

# Perform a multiple linear regression
model2 <- lm(medv ~ lstat + age, data = Boston)
summary(model2)
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16
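
Note that the adjusted R² only rises from 0.5432 (model1) to 0.5495 (model2). One standard way to check whether adding age genuinely improves the fit is an F-test comparing the two nested models:

# F-test: does model2 (lstat + age) fit significantly better than model1 (lstat only)?
anova(model1, model2)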

tl;dr

In this section, you will find an R chunk containing the core lines of code seen in this laboratory. You can copy and run them directly to reproduce the analysis explained above.

# Loading libraries
library(MASS)
library(ISLR)
library(stargazer)
library(gsheet)

# Accessing data from Google Drive and inserting these data in a dataframe named "Boston"
Boston <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1mgd9xIhKuy_M_JhR_Kbv564Q_5VwmtT6CkHNWdThvRU/edit?usp=sharing")

# Linear regression
model1 <- lm(medv ~ lstat, data = Boston)

# Results of the model
summary(model1)

# Present the relation between variables
plot(Boston$lstat,Boston$medv, pch=20)
abline(model1, lwd=3, col="red")

# Perform a multiple linear regression
model2 <- lm(medv ~ lstat + age, data = Boston)
summary(model2)

A step further

The object returned by the lm() function contains a lot of information beyond what summary() prints. Let’s use the names() function after producing model1.

# Linear regression
model1 <- lm(medv ~ lstat, data = Boston)

# List of all components stored in the model object
names(model1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

We can compute the confidence intervals of the coefficient estimates using the confint() function, as follows:

# Confidence intervals of the estimate
confint(model1)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
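
By default, confint() returns 95% intervals; you can request another level with the level argument:

# 99% confidence intervals for the coefficients
confint(model1, level = 0.99)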

It is also possible to compute such intervals for predicted values of medv at specific values of your independent variable. Below, we compute both the confidence intervals and the prediction intervals of medv for three values of lstat, i.e. 5, 10 and 15. The confidence interval concerns the mean value of medv at a given lstat, while the (wider) prediction interval concerns a single observation.

# Confidence intervals based on specific values 
predict(model1, data.frame(lstat=c(5,10,15)), interval = "confidence")
##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
# Prediction intervals based on specific values 
predict(model1, data.frame(lstat=c(5,10,15)), interval = "prediction")
##        fit       lwr      upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310  8.077742 32.52846

A more reproducible way to present your results is to include them in a regression table. The stargazer package produces preformatted tables.

# Let's do a second model, model2
model2 <- lm(medv ~ lstat + age, data = Boston)

# Let's present your result with stargazer
stargazer(model1, model2, out="models.htm")
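
If you prefer to preview the table directly in the console instead of (or before) writing the HTML file, stargazer can also produce text output:

# Same table, printed as plain text in the console
stargazer(model1, model2, type = "text")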

tl;dr

# Linear regression
model1 <- lm(medv ~ lstat, data = Boston)

# List of all components stored in the model object
names(model1)

# Confidence intervals of the estimate
confint(model1)

# Confidence intervals based on specific values 
predict(model1, data.frame(lstat=c(5,10,15)), interval = "confidence")

# Prediction intervals based on specific values 
predict(model1, data.frame(lstat=c(5,10,15)), interval = "prediction")

# Let's do a second model, model2
model2 <- lm(medv ~ lstat + age, data = Boston)

# Let's present your result with stargazer
stargazer(model1, model2, out="models.htm")

Quiz

Produce a graph with its regression line using the data from this Google sheet. It shows the results of med students on their first exam and on their second exam as well. We want to see whether there could be a correlation between these two.

It should give you something like this.

Use the box to reproduce the graph with the linear regression. We purposefully left out the solution box so that you have to do it yourself and start looking for answers on your own. We believe in you!

After the graph is done, find the adjusted R² between the first exam and the second. It tells you how much of the variance in the second exam is explained by the first.

Code learned this week

  • lm(): produces a linear regression;
  • confint(): gives the confidence intervals of the regression coefficients;
  • predict(..., interval = "confidence"): returns the confidence intervals for specific values of your independent variable;
  • predict(..., interval = "prediction"): returns the prediction intervals for specific values of your independent variable;
  • plot(): produces a graph;
  • abline(): adds a regression line to a graph;
  • stargazer(): produces a preformatted presentation table of your models.

References

Academic References

  • Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
  • Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
  • James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.

Packages

This laboratory uses the following packages: MASS, ISLR, stargazer and gsheet.

Acknowledgments

To cite this course:

Warin, Thierry. 2020. “SKEMA Quantum Studio: R Nanocourses.” doi:10.6084/m9.figshare.11842416.v1.