R nanocourse 7: Multiple Linear Regression

Introduction

In this session, you will explore linear regressions in detail, focusing on multiple linear regressions. This is a follow-up of last session’s topic, i.e. simple linear regression. This laboratory is part of the analysis part of the research process (CVA approch).

Goals

At the end of the lab, you should be able to:

  1. produce a multiple linear regression;
  2. interpret the obtained results;
  3. use interaction terms;
  4. showcase your results with the Stargazer package.

In the last section of this session, you will find additional materials to go a step further:

  1. evaluation your variables in a correlation matrix;
  2. visualize your results in a 3D plot.

Description of the dataset

In this laboratory, we will use the same database as in the R nanocourse #6. This dataset concerns the housing price in the Boston area. It gathers 506 observations (rows) and 14 different columns, detailled as follow (Harrison and Rubinfeld, 1978; Belsley et al., 1980):

  • crim: per capita rate by town;
  • xn: proportion of residential land zoned for lots over 25K sq.ft.;
  • indus: propostion of non-retail business acres per town;
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise);
  • nox: nitrogen oxides concentration (parts per 10 million);
  • rm: average number of rooms per dwelling;
  • age: proportion of owner-occupied units built prior to 1940;
  • dis: weighted mean of distances to five Boston employment centres;
  • rad: index of accessibility to radial highways;
  • tax: full-value property-tax rate per $10,000;
  • ptratio: pupil-teacher ratio by town;
  • black: 1000(Bk − 0.63)2 where Bk is the proportion of blacks by town;
  • lstat: lower status of the population (percent);
  • medv: median value of owner-occupied homes in $1000s.
# Loading libraries
library(MASS)
library(ISLR)
library(stargazer)
library(gsheet)

# Assessing data from Google Drive and inserting these data in a dataframe named "Boston"
Boston <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1mgd9xIhKuy_M_JhR_Kbv564Q_5VwmtT6CkHNWdThvRU/edit?usp=sharing")

# Name of all variables in the dataset
summary(Boston)
##       crim                xn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Multiple Linear Regression

Let’s find the correlation between the median value of owner-occupied homes (medv, in thousands of dollars) and two variables. This time, we will focus on (1) the pupil-teacher ratio by town (ptratio) and (2) the fact that the house is located near Charles River (chas).

This econometric model (model 1) is specified as follow:

\[medv = \alpha_0 + \alpha_1 * ptratio + \alpha_2 * chas + \epsilon\]

With medv as the dependent variable; ptratio and chas as the independent variables. Let’s remember that ptratio is a continuous variable ranging from 12.6 to 22, and that chas is a dummy variable, taken the value of \(1\) if the property is next to Charles River, and \(0\) otherwise.

# Multiple linear regression
model1 <- lm(medv ~ ptratio + chas, data = Boston)

Now, by calling the summary of model1, you will see the result of the regression.

# Results of the model
summary(model1)
## 
## Call:
## lm(formula = medv ~ ptratio + chas, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.8955  -4.8524  -0.7008   3.3097  31.4152 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  60.9579     3.0406  20.048  < 2e-16 ***
## ptratio      -2.0977     0.1629 -12.874  < 2e-16 ***
## chas          4.1735     1.3889   3.005  0.00279 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.869 on 503 degrees of freedom
## Multiple R-squared:  0.2709, Adjusted R-squared:  0.268 
## F-statistic: 93.46 on 2 and 503 DF,  p-value: < 2.2e-16

Now we can interpret such model in three points for each independent variable.

Concerning the relation between the ratio of pupil to teacher (ptratio) and the median value of owner-occupied homes (medv):

  • the relation between the two variables is statistically significant (as could be seen with the 3 stars on the far right of the line concerning ptratio);
  • the sign of the estimate of ptratio is negative: i.e. when ptratio is increasing, mdev is decreasing, and vice-versa;
  • the coefficent between the two variables in about -2.09.

Concerning the relation between the location close to Charles River (chas) and the median value of owner-occupied homes (medv):

  • the relation between the two variables is statistically significant (as could be seen with the 3 stars on the far right of the line concerning chas);
  • the sign of the estimate of chas is positive: when chas is equals to 1, mdev is incr; if not, it is lower;
  • the coefficent between the two variables in about +4.17.

Overall, the R2 of the equation is 0.268, i.e.: both variables explain about a quarter of the variance of mev. In other words, when the ratio of pupils to teacher is increasing, the overall value of owner-occupied homes is decreasing, whereas when they are located next to Charles River, it is increasing.

Interaction terms

Now, let’s go a step further. Once we know the relation between each independent variables to the dependent variable, it would be interesting to evaluate the relation between homes located next to Charles River and their price, considering the ratio of pupils to teacher. This is assessed by using an interaction term.

The econometric model (model2) is specified as follow:

\[medv = \alpha_0 + \alpha_1 * ptratio + \alpha_2 * chas + \alpha_3 * ptratio * chas + \epsilon\]

# Multiple linear regression with interaction terms
model2 <- lm(medv ~ ptratio * chas, data = Boston)

Now, by calling the summary of model2, you will see the result of the regression.

# Results of the model
summary(model2)
## 
## Call:
## lm(formula = medv ~ ptratio * chas, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7697  -4.9891  -0.6121   3.5019  31.6109 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   63.1253     3.1192  20.238  < 2e-16 ***
## ptratio       -2.2147     0.1672 -13.243  < 2e-16 ***
## chas         -28.3322    11.7615  -2.409  0.01636 *  
## ptratio:chas   1.8515     0.6653   2.783  0.00559 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.816 on 502 degrees of freedom
## Multiple R-squared:  0.282,  Adjusted R-squared:  0.2777 
## F-statistic: 65.73 on 3 and 502 DF,  p-value: < 2.2e-16

Now, the relation between the interaction of both variables (ptratio::chas) and the independent variable is statistically significant (2 stars); the sign of the relation is positive and its coefficient is equals to 1.85.

In other words, concerning the houses located next to Charles River, when the ptratio is increasing (i.e. when teacher have more children to take care of), the median value of homes is increasing, which is the contrary on the overall sample.

Stargazer

Using the stargazer package, it is possible to showcase your previous results in a nicely done table.

# Let's present your result with stargazer
stargazer(model1, model2, out="models1and2.htm")

tl;dr

In this section, you will find a R chunk containing every line of code seen in this laboratory. You can copy and use them directly to produce the analysis previously explained.

# Loading libraries
library(MASS)
library(ISLR)
library(stargazer)
library(gsheet)

# Assessing data from Google Drive and inserting these data in a dataframe named "Boston"
Boston <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1mgd9xIhKuy_M_JhR_Kbv564Q_5VwmtT6CkHNWdThvRU/edit?usp=sharing")

# Name of all variables in the dataset
summary(Boston)

# Multiple linear regression
model1 <- lm(medv ~ ptratio + chas, data = Boston)
summary(model1)

# Multiple linear regression with interaction terms
model2 <- lm(medv ~ ptratio * chas, data = Boston)
summary(model2)

# Let's present your result with stargazer
stargazer(model1, model2, out="models1and2.htm")

A step further

Correlation matrix

Before your regression, you need to insure that your independent variables are not correlated between themselves. Otherwise, your estimations might be overestimated. To do so, let’s procede in two steps.

First, we can visualize the relation between some of the variables (here: crim, xn, indus, chas and nox) using the function pairs():

# Pair relations between variables
pairs(Boston[,1:5],col="black", size=1,pch=20)

Second, we can obtain measure the correlation coefficient between each variables using the cor() function. These results will be showcased with the corrplot() function of the corrplot library:

# Correlation matrix
library(corrplot)
correlationData = cor(Boston[1:5])
corrplot(correlationData, method = "number")

For example, in this case, nox is highly correlated to xn, indus and crim. In other words, if a regression has to be considered, nox and the three other variables should not be used together.

3D visualization

# 2D visualization
library(visreg)

visreg2d(model2, "chas", "ptratio", plot.type="image")

visreg2d(model2, "chas", "ptratio",  plot.type="persp")

tl;dr

# Pair relations between variables
pairs(Boston[,1:5])

# Correlation matrix
library(corrplot)
correlationData = cor(Boston[1:5])
corrplot(correlationData, method = "number")

# 2D visualization
library(visreg)

visreg2d(model2, "chas", "ptratio", plot.type="image")
visreg2d(model2, "chas", "ptratio",  plot.type="persp")

Quiz

For this quiz, we will use the same Google Sheet as we used for the quiz of the last nanocourse. We’ll dive even further into the correlation between the success on an exam by a Med student and the factors that might influence it. Here, we will want to make a correlation matrix with multiple variables.

  • Exam1 is the result to the first exam
  • Exam2 is the result to the second exam
  • Interest is the motivation the student as for this particular course; on 100
  • Perception is how the student perceive himself on his confidence on that course; on 100
  • gpa is the Grade Point Average of that student overall. It is based on a scale up to 4
  • mean is the result of all his exam divided by the total of exams the student participated to

Reproduce these matrix but instead of circles, find the numbers in which represents the correlation. Once again, we did not put the solution box and trust you into finding it yourself. If that’s not trust; we don’t know what trust is!

Code learned this week

Command Detail
cor() produce a correlation matrix
corrplot() visualize a correlation matrix
pairs() produce a pair graph between variables
visreg2d() 2D visualization of multiple linear regression

References

Academic References

  • Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
  • Breheny, P. and Burchett, W. (2016) Visualization of Regression Models Using visreg. Retrieved from: http://myweb.uiowa.edu/pbreheny/publications/visreg.pdf
  • Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
  • James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.

Packages

Acknowledgments

To cite this course:

Warin, Thierry. 2020. “SKEMA Quantum Studio: R Nanocourses.” doi:10.6084/m9.figshare.11842416.v1.