In this session, you will explore linear regressions in detail, focusing on multiple linear regressions. This is a follow-up of last session’s topic, *i.e.* simple linear regression. This laboratory is part of the analysis part of the research process (**CVA** approch).

**At the end of the lab, you should be able to:**

- produce a multiple linear regression;
- interpret the obtained results;
- use interaction terms;
- showcase your results with the Stargazer package.

In the last section of this session, you will find **additional materials** to go a step further:

- evaluation your variables in a correlation matrix;
- visualize your results in a 3D plot.

In this laboratory, we will use the same database as in the R nanocourse #6. This dataset concerns the housing price in the Boston area. It gathers 506 observations (rows) and 14 different columns, detailled as follow (Harrison and Rubinfeld, 1978; Belsley et al., 1980):

**crim**: per capita rate by town;**xn**: proportion of residential land zoned for lots over 25K sq.ft.;**indus**: propostion of non-retail business acres per town;**chas**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise);**nox**: nitrogen oxides concentration (parts per 10 million);**rm**: average number of rooms per dwelling;**age**: proportion of owner-occupied units built prior to 1940;**dis**: weighted mean of distances to five Boston employment centres;**rad**: index of accessibility to radial highways;**tax**: full-value property-tax rate per $10,000;**ptratio**: pupil-teacher ratio by town;**black**: 1000(Bk − 0.63)2 where Bk is the proportion of blacks by town;**lstat**: lower status of the population (percent);**medv**: median value of owner-occupied homes in $1000s.

```
# Loading libraries
library(MASS)
library(ISLR)
library(stargazer)
library(gsheet)
# Assessing data from Google Drive and inserting these data in a dataframe named "Boston"
Boston <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1mgd9xIhKuy_M_JhR_Kbv564Q_5VwmtT6CkHNWdThvRU/edit?usp=sharing")
# Name of all variables in the dataset
summary(Boston)
```

```
## crim xn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
```

Let’s find the correlation between the median value of owner-occupied homes (**medv**, in thousands of dollars) and two variables. This time, we will focus on (1) the pupil-teacher ratio by town (**ptratio**) and (2) the fact that the house is located near Charles River (**chas**).

This econometric model (model 1) is specified as follow:

\[medv = \alpha_0 + \alpha_1 * ptratio + \alpha_2 * chas + \epsilon\]

With **medv** as the **dependent** variable; **ptratio** and **chas** as the independent variables. Let’s remember that ptratio is a continuous variable ranging from 12.6 to 22, and that chas is a **dummy** variable, taken the value of \(1\) if the property is next to Charles River, and \(0\) otherwise.

```
# Multiple linear regression
model1 <- lm(medv ~ ptratio + chas, data = Boston)
```

Now, by calling the summary of model1, you will see the result of the regression.

```
# Results of the model
summary(model1)
```

```
##
## Call:
## lm(formula = medv ~ ptratio + chas, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.8955 -4.8524 -0.7008 3.3097 31.4152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.9579 3.0406 20.048 < 2e-16 ***
## ptratio -2.0977 0.1629 -12.874 < 2e-16 ***
## chas 4.1735 1.3889 3.005 0.00279 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.869 on 503 degrees of freedom
## Multiple R-squared: 0.2709, Adjusted R-squared: 0.268
## F-statistic: 93.46 on 2 and 503 DF, p-value: < 2.2e-16
```

Now we can interpret such model in three points for each independent variable.

Concerning the relation between the **ratio of pupil to teacher** (ptratio) and the median value of owner-occupied homes (medv):

- the relation between the two variables is
**statistically significant**(as could be seen with the 3 stars on the far right of the line concerning ptratio); - the sign of the estimate of ptratio is
**negative**: i.e. when ptratio is increasing, mdev is decreasing, and vice-versa; - the coefficent between the two variables in about
**-2.09**.

Concerning the relation between the **location close to Charles River** (chas) and the median value of owner-occupied homes (medv):

- the relation between the two variables is
**statistically significant**(as could be seen with the 3 stars on the far right of the line concerning chas); - the sign of the estimate of chas is
**positive**: when chas is equals to 1, mdev is incr; if not, it is lower; - the coefficent between the two variables in about
**+4.17**.

Overall, the **R ^{2}** of the equation is

Now, let’s go a step further. Once we know the relation between each independent variables to the dependent variable, it would be interesting to evaluate the relation between homes located next to Charles River and their price, considering the ratio of pupils to teacher. This is assessed by using an **interaction term**.

The econometric model (model2) is specified as follow:

\[medv = \alpha_0 + \alpha_1 * ptratio + \alpha_2 * chas + \alpha_3 * ptratio * chas + \epsilon\]

```
# Multiple linear regression with interaction terms
model2 <- lm(medv ~ ptratio * chas, data = Boston)
```

Now, by calling the summary of model2, you will see the result of the regression.

```
# Results of the model
summary(model2)
```

```
##
## Call:
## lm(formula = medv ~ ptratio * chas, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.7697 -4.9891 -0.6121 3.5019 31.6109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.1253 3.1192 20.238 < 2e-16 ***
## ptratio -2.2147 0.1672 -13.243 < 2e-16 ***
## chas -28.3322 11.7615 -2.409 0.01636 *
## ptratio:chas 1.8515 0.6653 2.783 0.00559 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.816 on 502 degrees of freedom
## Multiple R-squared: 0.282, Adjusted R-squared: 0.2777
## F-statistic: 65.73 on 3 and 502 DF, p-value: < 2.2e-16
```

Now, the relation between the interaction of both variables (ptratio::chas) and the independent variable is statistically significant (2 stars); the sign of the relation is positive and its coefficient is equals to 1.85.

In other words, **concerning the houses located next to Charles River**, when the ptratio is increasing (i.e. when teacher have more children to take care of), the median value of homes is increasing, which is the contrary on the **overall** sample.

Using the stargazer package, it is possible to showcase your previous results in a nicely done table.

```
# Let's present your result with stargazer
stargazer(model1, model2, out="models1and2.htm")
```

In this section, you will find a R chunk containing every line of code seen in this laboratory. You can copy and use them directly to produce the analysis previously explained.

```
# Loading libraries
library(MASS)
library(ISLR)
library(stargazer)
library(gsheet)
# Assessing data from Google Drive and inserting these data in a dataframe named "Boston"
Boston <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1mgd9xIhKuy_M_JhR_Kbv564Q_5VwmtT6CkHNWdThvRU/edit?usp=sharing")
# Name of all variables in the dataset
summary(Boston)
# Multiple linear regression
model1 <- lm(medv ~ ptratio + chas, data = Boston)
summary(model1)
# Multiple linear regression with interaction terms
model2 <- lm(medv ~ ptratio * chas, data = Boston)
summary(model2)
# Let's present your result with stargazer
stargazer(model1, model2, out="models1and2.htm")
```

Before your regression, you need to insure that your independent variables are not correlated between themselves. Otherwise, your estimations might be overestimated. To do so, let’s procede in two steps.

First, we can visualize the relation between some of the variables (here: **crim**, **xn**, **indus**, **chas** and **nox**) using the function pairs():

```
# Pair relations between variables
pairs(Boston[,1:5],col="black", size=1,pch=20)
```

Second, we can obtain measure the correlation coefficient between each variables using the cor() function. These results will be showcased with the corrplot() function of the corrplot library:

```
# Correlation matrix
library(corrplot)
correlationData = cor(Boston[1:5])
corrplot(correlationData, method = "number")
```

For example, in this case, **nox** is highly correlated to **xn**, **indus** and **crim**. In other words, if a regression has to be considered, **nox** and the three other variables should not be used together.

```
# 2D visualization
library(visreg)
visreg2d(model2, "chas", "ptratio", plot.type="image")
```

`visreg2d(model2, "chas", "ptratio", plot.type="persp")`

```
# Pair relations between variables
pairs(Boston[,1:5])
# Correlation matrix
library(corrplot)
correlationData = cor(Boston[1:5])
corrplot(correlationData, method = "number")
# 2D visualization
library(visreg)
visreg2d(model2, "chas", "ptratio", plot.type="image")
visreg2d(model2, "chas", "ptratio", plot.type="persp")
```

For this quiz, we will use the same Google Sheet as we used for the quiz of the last nanocourse. We’ll dive even further into the correlation between the success on an exam by a Med student and the factors that might influence it. Here, we will want to make a correlation matrix with multiple variables.

**Exam1**is the result to the first exam**Exam2**is the result to the second exam**Interest**is the motivation the student as for this particular course; on 100**Perception**is how the student perceive himself on his confidence on that course; on 100**gpa**is the Grade Point Average of that student overall. It is based on a scale up to 4**mean**is the result of all his exam divided by the total of exams the student participated to

Reproduce these matrix but instead of circles, find the numbers in which represents the correlation. Once again, we did not put the solution box and trust you into finding it yourself. If that’s not trust; we don’t know what trust is!

Command |
Detail |
---|---|

cor() | produce a correlation matrix |

corrplot() | visualize a correlation matrix |

pairs() | produce a pair graph between variables |

visreg2d() | 2D visualization of multiple linear regression |

- Belsley D.A., Kuh, E. and Welsch, R.E. (1980)
*Regression Diagnostics. Identifying Influential Data and Sources of Collinearity.*New York: Wiley. - Breheny, P. and Burchett, W. (2016)
*Visualization of Regression Models Using visreg*. Retrieved from: http://myweb.uiowa.edu/pbreheny/publications/visreg.pdf - Harrison, D. and Rubinfeld, D.L. (1978)
*Hedonic prices and the demand for clean air.*J. Environ. Economics and Management 5, 81–102. - James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013).
*An Introduction to Statistical Learning with Applications in R*. Springer.

- corrplot: T. Wei & V. Simko
- gsheet: M. Conway
- stargazer: M. Hlavac
- MASS: B. Ripley & al.
- ISLR: G. James & al.
- visreg: P. Breheny & W. Burchett

To cite this course:

Warin, Thierry. 2020. “SKEMA Quantum Studio: R Nanocourses.” doi:10.6084/m9.figshare.11842416.v1.