R nanocourse 5: Descriptive Statistics

## Introduction

In this session, you will learn the basic functions used to produce descriptive statistics of a dataset. We will then embed these analyses in a comprehensive table in Markdown format.

## Goals

At the end of the lab, you should be able to:

1. compute basic descriptive statistics of a dataset;
2. present your results in a table.

## Statistics

Let’s consider the following dataset: 0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16. We will store these data in a numeric vector called “dataStat”.

dataStat <- c(0,0,0,1,2,3,4,5,6,6,6,7,8,9,10,16)
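As a quick check using only base R, the object can be inspected to confirm that it is a numeric vector of 16 values. The data frame at the end is an optional sketch; its name `dataStatDf` and its column name `value` are hypothetical choices, not part of the course material.

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

# Inspect the object: type, length and structure
is.numeric(dataStat)     # TRUE
is.data.frame(dataStat)  # FALSE
length(dataStat)         # 16
str(dataStat)

# If a data frame is needed, the vector can be wrapped in one
# ("dataStatDf" and "value" are hypothetical names for this sketch)
dataStatDf <- data.frame(value = dataStat)
nrow(dataStatDf)         # 16
```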

R has many built-in functions for describing a dataset. We will explore a few of them, then present some packages that give a more complete view of your data.

### Built-in functions

#### Minimum

min(dataStat)
## [1] 0

#### Maximum

max(dataStat)
## [1] 16

#### Median

median(dataStat)
## [1] 5.5
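The dataset has an even number of values, so the median is the average of the two middle values once the data are sorted. A minimal sketch of that calculation by hand (the variable names are illustrative):

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

sortedData <- sort(dataStat)
n <- length(sortedData)                          # 16, an even number

# The two middle values are the 8th and 9th sorted values: 5 and 6
middleValues <- sortedData[c(n / 2, n / 2 + 1)]

# Their average is the median
manualMedian <- mean(middleValues)
all.equal(manualMedian, median(dataStat))        # TRUE
```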

#### Mean

mean(dataStat)
## [1] 5.1875

#### Sum

sum(dataStat)
## [1] 83
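The mean and the sum are linked: the mean is the sum divided by the number of values. A short sketch that checks this on the same vector:

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

total <- sum(dataStat)       # 83
n <- length(dataStat)        # 16

# Mean computed by hand: 83 / 16 = 5.1875
manualMean <- total / n
all.equal(manualMean, mean(dataStat))   # TRUE
```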

#### Variance

var(dataStat)
## [1] 18.82917

#### Standard deviation

sd(dataStat)
## [1] 4.339259
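Both var() and sd() use the sample formulas, which divide by n - 1 rather than n, and the standard deviation is the square root of the variance. The following sketch reproduces both values by hand (the variable names are illustrative):

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

n <- length(dataStat)
deviations <- dataStat - mean(dataStat)

# Sample variance: sum of squared deviations divided by (n - 1)
manualVar <- sum(deviations^2) / (n - 1)

# Standard deviation: square root of the variance
manualSd <- sqrt(manualVar)

all.equal(manualVar, var(dataStat))   # TRUE
all.equal(manualSd, sd(dataStat))     # TRUE
```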

### Packages

#### Summary

The summary() function is part of base R, so no package needs to be loaded. It returns:

• Minimum value;
• First quartile;
• Median value;
• Mean value;
• Third quartile;
• Maximum value.

summary(dataStat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   1.750   5.500   5.188   7.250  16.000
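The quartiles reported by summary() can also be obtained on their own with the base R function quantile(), and the interquartile range with IQR(). A brief sketch, using R's default quantile method:

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

# First and third quartiles (default quantile method, as in summary())
quartiles <- quantile(dataStat, probs = c(0.25, 0.75))
quartiles                        # 1.75 and 7.25

# Interquartile range: third quartile minus first quartile
interquartileRange <- IQR(dataStat)
interquartileRange               # 5.5
```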

#### Hmisc

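The describe() function from the Hmisc package returns:

• Number of values;
• Number of missing values;
• Number of unique values;
• Mean;
• 5th, 10th, 25th, 50th, 75th, 90th and 95th percentiles;
• 5 lowest and 5 highest values.

library(Hmisc)
describe(dataStat)
# Detach Hmisc so that describe() from psych can be used next without a name clash
detach("package:Hmisc")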

#### psych

The psych package also provides a describe() function, which returns:

• Number of values;
• Mean value;
• Standard deviation;
• Median value;
• Minimum value;
• Maximum value;
• Range;
• Skew;
• Kurtosis;
• Standard error.

library(psych)
describe(dataStat)

#### pastecs

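The stat.desc() function from the pastecs package returns:

• Number of values;
• Number of null values (values equal to zero);
• Number of NA values;
• Minimum value;
• Maximum value;
• Range of the dataset (max - min);
• Sum of the dataset;
• Median value;
• Mean value;
• Standard error of the mean;
• 95% confidence interval of the mean (reported as a half-width around the mean);
• Variance;
• Standard deviation;
• Coefficient of variation.

library(pastecs)
stat.desc(dataStat)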
##      nbr.val     nbr.null       nbr.na          min          max        range
##   16.0000000    3.0000000    0.0000000    0.0000000   16.0000000   16.0000000
##          sum       median         mean      SE.mean CI.mean.0.95          var
##   83.0000000    5.5000000    5.1875000    1.0848147    2.3122278   18.8291667
##      std.dev     coef.var
##    4.3392588    0.8364836
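Several of the values in this output have no dedicated built-in function, but they can be reproduced in base R. The sketch below assumes the usual textbook definitions (standard error as sd divided by the square root of n, a confidence half-width based on the Student t distribution, coefficient of variation as sd divided by the mean); these match the numbers printed above.

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)
n <- length(dataStat)

# nbr.null: number of values equal to zero
nbrNull <- sum(dataStat == 0)

# SE.mean: standard error of the mean
seMean <- sd(dataStat) / sqrt(n)

# CI.mean.0.95: half-width of the 95% confidence interval of the mean
ciMean <- qt(0.975, df = n - 1) * seMean

# coef.var: coefficient of variation
coefVar <- sd(dataStat) / mean(dataStat)

c(nbr.null = nbrNull, SE.mean = seMean, CI.mean.0.95 = ciMean, coef.var = coefVar)
```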

## Tables

Reminder from the first session: there are two ways to write R code in an R Markdown document: (1) inside an R chunk, which is delimited by three backticks; or (2) inside a line of text, which is then called inline R code and is delimited by single backticks.

# This is the typography used for an R chunk:
#
# ```{r}
# some R code here
# ```

# This is the typography used for inline R code:
#
# The sum of the dataset is equal to `r sum(dataset)`

To create a table in R Markdown, you lay out the columns and rows yourself. The first line is dedicated to the header, the second line is a row of dashes that separates the header from the body and sets the format of the table, and the subsequent lines hold the values you want to present. You can embed either text or inline R code in these tables.

# This is the template of a table:
#
# Header 1 | Header 2 | Header 3 | Header 4
# ------------- | ------------- | ------------- | -------------
# Row 1 Column 1 | Row 1 Column 2 | Row 1 Column 3 | Row 1 Column 4
# Row 2 Column 1 | Row 2 Column 2 | Row 2 Column 3 | Row 2 Column 4
# Row 3 Column 1 | Row 3 Column 2 | Row 3 Column 3 | Row 3 Column 4
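The same layout can also be assembled from R itself. The following is a minimal sketch, not the course's method: it builds the lines of a two-column table with paste() and prints them with cat(). The column titles and variable names are illustrative.

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

# Header line, then the line of dashes that separates header and body
header    <- "Statistic | Value"
separator <- "--------- | -----"

# One line per statistic, with cells separated by " | "
rows <- c(
  paste("Minimum", min(dataStat),    sep = " | "),
  paste("Median",  median(dataStat), sep = " | "),
  paste("Mean",    mean(dataStat),   sep = " | "),
  paste("Maximum", max(dataStat),    sep = " | ")
)

tableLines <- c(header, separator, rows)
cat(tableLines, sep = "\n")
```

Placed in an R chunk with the chunk option results = 'asis', this output should be rendered as a table instead of being shown as raw text.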

• From the UNIDO dataset, provide some descriptive statistics on the number of employees (Tablecode == 4) in the wine sector (IsicCode == 1552) in Spain (CountryCode == 724), Italy (CountryCode == 380), France (CountryCode == 250) and the U.S. (CountryCode == 840).
• Present these statistics in a table.

The first step of this exercise is to obtain your dataset from UNIDO and clean your data.

# Importing data
library(gsheet)
dataUnido <- gsheet2tbl(gs15x)

# Transforming values into numerical values
dataUnido$Value <- as.numeric(dataUnido$Value)
dataUnido$Tablecode <- as.numeric(dataUnido$Tablecode)
dataUnido$CountryCode <- as.numeric(dataUnido$CountryCode)
dataUnido$Year <- as.numeric(dataUnido$Year)
dataUnido$IsicCode <- as.numeric(dataUnido$IsicCode)
dataUnido$Unit <- NULL

# Data for the wine industry
library(dplyr)
dataUnidoWine <- filter(dataUnido, IsicCode == 1552)
dataUnidoWine <- filter(dataUnidoWine, Tablecode == 4)

# Cleaning the dataset into four subsets, one for each country
dataUnidoFrance <- filter(dataUnidoWine, CountryCode == 250)
dataUnidoItaly <- filter(dataUnidoWine, CountryCode == 380)
dataUnidoSpain <- filter(dataUnidoWine, CountryCode == 724)
dataUnidoUS <- filter(dataUnidoWine, CountryCode == 840)

Finally, we will present some descriptive statistics on the four subsets in a table, using inline R code. Important: if an NA value is present in your dataset, it must not be taken into account. This can be achieved with the function na.omit(). For example, to obtain the minimum of all values for the number of employees in the Italian wine sector, you need to call: min(na.omit(dataUnidoItaly$Value)).
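To see why the NA values matter, here is a small sketch on a made-up vector (the `employees` values below are hypothetical and are not taken from the UNIDO dataset). Without any NA handling, min() returns NA; na.omit() drops the missing values first, and the argument na.rm = TRUE is an equivalent alternative for min(), max(), mean(), median() and sum().

```r
# Hypothetical employee counts with one missing value (not UNIDO data)
employees <- c(9066, NA, 12665, 15682)

# A single NA makes the result NA
minWithNa <- min(employees)

# Dropping the NA values first gives the actual minimum
minOmit <- min(na.omit(employees))

# Equivalent alternative: ask min() to ignore NA values
minNaRm <- min(employees, na.rm = TRUE)

c(minWithNa, minOmit, minNaRm)   # NA 9066 9066
```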

min(na.omit(dataUnidoItaly$Value))
## [1] 9066

In 1994, there were 9066 employees in the Italian wine sector (i.e. the smallest number of employees recorded in the Italian wine sector). With the same logic, you can produce a table with the minimum, average, median and maximum value for each country. Instead of min, you write mean, median or max.

# | Country | Minimum | Average | Median | Maximum |
# | ------------------ | ------------------ | ------------------ | ------------------- | ------------------- |
# | **France** | `r min(na.omit(dataUnidoFrance$Value))` | `r mean(na.omit(dataUnidoFrance$Value))` | `r median(na.omit(dataUnidoFrance$Value))` | `r max(na.omit(dataUnidoFrance$Value))` |
# | **Italy** | `r min(na.omit(dataUnidoItaly$Value))` | `r mean(na.omit(dataUnidoItaly$Value))` | `r median(na.omit(dataUnidoItaly$Value))` | `r max(na.omit(dataUnidoItaly$Value))` |
# | **Spain** | `r min(na.omit(dataUnidoSpain$Value))` | `r mean(na.omit(dataUnidoSpain$Value))` | `r median(na.omit(dataUnidoSpain$Value))` | `r max(na.omit(dataUnidoSpain$Value))` |
# | **US** | `r min(na.omit(dataUnidoUS$Value))` | `r mean(na.omit(dataUnidoUS$Value))` | `r median(na.omit(dataUnidoUS$Value))` | `r max(na.omit(dataUnidoUS$Value))` |

This code will generate the following table:

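| Country | Minimum | Average | Median | Maximum |
| --- | --- | --- | --- | --- |
| **France** | $1.0959 \times 10^{4}$ | $1.3223455 \times 10^{4}$ | $1.3861 \times 10^{4}$ | $1.4239 \times 10^{4}$ |
| **Italy** | 9066 | $1.2665 \times 10^{4}$ | $1.3913 \times 10^{4}$ | $1.5682 \times 10^{4}$ |
| **Spain** | $1.6878 \times 10^{4}$ | $1.7400333 \times 10^{4}$ | $1.7655 \times 10^{4}$ | $1.7668 \times 10^{4}$ |
| **US** | $1.8193 \times 10^{4}$ | $2.4975182 \times 10^{4}$ | $2.4093 \times 10^{4}$ | $3.4481 \times 10^{4}$ |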
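Most of these values are displayed in scientific notation, which is hard to read for employee counts. One possible remedy, sketched here under the assumption that whole numbers are precise enough, is to round the value and pass it through format(), which returns a character string that inline R code inserts as written. The value below is the French average from the table; the variable names are illustrative.

```r
# Average number of employees for France, taken from the table above
averageFrance <- 13223.455

# Round to a whole number and add a thousands separator
formattedAverage <- format(round(averageFrance), big.mark = ",")
formattedAverage                 # "13,223"

# Keep one decimal place instead
formattedOneDecimal <- format(round(averageFrance, 1), nsmall = 1)
formattedOneDecimal              # "13223.5"
```

In the table, a cell would then be written as `r format(round(mean(na.omit(dataUnidoFrance$Value))), big.mark = ",")`.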

## Code learned this week

| Command | Detail |
| --- | --- |
| min() | minimum value |
| median() | median value |
| max() | maximum value |
| mean() | mean value |
| sum() | sum of a dataset |
| var() | variance of a dataset |
| sd() | standard deviation of a dataset |
| summary() | set of descriptive statistics |
| describe() | from Hmisc and psych packages, set of descriptive statistics |
| stat.desc() | from pastecs package, set of descriptive statistics |

## References

### Resources

For more on the RMarkdown syntax, please refer to:

## Acknowledgments

To cite this course:

Warin, Thierry. 2020. “SKEMA Quantum Studio: R Nanocourses.” doi:10.6084/m9.figshare.11842416.v1.