R nanocourse 5: Descriptive Statistics

Introduction

In this session, you will learn the basic functions used to produce descriptive statistics for a dataset. We will then present these analyses in a table written in Markdown format.

Goals

At the end of the lab, you should be able to:

  1. compute basic descriptive statistics for a dataset;
  2. present your results in a table.

Statistics

Let’s consider the following dataset: 0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16. We will store these data in a numeric vector called “dataStat”.

dataStat <- c(0,0,0,1,2,3,4,5,6,6,6,7,8,9,10,16)

R ships with many built-in functions for describing a dataset. We will explore a few of them, then present some packages that give a more complete view of your dataset.

Built-in functions

Minimum

min(dataStat)
## [1] 0

Maximum

max(dataStat)
## [1] 16

Median

median(dataStat)
## [1] 5.5
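
With an even number of observations, the median is the average of the two middle values of the sorted data. The following is a minimal sketch of that calculation by hand; the names sortedData, n, middle and manualMedian are introduced here for illustration only.

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

sortedData <- sort(dataStat)                 # the median is defined on sorted values
n <- length(sortedData)                      # 16 observations, an even count
middle <- sortedData[c(n / 2, n / 2 + 1)]    # 8th and 9th values: 5 and 6
manualMedian <- mean(middle)                 # 5.5, the same result as median(dataStat)
manualMedian
```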

Mean

mean(dataStat)
## [1] 5.1875
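
The mean is simply the sum of the values divided by the number of values. As a quick check, here is a sketch that rebuilds it from sum() and length(); manualMean is a name chosen for this example.

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

# 83 divided by 16 observations
manualMean <- sum(dataStat) / length(dataStat)
manualMean   # 5.1875, the same result as mean(dataStat)
```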

Sum

sum(dataStat)
## [1] 83

Variance

var(dataStat)
## [1] 18.82917

Standard deviation

sd(dataStat)
## [1] 4.339259
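
Note that var() returns the sample variance, which divides the sum of squared deviations by n - 1 rather than n, and that sd() is its square root. A minimal sketch reproducing both values by hand (the names deviations, manualVar and manualSd are illustrative):

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

# Distance of each observation from the mean
deviations <- dataStat - mean(dataStat)

# Sample variance: sum of squared deviations divided by (n - 1)
manualVar <- sum(deviations^2) / (length(dataStat) - 1)

# Standard deviation: square root of the variance, in the units of the data
manualSd <- sqrt(manualVar)

c(variance = manualVar, sd = manualSd)
```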

Packages

Summary

The summary() function is part of base R, so no package needs to be loaded. It returns:

  • Minimum value;
  • First quartile;
  • Median value;
  • Mean value;
  • Third quartile;
  • Maximum value.
summary(dataStat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.750   5.500   5.188   7.250  16.000
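
The quartiles above can also be obtained with quantile(). As a sketch of where the value 1.750 comes from, the code below reproduces the first quartile by hand, assuming R's default quantile method (type 7), which interpolates linearly at position 1 + (n - 1) * p in the sorted data. The names position, lower, upper and manualQ1 are illustrative.

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

# First and third quartiles, as reported by summary()
quartiles <- quantile(dataStat, probs = c(0.25, 0.75))
quartiles

# First quartile by hand (default method, type 7)
sortedData <- sort(dataStat)
n <- length(sortedData)
position <- 1 + (n - 1) * 0.25          # 4.75: between the 4th and 5th values
lower <- sortedData[floor(position)]    # 4th value: 1
upper <- sortedData[ceiling(position)]  # 5th value: 2
manualQ1 <- lower + (position - floor(position)) * (upper - lower)
manualQ1                                # 1.75

# A single entry of summary() can be extracted by name
summaryMean <- summary(dataStat)[["Mean"]]
```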

Hmisc

  • Number of observations;
  • Number of missing values;
  • Number of unique values;
  • Mean;
  • 5, 10, 25, 50, 75, 90, 95th percentiles;
  • 5 lowest and 5 highest values.
library(Hmisc)
describe(dataStat)
# Detach Hmisc: psych, loaded next, also provides a function named describe()
detach("package:Hmisc")

psych

  • Number of values;
  • Mean value;
  • Standard deviation;
  • Median value;
  • Minimum value;
  • Maximum value;
  • Range;
  • Skew;
  • Kurtosis;
  • Standard error of the mean.
library(psych)
describe(dataStat)
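
Hmisc and psych both export a function called describe(), so whichever package was attached last is the one that gets called. One way to make the choice explicit is the :: operator, which calls a function from a named package without attaching it. The sketch below assumes both packages are installed and simply skips a package that is not.

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

# Call each describe() through its namespace, without library() or detach()
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(dataStat))
}
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(dataStat))
}
```

With this style, the order in which packages are loaded no longer matters.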

pastecs

  • Number of values;
  • Number of null (zero) values;
  • Number of NA values;
  • Minimum value;
  • Maximum value;
  • Range of the dataset (max-min);
  • Sum of the dataset;
  • Median value;
  • Mean value;
  • Standard error of the mean;
  • 95% confidence interval of the mean;
  • Variance;
  • Standard deviation;
  • Coefficient of variation.
library(pastecs)
stat.desc(dataStat)
##      nbr.val     nbr.null       nbr.na          min          max        range 
##   16.0000000    3.0000000    0.0000000    0.0000000   16.0000000   16.0000000 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
##   83.0000000    5.5000000    5.1875000    1.0848147    2.3122278   18.8291667 
##      std.dev     coef.var 
##    4.3392588    0.8364836
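
Several entries in this output can be rebuilt from the base functions seen earlier, which helps in reading them. The sketch below reproduces nbr.null, SE.mean, CI.mean.0.95 and coef.var; it assumes that CI.mean.0.95 is the half-width of a 95% confidence interval based on the t distribution, which is consistent with the value printed above. The variable names are illustrative.

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)
n <- length(dataStat)

nbrNull <- sum(dataStat == 0)               # number of zeros: 3
seMean  <- sd(dataStat) / sqrt(n)           # standard error of the mean
ciMean  <- qt(0.975, df = n - 1) * seMean   # half-width of the 95% CI (assumption)
coefVar <- sd(dataStat) / mean(dataStat)    # coefficient of variation

round(c(nbr.null = nbrNull, SE.mean = seMean,
        CI.mean.0.95 = ciMean, coef.var = coefVar), 7)
```

The mean plus or minus ciMean then gives the bounds of the confidence interval.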

Tables

Reminder from the first session: there are two ways to write R code in an RMarkdown document: (1) inside an R chunk, which is delimited by three backticks; or (2) inside a line of text, which is then called inline R code and is delimited by single backticks.

# This is the typography used for an R chunk:
#
# ```{r}
# some R code here
# ```

# This is the typography used for inline R code:
#
# The sum of the dataset is equal to `r sum(dataStat)`

To create a table in RMarkdown, you lay out its columns and rows by hand, separating the columns with the | character. The first line is dedicated to the header, the second line is a row of dashes that separates the header from the body, and the subsequent lines hold the values you want to present. You can embed either text or inline R code in these tables.

# This is the template of a table:

# Header keyword 1 | Header keyword 2 | Header keyword 3 | Header keyword 4
# ------------- | ------------- | ------------- | -------------
# Row 1 Column 1 | Row 1 Column 2 | Row 1 Column 3 | Row 1 Column 4
# Row 2 Column 1 | Row 2 Column 2 | Row 2 Column 3 | Row 2 Column 4
# Row 3 Column 1 | Row 3 Column 2 | Row 3 Column 3 | Row 3 Column 4
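
Because a Markdown table is just lines of text, the same layout can also be produced from R. The sketch below uses a hypothetical helper, markdownTable(), written for this example in base R; it is not part of RMarkdown or of any package. It turns a small data frame of statistics into the three parts described above: header, dashed separator and rows.

```r
dataStat <- c(0, 0, 0, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 10, 16)

# Hypothetical helper: build the lines of a Markdown table from a data frame
markdownTable <- function(df) {
  header    <- paste(names(df), collapse = " | ")
  separator <- paste(rep("---", ncol(df)), collapse = " | ")
  # Paste the columns together row by row
  rows <- do.call(paste, c(unname(as.list(df)), sep = " | "))
  c(header, separator, rows)
}

statTable <- data.frame(
  Statistic = c("Minimum", "Median", "Maximum"),
  Value     = c(min(dataStat), median(dataStat), max(dataStat)),
  stringsAsFactors = FALSE
)

tableLines <- markdownTable(statTable)
cat(tableLines, sep = "\n")
```

Inside an R chunk, this output is rendered as a table only if the chunk option results = "asis" is set; otherwise it is shown as plain console output.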

Task:

  • From the UNIDO dataset, please provide some descriptive statistics regarding the number of employees (Tablecode == 4) in the wine sector (IsicCode == 1552) in Spain (CountryCode == 724), Italy (CountryCode == 380), France (CountryCode == 250) and the U.S. (CountryCode == 840).
  • Showcase these data inside a table.

The first step of this exercise is to obtain your dataset from UNIDO and clean your data.

# Importing data
gs15x <- "https://docs.google.com/spreadsheets/d/1aTJFKmkH2oxYcg0aiWeMAttGWKdM1u2KS5OyIlUkI6Q/edit?usp=sharing"
library(gsheet)
dataUnido <- gsheet2tbl(gs15x)

# Transforming values into numeric values
dataUnido$Value <- as.numeric(dataUnido$Value)
dataUnido$Tablecode <- as.numeric(dataUnido$Tablecode)
dataUnido$CountryCode <- as.numeric(dataUnido$CountryCode)
dataUnido$Year <- as.numeric(dataUnido$Year)
dataUnido$IsicCode <- as.numeric(dataUnido$IsicCode)
dataUnido$Unit <- NULL

# Data for the wine industry
library(dplyr)
dataUnidoWine <- filter(dataUnido, IsicCode == 1552)
dataUnidoWine <- filter(dataUnidoWine, Tablecode == 4)

# Cleaning the dataset into four subsets, one for each country
dataUnidoFrance <- filter(dataUnidoWine, CountryCode == 250)
dataUnidoItaly <- filter(dataUnidoWine, CountryCode == 380)
dataUnidoSpain <- filter(dataUnidoWine, CountryCode == 724)
dataUnidoUS <- filter(dataUnidoWine, CountryCode == 840)
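
To see what the cleaning and filtering steps do without downloading the data, here is a minimal sketch on a toy data frame. The toyUnido table and all of its values are made up for illustration; only the column names follow the UNIDO dataset. The sketch uses base R (subset() and split()) in place of dplyr so that it runs without any package.

```r
# Hypothetical toy data standing in for the UNIDO table (values are made up)
toyUnido <- data.frame(
  CountryCode = c("250", "250", "380", "380", "724"),
  IsicCode    = c("1552", "1552", "1552", "1551", "1552"),
  Tablecode   = c("4", "4", "4", "4", "5"),
  Value       = c("100", "...", "80", "60", "90"),
  stringsAsFactors = FALSE
)

# as.numeric() turns any text that is not a number into NA (with a warning)
toyUnido$Value <- suppressWarnings(as.numeric(toyUnido$Value))
codeColumns <- c("CountryCode", "IsicCode", "Tablecode")
toyUnido[codeColumns] <- lapply(toyUnido[codeColumns], as.numeric)

# Base R equivalent of the two successive filter() calls
toyWine <- subset(toyUnido, IsicCode == 1552 & Tablecode == 4)

# One subset per country, stored in a list named by country code
toyByCountry <- split(toyWine, toyWine$CountryCode)
names(toyByCountry)            # "250" "380"
toyByCountry[["250"]]$Value    # 100 NA
```

The NA created during the conversion is the reason missing values have to be handled before computing any statistic.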

Finally, we will present some descriptive statistics on the four subsets in the following table, with the use of inline R code.

Important: if your dataset contains NA values, they must be left out of the calculation. This can be achieved with the function na.omit().

For example, to obtain the minimum of all values for the number of employees in the Italian wine sector, call: min(na.omit(dataUnidoItaly$Value)).

min(na.omit(dataUnidoItaly$Value))
## [1] 9066

In 1994, there were 9066 employees in the Italian wine sector, the smallest number recorded for that sector in the dataset.
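
The effect of a missing value is easy to check on a small vector. In the sketch below, toyValues is a made-up vector, not UNIDO data. It also shows the na.rm = TRUE argument, which min(), max(), mean(), median(), sum(), var() and sd() all accept as an alternative to wrapping the data in na.omit().

```r
toyValues <- c(12, NA, 7, 30)   # hypothetical values, not UNIDO data

min(toyValues)                           # NA: one missing value hides the result
minOmit <- min(na.omit(toyValues))       # 7: NA dropped before the calculation
minRm   <- min(toyValues, na.rm = TRUE)  # 7: NA ignored by the function itself
```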

Following the same logic, you can produce a table with the minimum, average, median and maximum value for each country: instead of min, you write mean, median or max.

# | Country | Minimum | Average | Median | Maximum |
# | ------------------ | ------------------ | ------------------ | ------------------- | ------------------- |
# | **France** | `r min(na.omit(dataUnidoFrance$Value))` | `r mean(na.omit(dataUnidoFrance$Value))` | `r median(na.omit(dataUnidoFrance$Value))` | `r max(na.omit(dataUnidoFrance$Value))` |
# | **Italy** | `r min(na.omit(dataUnidoItaly$Value))` | `r mean(na.omit(dataUnidoItaly$Value))` | `r median(na.omit(dataUnidoItaly$Value))` | `r max(na.omit(dataUnidoItaly$Value))` |
# | **Spain** | `r min(na.omit(dataUnidoSpain$Value))` | `r mean(na.omit(dataUnidoSpain$Value))` | `r median(na.omit(dataUnidoSpain$Value))` | `r max(na.omit(dataUnidoSpain$Value))` |
# | **US** | `r min(na.omit(dataUnidoUS$Value))` | `r mean(na.omit(dataUnidoUS$Value))` | `r median(na.omit(dataUnidoUS$Value))` | `r max(na.omit(dataUnidoUS$Value))` |

This code will generate the following table:

| Country | Minimum | Average | Median | Maximum |
| --- | --- | --- | --- | --- |
| France | 1.0959 × 10^4 | 1.3223455 × 10^4 | 1.3861 × 10^4 | 1.4239 × 10^4 |
| Italy | 9066 | 1.2665 × 10^4 | 1.3913 × 10^4 | 1.5682 × 10^4 |
| Spain | 1.6878 × 10^4 | 1.7400333 × 10^4 | 1.7655 × 10^4 | 1.7668 × 10^4 |
| US | 1.8193 × 10^4 | 2.4975182 × 10^4 | 2.4093 × 10^4 | 3.4481 × 10^4 |
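
The powers of ten in this table come from the way inline R code prints larger numbers; with default options this appears to affect values of 10,000 and above, which would explain why 9066 is printed plainly. One possible way around it, sketched below, is to turn the number into text yourself with format(), since text is inserted as is. The value 13223.455 is the France average shown in the table above; the variable names are illustrative.

```r
franceMean <- 13223.455   # France average, taken from the table above

# Round to a whole number of employees and add a thousands separator
plainMean <- format(round(franceMean), big.mark = ",")
plainMean   # "13,223"

# In the table, the inline code for this cell would then read:
# `r format(round(mean(na.omit(dataUnidoFrance$Value))), big.mark = ",")`
```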

Code learned this week

| Command | Detail |
| --- | --- |
| min() | minimum value |
| median() | median value |
| max() | maximum value |
| mean() | mean value |
| sum() | sum of a dataset |
| var() | variance of a dataset |
| sd() | standard deviation of a dataset |
| summary() | set of descriptive statistics |
| describe() | from Hmisc and psych packages, set of descriptive statistics |
| stat.desc() | from pastecs package, set of descriptive statistics |

Acknowledgments

To cite this course:

Warin, Thierry. 2020. “SKEMA Quantum Studio: R Nanocourses.” doi:10.6084/m9.figshare.11842416.v1.