Text Mining with R

Preface

The methodology used in this course is based on the book Text Mining with R by Silge and Robinson (2019). This book serves as an introduction of text mining using the tidytext package and other tidy tools in R.

To provide an example of the use of Silge and Robinson (2019)‘s methodology, data were downloaded via Twitter. Three keywords were used to produced three datasets: one containing the tweets with the keyword ’artificial intelligence’, another containing the tweets with the keyword ‘artificialintelligence’ and the last one containing the tweets with the keyword ‘AI’.

These are the three datasets of Twitter used for this course. The one called tweetsArtificial_Intelligence.csv includes data with the keyword ‘artificial intelligence’. The second one called tweetsArtificialIntelligence.csv includes data with the keyword ‘artificialintelligence’. The third one called tweetsAI.csv includes data with the keyword ‘AI’.

tweetsArtificial_Intelligence <-read.csv("tweetsArtificial_Intelligence.csv")
tweetsArtificialIntelligence <-read.csv("tweetsArtificialIntelligence.csv")
tweetsAI <-read.csv("tweetsAI.csv")

To use the Silge and Robinson (2019)’s methodology, a cleaning is needed on the datasets. First, let’s bind the three data frames into one using bind_rows(). Then, four columns have to be removed using select() to keep unique tweets because the same tweet can be in the three data frames. As the counts of likes or retweets can change between two downloads, the columns containing this information is deleted.

library(dplyr)

tweets <- bind_rows(tweetsArtificialIntelligence, tweetsArtificial_Intelligence)
tweets <- bind_rows(tweets, tweetsAI)

tweets <- select(tweets, -replies_count, -retweets_count, -likes_count, -place)

tweets <- unique(tweets)

The second part of the cleaning is to remove retweet entities, @people, the punctuation, numbers, html links and pictwitter elements.

# remove retweet entities
tweets$tweet <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets$tweet)

# remove at people
tweets$tweet <- gsub("@\\w+", "", tweets$tweet)

# remove punctuation
tweets$tweet <- gsub("[[:punct:]]", "", tweets$tweet)

# remove numbers
tweets$tweet <- gsub("[[:digit:]]", "", tweets$tweet)

# remove html links
tweets$tweet <- gsub("http\\w+", "", tweets$tweet)

# remove all pictwitter
tweets$tweet <- gsub("pictwitter\\w+ *", "", tweets$tweet)

1 The tidy text format

The unnest_tokens function

[1]

Let’s only keep the column tweet containing the text and put it in a vector called text as in the book.

text <- tweets$tweet

head(text)

To analyse the character vector, we first need to turn it into a tidy text dataset. To do so, let’s put it into a tibble data frame. As described in the book, a tibble is a modern class of data frame within R, available in the dplyr and tibble packages, that has a convenient print method, will not convert strings to factors, and does not use row names. Tibbles are great for use with tidy tools.

text_df <- tibble(line = 1:length(text), text = text)

text_df

To break the text into individual tokens (one-token-per-row format) and transform it to a tidy data structure, we use the tidytext’s unnest_tokens() function.

library(tidytext)

tidy_tweets <- text_df %>% 
  unnest_tokens(word, text)

tidy_tweets 

Tidying the tweets of AI

[2]

Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr. To remove stop words (common words such as “the”, “of”, “to”) let’s use the anti_join() function.

data(stop_words)

tidy_tweets <- tidy_tweets %>%
  anti_join(stop_words)

Word frequencies

[3]

A common task in text mining is to look at word frequencies. We can use dplyr’s count() to find the most common words in all tweets.

tidy_tweets %>%
  count(word, sort = TRUE) 

Because of the use of tidy tools, the word counts are stored in a tidy data frame. This allows us to pipe this directly to the ggplot2 package, for example to create a visualization of the most common words.

library(ggplot2)

tidy_tweets %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() + 
  theme_light()

2 Sentiment analysis with tidy data

The topic of opinion mining or sentiment analysis is addressed in this section. We can use the tools of text mining to approach the emotional content of text programmatically.

Sentiment analysis with inner join

[1]

Now that the text is in a tidy format with one word per row, we are ready to do the sentiment analysis. First, let’s use the NRC lexicon and filter() for the joy words. Next, let’s use inner_join() to perform the sentiment analysis. Finally, let’s use count() to count the most common joy words.

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_tweets %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

Next, we count up how many positive and negative words there are in the tweets. We define an index here to keep track of where we are in the narrative; this index (using integer division) counts up sections of 100 tweets.

As the keywords for downloading the tweets were ‘artificialintelligence’, ‘artificial intelligence’ and ‘AI’, we need to extract them from the dataset to produce the following plot.

library(tidyr)

tweets_sentiment <- tidy_tweets %>%
  filter(!word %in% c("artificialintelligence", "intelligence", "artificial", "ai")) %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = line %/% 100, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
library(ggplot2)

ggplot(tweets_sentiment, aes(index, sentiment)) + 
  geom_col(show.legend = FALSE) + 
  theme_light()

Comparing the three sentiment dictionaries

[2]

With several options for sentiment lexicons, you might want some more information on which one is appropriate for your purposes. Let’s use all three sentiment lexicons and examine how the sentiment changes across the tweets.

tidy_tweets

Let’s again use integer division (%/%) to define larger sections of text that span multiple lines, and we can use the same pattern with count(), spread(), and mutate() to find the net sentiment in each of these sections of text.

afinn <- tidy_tweets %>% 
  filter(!word %in% c("artificialintelligence", "intelligence", "artificial", 'ai')) %>%
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = line %/% 100) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(method = "AFINN")

bing_and_nrc <- bind_rows(tidy_tweets %>% 
                            filter(!word %in% c("artificialintelligence", "intelligence", "artificial", "ai")) %>%
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          tidy_tweets %>% 
                            filter(!word %in% c("artificialintelligence", "intelligence", "artificial", "ai")) %>%
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive", "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = line %/% 100, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

We now have an estimate of the net sentiment (positive - negative) in each chunk of the novel text for each sentiment lexicon. Let’s bind them together and visualize them.

bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") + 
  theme_light()

Most common positive and negative words

[3]

One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.

bing_word_counts <- tidy_tweets %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts

This can be shown visually, and we can pipe straight into ggplot2, if we like, because of the way we are consistently using tools built for handling tidy data frames.

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + 
  theme_light()

The figure above lets us spot an anomaly in the sentiment analysis; the word “cloud” is coded as negative but it is used as a computing term to describe data centers available over the Internet. We can also observe that the world “intelligence” is coded as a positive word, which is correct, but it was one of the keyword used to download the data, so it has to be removed as it can biaise the analysis.

For that purpose, we could easily add “cloud” and “intelligence” to a custom stop-words list using bind_rows(). We could implement that with a strategy such as this.

custom_stop_words_sentiment <- bind_rows(tibble(word = c("cloud", "intelligence", "artificial", "artificialintelligence", "ai"), 
                                          lexicon = c("custom")), 
                               stop_words)

custom_stop_words_sentiment
bing_word_counts_custom <- bing_word_counts %>%
  anti_join(custom_stop_words_sentiment)

bing_word_counts_custom
bing_word_counts_custom %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + 
  theme_light()

Wordclouds

[4]

We’ve seen that this tidy text mining approach works well with ggplot2, but having our data in a tidy format is useful for other plots as well. For example, consider the wordcloud package, which uses base R graphics. Let’s look at the most common words in the tweets as a whole again, but this time as a wordcloud.

We need to produce a new custom stop word without extracting the word “cloud” because it is an important word but not for sentiment analysis as it is counted as a negative word.

custom_stop_words <- bind_rows(tibble(word = c("intelligence", "artificial", "artificialintelligence", "ai"), 
                                          lexicon = c("custom")), 
                               stop_words)

custom_stop_words
library(wordcloud)

tidy_tweets %>%
  anti_join(custom_stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 150))

In other functions, such as comparison.cloud(), you may need to turn the data frame into a matrix with reshape2’s acast(). Let’s do the sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words. Until the step where we need to send the data to comparison.cloud(), this can all be done with joins, piping, and dplyr because our data is in tidy format.

library(reshape2)

tidy_tweets %>%
  anti_join(custom_stop_words_sentiment) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 150)

The size of a word’s text is in proportion to its frequency within its sentiment. We can use this visualization to see the most important positive and negative words, but the sizes of the words are not comparable across sentiments.

Looking at units beyond just words

[5]

We can use tidy text analysis to found out the most negative tweets of the datasets. First, let’s get the list of negative words from the Bing lexicon. Second, let’s make a data frame of how many words are in each tweet so we can normalize for the length of tweets. Then, let’s find the number of negative words in each tweet and divide by the total words in each tweets.

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_tweets %>%
  group_by(line) %>%
  summarize(words = n())

tidy_tweets %>%
  semi_join(bingnegative) %>%
  group_by(line) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = "line") %>%
  mutate(ratio = negativewords/words) %>%
  ungroup()

3 Analyzing word and document frequency

In this section of the book, the code is used on Jane Austen’s novels. Indeed, each calculation is based on the different books. We do not have books, but we do have tweets. To continue this course, we merge the three datasets as before, but we add a column called keyword containing the keyword of each one. Instead of calculating the frequency per book, we are going to calculate the frequency per keyword.

tweets <- bind_rows(tweetsArtificial_Intelligence %>% 
                      mutate(keyword = "Artificial_Intelligence"),
                    tweetsArtificialIntelligence %>% 
                      mutate(keyword = "ArtificialIntelligence"), 
                    tweetsAI %>% 
                      mutate(keyword = "AI"))

tweets
# remove retweet entities
tweets$tweet <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets$tweet)
# remove at people
tweets$tweet <- gsub("@\\w+", "", tweets$tweet)
# remove punctuation
tweets$tweet <- gsub("[[:punct:]]", "", tweets$tweet)
# remove numbers
tweets$tweet <- gsub("[[:digit:]]", "", tweets$tweet)
# remove html links
tweets$tweet <- gsub("http\\w+", "", tweets$tweet)
# remove all pictwitter
tweets$tweet <- gsub("pictwitter\\w+ *", "", tweets$tweet)
text <- select(tweets, tweet, keyword)

text_df <- as_tibble(text)

text_df

Term frequency in AI tweets

[1]

Let’s start by examining first term frequency, then tf-idf. We can start just by using dplyr verbs such as group_by() and join(). Let’s calculate the most commonly used words and also the total words in each tweet, for later use

library(dplyr)
library(tidytext)

tweets_words <- text_df %>%
  unnest_tokens(word, tweet) %>%
  count(keyword, word, sort = TRUE)

total_words <- tweets_words %>% 
  group_by(keyword) %>% 
  summarize(total = sum(n))

tweets_words <- left_join(tweets_words, total_words)

tweets_words

There is one row in this tweetsAI_words data frame for each word-tweet combination; n is the number of times that word is used in that tweet and total is the total words in that tweet. The usual suspects are here with the highest n, “the”, “and”, “to”, and so forth.

In the following figure, let’s look at the distribution of n/total for each tweet, the number of times a word appears in a tweet divided by the total number of terms (words) in that tweet. This is exactly what term frequency is.

library(ggplot2)
  
ggplot(tweets_words, aes(n/total, fill = keyword)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.002) +
  facet_wrap(~keyword, ncol = 1, scales = "free_y") + 
  theme_light()

Zipf’s law

[2]

Since we have the data frame we used to plot term frequency, we can examine Zipf’s law for the AI tweets with just a few lines of dplyr functions.

freq_by_rank <- tweets_words %>% 
  group_by(keyword) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total)

freq_by_rank

The rank column here tells us the rank of each word within the frequency table; the table was already ordered by n so we could use row_number() to find the rank. Then, we can calculate the term frequency in the same way we did before.

Let’s visualise the Zipf’s law by plotting rank on the x-axis and term frequency on the y-axis, on logarithmic scales.

freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = keyword)) + 
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10() + 
  theme_light()

Notice that this figure is in log-log coordinates. We see that all three keywords are quite similar to each other, and that the relationship between rank and frequency does have negative slope.

Let’s see what the exponent of the power law is for the top of the rank range.

rank_subset <- freq_by_rank %>% 
  filter(rank < 500,
         rank > 1)

lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)

The slope is close to -1 here.

freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = keyword)) + 
  geom_abline(intercept = -0.62, slope = -1.1, color = "gray50", linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10() + 
  theme_light()

The deviations we see here at high rank are not uncommon for many kinds of language; a corpus of language often contains fewer rare words than predicted by a single power law.

The bind_tf_idf function

[3]

Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common. Let’s do that now.

The bind_tf_idf function in the tidytext package takes a tidy text dataset as input with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (keyword in this case), and the last necessary column contains the counts, how many times each document contains each term (n in this example). We calculated a total for each keyword for our explorations in previous sections, but it is not necessary for the bind_tf_idf function; the table only needs to contain all the words in each document.

tweets_words <- tweets_words %>%
  bind_tf_idf(word, keyword, n)

tweets_words

Notice that idf and thus tf-idf are zero for these extremely common words.

Let’s look at terms with high tf-idf in Jane Austen’s works.

tweets_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))
tweets_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(keyword) %>% 
  top_n(5) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = keyword)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~keyword, ncol = 1, scales = "free") +
  coord_flip() + 
  theme_light()

4 Relationships between words

Tokenizing by n-gram

[1]

We’ve been using the unnest_tokens function to tokenize by word, or sometimes by sentence, which is useful for the kinds of sentiment and frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.

We do this by adding the token = "ngrams" option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of two consecutive words, often called “bigrams”:

library(dplyr)
library(tidytext)

tweets_bigrams <- text_df %>%
  unnest_tokens(bigram, tweet, token = "ngrams", n = 2)

tweets_bigrams

Counting and filtering n-grams

Our usual tidy tools apply equally well to n-gram analysis. We can examine the most common bigrams using dplyr’s count():

tweets_bigrams %>%
  count(bigram, sort = TRUE)

As one might expect, a lot of the most common bigrams are pairs of our keywords, such as of artificial and intelligence: what we call our custom “stop-words”. This is a useful time to use tidyr’s separate(), which splits a column into multiple based on a delimiter. This lets us separate it into two columns, “word1” and “word2”, at which point we can remove cases where either is a custom stop-word.

library(tidyr)

bigrams_separated <- tweets_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% custom_stop_words$word) %>%
  filter(!word2 %in% custom_stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts

In other analyses, we may want to work with the recombined words. tidyr’s unite() function is the inverse of separate(), and lets us recombine the columns into one. Thus, “separate/filter/count/unite” let us find the most common bigrams not containing our custom stop-words.

bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

bigrams_united

In other analyses you may be interested in the most common trigrams, which are consecutive sequences of 3 words. We can find this by setting n = 3:

text_df %>%
  unnest_tokens(trigram, tweet, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% custom_stop_words$word,
         !word2 %in% custom_stop_words$word,
         !word3 %in% custom_stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)

Analyzing bigrams

This one-bigram-per-row format is helpful for exploratory analyses of the text. As a simple example, we might be interested in the most common keyword “datascience” mentioned in each keyword:

bigrams_filtered %>%
  filter(word2 == "datascience") %>%
  count(keyword, word1, sort = TRUE)

A bigram can also be treated as a term in a document in the same way that we treated individual words. For example, we can look at the tf-idf of bigrams across the three keywords. These tf-idf values can be visualized within each keyword, just as we did for words.

bigram_tf_idf <- bigrams_united %>%
  count(keyword, bigram) %>%
  bind_tf_idf(bigram, keyword, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>% 
  group_by(keyword) %>% 
  top_n(5) %>% 
  ungroup() %>%
  ggplot(aes(bigram, tf_idf, fill = keyword)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf of bigram to keyword") +
  facet_wrap(~keyword, ncol = 2, scales = "free") +
  coord_flip() + 
  theme_light()

Using bigrams to provide context in sentiment analysis

Now that we have the data organized into bigrams, it’s easy to tell how often words are preceded by a word like “not”:

bigrams_separated %>%
  filter(word1 == "not") %>%
  count(word1, word2, sort = TRUE)

By performing sentiment analysis on the bigram data, we can examine how often sentiment-associated words are preceded by “not” or other negating words. We could use this to ignore or even reverse their contribution to the sentiment score.

Let’s use the AFINN lexicon for sentiment analysis, which you may recall gives a numeric sentiment value for each word, with positive or negative numbers indicating the direction of the sentiment.

AFINN <- get_sentiments("afinn")

AFINN

We can then examine the most frequent words that were preceded by “not” and were associated with a sentiment.

not_words <- bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, value, sort = TRUE)

not_words

For example, the most common sentiment-associated word to follow “not” was “bad”, which have a negative score of 3.

It’s worth asking which words contributed the most in the “wrong” direction. To compute that, we can multiply their value by the number of times they appear (so that a word with a value of +3 occurring 10 times has as much impact as a word with a sentiment value of +1 occurring 30 times). We visualize the result with a bar plot.

library(ggplot2)

not_words %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by \"not\"") +
  ylab("Sentiment value * number of occurrences") +
  coord_flip() + 
  theme_light()

The bigrams “not bad”, “not hate” and “not kill” were overwhelmingly the largest causes of misidentification, making the text seem much more negative than it is. But we can see phrases like “not win” and “not popular” sometimes suggest text is more positive than it is.

“Not” isn’t the only term that provides some context for the following word. We could pick four common words (or more) that negate the subsequent term, and use the same joining and counting approach to examine all of them at once.

negation_words <- c("not", "no", "without")

negated_words <- bigrams_separated %>%
  filter(word1 %in% negation_words) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, value, sort = TRUE)

We could then visualize what the most common words to follow each particular negation are.

negated_words %>%
  mutate(contribution = n * value) %>%
  arrange(desc(contribution)) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by \"not\"") +
  ylab("Sentiment value * number of occurrences") +
  coord_flip() + 
  facet_wrap(~word1, ncol = 2, scales = "free") + 
  theme_light()

While “not bad”, “not hate” and “not kill” are still the two most common examples, we can also see pairings such as “no doubt” and “without bias”.

Visualizing a network of bigrams with ggraph

We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or “graph”.

The igraph package has many powerful functions for manipulating and analyzing networks. One way to create an igraph object from tidy data is the graph_from_data_frame() function, which takes a data frame of edges with columns for “from”, “to”, and edge attributes (in this case n):

library(igraph)

# original counts
bigram_counts

# filter for only relatively common combinations
bigram_graph <- bigram_counts %>%
  filter(n > 20) %>%
  graph_from_data_frame()

bigram_graph

We can convert an igraph object into a ggraph with the ggraph function, after which we add layers to it, much like layers are added in ggplot2. For example, for a basic graph we need to add three layers: nodes, edges, and text.

library(ggraph)
set.seed(2017)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

We can add a few polishing options to make a better looking graph.

set.seed(2016)

a <- grid::arrow(type = "closed", length = unit(.05, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.05, 'inches')) +
  geom_node_point(color = "lightblue", size = 2) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

Counting and correlating pairs of words with the widyr package

[2]

We’ll examine some of the ways tidy text can be turned into a wide matrix, but in this case it isn’t necessary.

Counting and correlating among sections

Consider the keyword “Artificial_Intelligence” divided into 10-line sections, as we did (with larger sections) for sentiment analysis in Chapter 2. We may be interested in what words tend to appear within the same section.

tweets_keyword_words <- text_df %>%
  filter(keyword == "ArtificialIntelligence") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, tweet) %>%
  filter(!word %in% custom_stop_words$word)

tweets_keyword_words
library(widyr)

word_pairs <- tweets_keyword_words %>%
  pairwise_count(word, section, sort = TRUE)

word_pairs
word_pairs %>%
  filter(item1 == "datascience")

Pairwise correlation

The pairwise_cor() function in widyr lets us find the phi coefficient between words based on how often they appear in the same section. Its syntax is similar to pairwise_count().

word_cors <- tweets_keyword_words %>%
  group_by(word) %>%
  filter(n() >= 50) %>%
  pairwise_cor(word, section, sort = TRUE)

word_cors

This output format is helpful for exploration. For example, we could find the words most correlated with a word like “finance” using a filter operation.

word_cors %>%
  filter(item1 == "threat")

This lets us pick particular interesting words and find the other words most associated with them.

word_cors %>%
  filter(item1 %in% c("job", "education", "ethics", "health")) %>%
  group_by(item1) %>%
  top_n(6) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~item1, scales = "free") +
  coord_flip() + 
  theme_light()

Just as we used ggraph to visualize bigrams, we can use it to visualize the correlations and clusters of words that were found by the widyr package.

set.seed(2016)

word_cors %>%
  filter(correlation > .8) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

5 Converting to and from non-tidy formats

Tidying a document-term matrix

[1]

Tidying DocumentTermMatrix objects

Let’s use the tidy_tweets for this section.

tweetsAI_td <- tidy_tweets %>%
  count(word, sort = TRUE) 

tweetsAI_td

As we’ve seen in previous chapters, this form is convenient for analysis with the dplyr, tidytext and ggplot2 packages. For example, you can perform sentiment analysis on these tweets with the approach described in Chapter 2.

tweetsAI_sentiments <- tweetsAI_td %>%
  inner_join(get_sentiments("bing"), by = "word")

tweetsAI_sentiments

This would let us visualize which words from the tweets most often contributed to positive or negative sentiment. We can see that the most common positive words include “innovation”, “top”, “free”, and “improve”, while the most negative words include “cloud”, “die” and “threat”. We removed the word “intelligence” as it was a keyword used to load the tweets. The inclusion of “cloud” as a negative term is probably a mistake on the algorithm’s part, since it likely usually refers to a computing term to describe data centers available over the Internet.

library(ggplot2)

tweetsAI_sentiments <- tweetsAI_sentiments %>%
  filter(!word %in% c("intelligence"))

tweetsAI_sentiments %>%
  count(sentiment, word, wt = n) %>%
  ungroup() %>%
  filter(n >= 60) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  ylab("Contribution to sentiment") +
  coord_flip() + 
  theme_light()

Casting tidy text data into a matrix

[2]

tidytext provides cast_ verbs for converting from a tidy form to these matrices.

For example, we could take the tidied tidy_tweets dataset and cast it back into a document-term matrix using the cast_dtm() function.

tidy_tweets <- text_df %>%
  unnest_tokens(word, tweet) %>%
  anti_join(custom_stop_words) %>%
  count(keyword, word) 
tweets_dtm <- tidy_tweets %>%
  cast_dtm(keyword, word, n)

tweets_dtm

6 Topic modeling

Latent Dirichlet allocation

[1]

We can use the LDA() function from the topicmodels package, setting k = 2, to create a two-topic LDA model.

This function returns an object containing the full details of the model fit, such as how words are associated with topics and how topics are associated with documents.

library(topicmodels)

tweets_lda <- LDA(tweets_dtm, k = 2, control = list(seed = 1234))
tweets_lda

Word-topic probabilities

In Chapter 5 we introduced the tidy() method, originally from the broom package (Robinson 2017), for tidying model objects. The tidytext package provides this method for extracting the per-topic-per-word probabilities, called
β (“beta”), from the model.

library(tidytext)

tweets_topics <- tidy(tweets_lda, matrix = "beta")
tweets_topics

Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic.

We could use dplyr’s top_n() to find the 10 terms that are most common within each topic. As a tidy data frame, this lends itself well to a ggplot2 visualization.

library(ggplot2)
library(dplyr)

tweets_top_terms <- tweets_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

tweets_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered() + 
  theme_light()

As an alternative, we could consider the terms that had the greatest difference in $ $ between topic 1 and topic 2. To constrain it to a set of especially relevant words, we can filter for relatively common words, such as those that have a $ $ greater than 1/1000 in at least one topic.

library(tidyr)

beta_spread <- tweets_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread

The words with the greatest differences between the two topics are visualized in this figure.

beta_spread %>%
  count(term, wt = log_ratio) %>%
  ungroup() %>%
  filter(n >= 3 | n <= -3.5) %>%
  mutate(term = reorder(term, n)) %>%
  ggplot(aes(term, n)) +
  geom_bar(stat = "identity") +
  ylab("Log2 ratio of beta in topic 2 / topic 1") +
  coord_flip() + 
  theme_light()

We can see that the words more common in topic 2 are “health” and “innovation”, as well as words such as “analysis” and “medical”. Topic 1 was more characterized by words like “solve” and “government”, as well as financial terms such as “sales” and “finance”.

Document-topic probabilities

Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called $ $ (“gamma”), with the matrix = "gamma" argument to tidy().

tweets_documents <- tidy(tweets_lda, matrix = "gamma")
tweets_documents

Each of these values is an estimated proportion of words from that document that are generated from that topic. For example, the model estimates that about 56% of the words in document 1 were generated from topic 1.

We can see that the keywords were drawn from a mix of the two topics. We can use tidy_tweets to check what the most common words in a keyword are.

tidy_tweets %>%
  filter(keyword == "ArtificialIntelligence") %>%
  arrange(desc(n))

7 Case study

For this chapter we continue to use the same datasets with tweets on AI.

Getting the data and distribution of tweets

[1]

Let’s use the tidyr package to convert the column time into three columns: hour, minute, second. As the tweets were downloaded in a short period (less than 48 hours), we dispose the tweets in the graph per hour instead of per year like in the book.

library(ggplot2)
library(dplyr)
library(readr)
library(tidyr)

tweetsArtificial_Intelligence <-read.csv("tweetsArtificial_Intelligence.csv")
tweetsArtificialIntelligence <-read.csv("tweetsArtificialIntelligence.csv")
tweetsAI <-read.csv("tweetsAI.csv")

tweets <- bind_rows(tweetsArtificial_Intelligence %>% 
                      mutate(keyword = "Artificial_Intelligence"),
                    tweetsArtificialIntelligence %>% 
                      mutate(keyword = "ArtificialIntelligence"),
                    tweetsAI %>% 
                      mutate(keyword = "AI"))

tweets <- separate(tweets, time, c("hour", "minute", "second"), sep = ":")
tweets$hour <- as.numeric(tweets$hour)

# remove retweet entities
tweets$tweet <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets$tweet)
# remove at people
tweets$tweet <- gsub("@\\w+", "", tweets$tweet)
# remove punctuation
tweets$tweet <- gsub("[[:punct:]]", "", tweets$tweet)
# remove numbers
tweets$tweet <- gsub("[[:digit:]]", "", tweets$tweet)
# remove html links
tweets$tweet <- gsub("http\\w+", "", tweets$tweet)
# remove all pictwitter
tweets$tweet <- gsub("pictwitter\\w+ *", "", tweets$tweet)

custom_stop_words <- bind_rows(tibble(word = c("intelligence", "artificial", "artificialintelligence", "ai"), 
                                      lexicon = c("custom")), 
                               stop_words)
ggplot(tweets, aes(x = hour, fill = keyword)) +
  geom_histogram(position = "identity", bins = 20, show.legend = FALSE) +
  facet_wrap(~keyword, ncol = 1, scales = "free") +
  theme_light()

We can see that the tweets of each keyword are tweeted approximatively at the same hour with the same peaks.

Word frequencies

Let’s use unnest_tokens() to make a tidy data frame of all the words in our tweets, and remove the common English stop words and our custom stop words. There are certain conventions in how people use text on Twitter, so we will use a specialized tokenizer and do a bit more work with our text here than, for example, we did before.

The mutate() line cleans out some characters that we don’t want like ampersands and such.

We can take the approach shown in the filter() line that uses str_detect() from the stringr package to remove stop words.

library(tidytext)
library(stringr)

remove_reg <- "&amp;|&lt;|&gt;"
tidy_tweets <- tweets %>% 
  mutate(text = str_remove_all(tweet, remove_reg)) %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% custom_stop_words$word,
         !word %in% str_remove_all(custom_stop_words$word, "'"),
         str_detect(word, "[a-z]")) 

Now we can calculate word frequencies for each keyword. First, we group by keyword and count how many times each word is used in each keyword. Then we use left_join() to add a column of the total number of words used in each keyword. (This is higher for the keyword ‘ArtificialIntelligence’ since its the most used keyword.) Finally, we calculate a frequency for each keyword and word.

frequency <- tidy_tweets %>% 
  group_by(keyword) %>% 
  count(word, sort = TRUE) %>% 
  left_join(tidy_tweets %>% 
              group_by(keyword) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)

frequency

This is a nice and tidy data frame but we would actually like to plot those frequencies on the x- and y-axes of a plot, so we will need to use spread() from tidyr make a differently shaped data frame.

library(tidyr)

frequency <- frequency %>% 
  select(keyword, word, freq) %>% 
  spread(keyword, freq) %>%
  arrange(Artificial_Intelligence, ArtificialIntelligence, AI)

frequency

Now this is ready for us to plot. Let’s use geom_jitter() so that we don’t see the discreteness at the low end of frequency as much, and check_overlap = TRUE so the text labels don’t all print out on top of each other (only some will print).

library(scales)

ggplot(frequency, aes(Artificial_Intelligence, ArtificialIntelligence)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

Comparing word usage

We just made a plot comparing raw word frequencies over our tweets on AI; now let’s find which words are more or less likely to come from each keyword using the log odds ratio. First, let’s restrict the analysis moving forward to tweets from only 24 hours.

tidy_tweets <- tidy_tweets %>%
  filter(date == "2019-11-19")

tidy_tweets$date <- as.Date(tidy_tweets$date, format = "%Y-%m-%d")

Next, let’s use str_detect() to remove Twitter usernames from the word column. After removing these, we count how many times each person uses each word and keep only the words used more than 10 times. After a spread() operation, we can calculate the log odds ratio for each word.

word_ratios <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  count(word, keyword) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(keyword, n, fill = 0) %>%
  mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(Artificial_Intelligence / ArtificialIntelligence)) %>%
  arrange(desc(logratio))

What are some words that have been about equally likely to come from each keywords during the period selected?

word_ratios %>% 
  arrange(abs(logratio))
word_ratios %>%
  group_by(logratio < 0) %>%
  top_n(10, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (Artificial_Intelligence/ArtificialIntelligence)") +
  scale_fill_discrete(name = "", labels = c("Artificial_Intelligence", "ArtificialIntelligence")) + 
  theme_light()

So the tweets with the keyword ‘artificial intelligence’ is used when people talk about text analytics, datavisualization, analyses, while the tweets with the keyword ‘artificialintelligence’ are used when people talk about iot, artificial intelligence applications (such as machine learning, deep learning, neural networks) and jobs.

Changes in word use

Which words have been tweeted at a higher or lower rate over time? To do this, we will define a new time variable in the data frame that defines which unit of time each tweet was posted in. We can use floor_date() from lubridate to do this, with a unit of our choosing; using day seems to work well.

After we have the time bins defined, we count how many times each word in each time bin was used. After that, we add columns to the data frame for the total number of words used in each time bin and the total number of times each word was used. We can then filter() to only keep words used at least some minimum number of times (30, in this case).

library(lubridate)

words_by_time <- tidy_tweets %>%
  unite("time", hour, minute, second, sep = ":") %>%
  unite("timestamp", date, time, sep = " ")

words_by_time$timestamp <- ymd_hms(words_by_time$timestamp)

words_by_time <- words_by_time %>%
  filter(!str_detect(word, "^@")) %>%
  mutate(time_floor = floor_date(timestamp, unit = "day")) %>%
  count(time_floor, keyword, word) %>%
  group_by(keyword, time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(keyword, word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 30)

words_by_time

Each row in this data frame corresponds to one keyword containing one word in a given time bin. The count column tells us how many times that keyword contains that word in that time bin, the time_total column tells us how many words that keyword contains during that time bin, and the word_total column tells us how many times that keyword contains that word over the whole day. This is the data set we can use for modeling.

We can use nest() from tidyr to make a data frame with a list column that contains little miniature data frames for each word. Let’s do that now and take a look at the resulting structure.

nested_data <- words_by_time %>%
  nest(-word, -keyword) 

nested_data

This data frame has one row for each keyword-word combination; the data column is a list column that contains data frames, one for each combination of keyword and word. Let’s use map() from purrr (Henry and Wickham, 2018) to apply our modeling procedure to each of those little data frames inside our big data frame. This is count data so let’s use glm() with family = “binomial” for modeling.

library(purrr)

nested_models <- nested_data %>%
  mutate(models = map(data, ~ glm(cbind(count, time_total) ~ time_floor, ., 
                                  family = "binomial")))

nested_models

Now notice that we have a new column for the modeling results; it is another list column and contains glm objects. The next step is to use map() and tidy() from the broom package to pull out the slopes for each of these models and find the important ones. We are comparing many slopes here and some of them are not statistically significant, so let’s apply an adjustment to the p-values for multiple comparisons.

library(broom)

slopes <- nested_models %>%
  mutate(models = map(models, tidy)) %>%
  unnest(cols = c(models)) %>%
  mutate(adjusted.p.value = p.adjust(p.value))

Now let’s find the most important slopes. Which words have changed in frequency at a moderately significant level in our tweets?

top_slopes <- slopes %>% 
  filter(adjusted.p.value < 0.05)

top_slopes

To visualize our results, we can plot these words’ use for the keyword ‘ArtificialIntelligence’.

words_by_time %>%
  inner_join(top_slopes, by = c("word", "keyword")) %>%
  filter(keyword == "ArtificialIntelligence") %>%
  mutate(count_time = count/time_total) %>%
  top_n(10, count_time) %>%
  ggplot(aes(time_floor, count_time, color = word)) +
  geom_point(size = 3) +
  labs(x = NULL, y = "Word frequency") + 
  theme_light()

Favorites and retweets

To start with, let’s look at the number of times each of our tweets was retweeted. Let’s find the total number of retweets for each keyword.

totals <- tidy_tweets %>% 
  group_by(keyword, id) %>% 
  summarise(rts = first(retweets_count)) %>% 
  group_by(keyword) %>% 
  summarise(total_rts = sum(rts))

totals

Now let’s find the median number of retweets for each word and keyword. We probably want to count each tweet/word combination only once, so we will use group_by() and summarise() twice, one right after the other. The first summarise() statement counts how many times each word was retweeted, for each tweet and keyword. In the second summarise() statement, we can find the median retweets for each keyword and word, also count the number of times each word was used for each keyword and keep that in uses. Next, we can join this to the data frame of retweet totals. Let’s filter() to only keep words mentioned at least 5 times.

word_by_rts <- tidy_tweets %>% 
  group_by(id, word, keyword) %>% 
  summarise(rts = first(retweets_count)) %>% 
  group_by(keyword, word) %>% 
  summarise(retweets = median(rts), uses = n()) %>%
  left_join(totals) %>%
  filter(retweets != 0) %>%
  ungroup()

word_by_rts %>% 
  filter(uses >= 5) %>%
  arrange(desc(retweets))

At the top of this sorted data frame, we see that the biggest retweet concerns the word secretive for the keyword ‘artificial intelligence’. Let’s plot the words that have the highest median retweets for each of keyword.

word_by_rts %>%
  filter(uses >= 5) %>%
  group_by(keyword) %>%
  top_n(5, retweets) %>%
  arrange(retweets) %>%
  ungroup() %>%
  mutate(word = factor(word, unique(word))) %>%
  ungroup() %>%
  ggplot(aes(word, retweets, fill = keyword)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ keyword, scales = "free", ncol = 2) +
  coord_flip() +
  labs(x = NULL, 
       y = "Median # of retweets for tweets containing each word") +
  theme_light()

We can follow a similar procedure to see which words led to more favorites. Are they different than the words that lead to more retweets?

totals <- tidy_tweets %>% 
  group_by(keyword, id) %>% 
  summarise(favs = first(likes_count)) %>% 
  group_by(keyword) %>% 
  summarise(total_favs = sum(favs))

word_by_favs <- tidy_tweets %>% 
  group_by(id, word, keyword) %>% 
  summarise(favs = first(likes_count)) %>% 
  group_by(keyword, word) %>% 
  summarise(favorites = median(favs), uses = n()) %>%
  left_join(totals) %>%
  filter(favorites != 0) %>%
  ungroup()

We have built the data frames we need. Now let’s make our visualization in the following figure.

word_by_favs %>%
  filter(uses >= 5) %>%
  group_by(keyword) %>%
  top_n(5, favorites) %>%
  arrange(favorites) %>%
  ungroup() %>%
  mutate(word = factor(word, unique(word))) %>%
  ungroup() %>%
  ggplot(aes(word, favorites, fill = keyword)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ keyword, scales = "free", ncol = 2) +
  coord_flip() +
  labs(x = NULL, 
       y = "Median # of likes for tweets containing each word") +
  theme_light()

References

Silge, Julia, and David Robinson. 2019. Text Mining with R. https://www.tidytextmining.com/

Acknowledgments

To cite this course:

Warin, Thierry. 2020. “SKEMA Quantum Studio: R Courses.” doi:10.6084/m9.figshare.11744013.v1.