Topic Modeling on Gutenberg books

Last updated on Aug 2, 2020 11 min read

Topic modeling is an unsupervised machine learning technique for text data. It scans a set of documents(book chapters in this case), detecting word and phrase patterns within them, and clustering word groups and similar expressions that best characterize a set of documents.

Latent Dirichlet Allocation (LDA) is an example of topic modeling algorithm used to cluster text in a document to a particular topic. It is guided by two principles - every document is a mixture of topics and every topic is a mixture of words.

Suppose we have four books from Project Gutenberg library sub-categories: Crime, Music, Astronomy and Revolution. We tore the books into pages and shuffled without reading them.

Let’s use topic modeling using LDA algorithm to see whether it can correctly distinguish the four books.

Step-by-step explanations are as follows:

library(pacman)
p_load(topicmodels,gutenbergr,tidyverse,tidytext,stringr,scales)

1.Book Titles

# four books for topic models
titles <- c("Buccaneers and Pirates of Our Coasts", "Beethoven", "Astronomy for Amateurs","The Psychology of Revolution")
titles

## [1] "Buccaneers and Pirates of Our Coasts"
## [2] "Beethoven"                           
## [3] "Astronomy for Amateurs"              
## [4] "The Psychology of Revolution"

# retrieve books from gutenbergr
books <- gutenberg_download(c(448,15141,25267,17188), meta_fields = "title")

2.Pre-Processing

# divide into documents, each representing one chapter
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)

# split into words
by_chapter_word <- by_chapter %>%
  unnest_tokens(word, text)

# find document-word counts
word_counts <- by_chapter_word %>%
  anti_join(stop_words) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()

## Joining, by = "word"

word_counts

## # A tibble: 67,284 x 3
##    document                                                       word         n
##    <chr>                                                          <chr>    <int>
##  1 "Beethoven, a character study\r\nTogether with Wagner's indeb… beethov…    87
##  2 "Astronomy for Amateurs_11"                                    distance    77
##  3 "Astronomy for Amateurs_4"                                     sun         66
##  4 "Astronomy for Amateurs_9"                                     moon        64
##  5 "Astronomy for Amateurs_3"                                     stars       63
##  6 "Astronomy for Amateurs_11"                                    sun         58
##  7 "Astronomy for Amateurs_8"                                     earth       56
##  8 "Astronomy for Amateurs_10"                                    sun         54
##  9 "Astronomy for Amateurs_2"                                     stars       53
## 10 "The Psychology of Revolution_53"                              revolut…    52
## # … with 67,274 more rows

3.LDA on Chapters

# convert tidy data to document term matrix
chapters_dtm <- word_counts %>% cast_dtm(document, word, n)
chapters_dtm

## <<DocumentTermMatrix (documents: 117, terms: 16265)>>
## Non-/sparse entries: 67284/1835721
## Sparsity           : 96%
## Maximal term length: 41
## Weighting          : term frequency (tf)

# create topic model with LDA function for four books, k = 4
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1999))
chapters_lda

## A LDA_VEM topic model with 4 topics.

#per-topic-per-word probabilities : beta
chapter_topics <- tidy(chapters_lda, matrix = "beta")
chapter_topics

## # A tibble: 65,060 x 3
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 beethoven 6.55e-135
##  2     2 beethoven 4.39e- 18
##  3     3 beethoven 1.34e-132
##  4     4 beethoven 2.09e-  2
##  5     1 distance  7.99e-  4
##  6     2 distance  9.49e-  5
##  7     3 distance  7.75e-  3
##  8     4 distance  7.62e-  5
##  9     1 sun       8.19e-  5
## 10     2 sun       3.16e-  5
## # … with 65,050 more rows

For example, the term “beethoven” has zero probability of being generated from topics 1, 3, or 4, but it makes up 2% of topic 2.

# top 10 terms in each topic
top_terms <- chapter_topics %>% group_by(topic) %>%
  top_n(10, beta) %>% ungroup() %>%
  arrange(topic, -beta)
top_terms

## # A tibble: 40 x 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     1 pirate     0.0117 
##  2     1 pirates    0.0108 
##  3     1 town       0.00865
##  4     1 vessel     0.00851
##  5     1 ship       0.00733
##  6     1 buccaneers 0.00622
##  7     1 time       0.00606
##  8     1 captain    0.00602
##  9     1 spanish    0.00585
## 10     1 people     0.00483
## # … with 30 more rows

# visualize top terms from each topic
top_terms %>% mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = F) + facet_wrap(~topic, scales = "free") +
  coord_flip() + scale_x_reordered() + 
  labs(caption = "Fig:The terms that are most common within each topic") + theme(plot.caption = element_text(hjust = 0.5, size = 10))

The topics are clearly associated with the four books.

4.Per-document Classification

#per-document-per-topic probabilities: gamma
chapters_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_gamma

## # A tibble: 468 x 3
##    document                                                      topic     gamma
##    <chr>                                                         <int>     <dbl>
##  1 "Beethoven, a character study\r\nTogether with Wagner's inde…     1   4.05e-6
##  2 "Astronomy for Amateurs_11"                                       1   7.54e-6
##  3 "Astronomy for Amateurs_4"                                        1   9.32e-6
##  4 "Astronomy for Amateurs_9"                                        1   1.05e-5
##  5 "Astronomy for Amateurs_3"                                        1   8.14e-6
##  6 "Astronomy for Amateurs_8"                                        1   8.58e-6
##  7 "Astronomy for Amateurs_10"                                       1   8.42e-6
##  8 "Astronomy for Amateurs_2"                                        1   9.93e-6
##  9 "The Psychology of Revolution_53"                                 1   7.75e-6
## 10 "Buccaneers and Pirates of Our Coasts_33"                         1   1.00e+0
## # … with 458 more rows

Astronomy for Amateurs_11 document has 0% probability of coming from topic 1(Buccaneers and Pirates of Our Coasts).

#separate out chapter and title
chapters_gamma <- chapters_gamma %>%
  separate(document, c("title", "chapter"), sep= "_", convert = TRUE)
chapters_gamma

## # A tibble: 468 x 4
##    title                                                 chapter topic     gamma
##    <chr>                                                   <int> <int>     <dbl>
##  1 "Beethoven, a character study\r\nTogether with Wagne…      19     1   4.05e-6
##  2 "Astronomy for Amateurs"                                   11     1   7.54e-6
##  3 "Astronomy for Amateurs"                                    4     1   9.32e-6
##  4 "Astronomy for Amateurs"                                    9     1   1.05e-5
##  5 "Astronomy for Amateurs"                                    3     1   8.14e-6
##  6 "Astronomy for Amateurs"                                    8     1   8.58e-6
##  7 "Astronomy for Amateurs"                                   10     1   8.42e-6
##  8 "Astronomy for Amateurs"                                    2     1   9.93e-6
##  9 "The Psychology of Revolution"                             53     1   7.75e-6
## 10 "Buccaneers and Pirates of Our Coasts"                     33     1   1.00e+0
## # … with 458 more rows

# reorder titles and plot
chapters_gamma %>% mutate(title = reorder(title, gamma*topic)) %>%
  ggplot(aes(factor(topic),gamma)) + geom_boxplot() + facet_wrap(~title) +
  labs(caption = "Fig: The gamma probabilities for each chapter within each book") + theme(plot.caption = element_text(hjust = 0.5, size = 12))

It appears all of the chapters are uniquely identified as a single topic.

#topic most associated with a chapter 
chapter_classifications <- chapters_gamma %>%
  group_by(title, chapter) %>% top_n(1, gamma) %>%
  ungroup()
chapter_classifications

## # A tibble: 117 x 4
##    title                                chapter topic gamma
##    <chr>                                  <int> <int> <dbl>
##  1 Buccaneers and Pirates of Our Coasts      33     1 1.00 
##  2 Buccaneers and Pirates of Our Coasts      11     1 1.00 
##  3 Buccaneers and Pirates of Our Coasts      31     1 1.00 
##  4 Buccaneers and Pirates of Our Coasts      32     1 0.981
##  5 Buccaneers and Pirates of Our Coasts      30     1 1.00 
##  6 Buccaneers and Pirates of Our Coasts      20     1 1.00 
##  7 Buccaneers and Pirates of Our Coasts      21     1 1.00 
##  8 Buccaneers and Pirates of Our Coasts      23     1 1.00 
##  9 Buccaneers and Pirates of Our Coasts      15     1 1.00 
## 10 Buccaneers and Pirates of Our Coasts      16     1 1.00 
## # … with 107 more rows

# consensus topics
book_topics <- chapter_classifications %>% count(title, topic) %>%
  group_by(title) %>% top_n(1,n) %>% ungroup() %>%
  transmute(consensus = title, topic)
book_topics%>% arrange((topic))

## # A tibble: 4 x 2
##   consensus                                                                topic
##   <chr>                                                                    <int>
## 1 "Buccaneers and Pirates of Our Coasts"                                       1
## 2 "The Psychology of Revolution"                                               2
## 3 "Astronomy for Amateurs"                                                     3
## 4 "Beethoven, a character study\r\nTogether with Wagner's indebtedness to…     4

#misidentified topics
chapter_classifications %>% inner_join(book_topics, by = "topic") %>%
  filter(title != consensus)

## # A tibble: 0 x 5
## # … with 5 variables: title <chr>, chapter <int>, topic <int>, gamma <dbl>,
## #   consensus <chr>

Indeed, no chapters were mis-classified.

5.By Word Assignments: Augment

# see which words are assigned to which topic with augment function
assignments <- augment(chapters_lda, data = chapters_dtm)
assignments

## # A tibble: 67,284 x 4
##    document                                                 term    count .topic
##    <chr>                                                    <chr>   <dbl>  <dbl>
##  1 "Beethoven, a character study\r\nTogether with Wagner's… beetho…    87      4
##  2 "Beethoven, a character study\r\nTogether with Wagner's… beetho…    45      4
##  3 "Beethoven, a character study\r\nTogether with Wagner's… beetho…    44      4
##  4 "Beethoven, a character study\r\nTogether with Wagner's… beetho…    43      4
##  5 "Beethoven, a character study\r\nTogether with Wagner's… beetho…    43      4
##  6 "Beethoven, a character study\r\nTogether with Wagner's… beetho…    39      4
##  7 "Beethoven, a character study\r\nTogether with Wagner's… beetho…    38      4
##  8 "Beethoven, a character study\r\nTogether with Wagner's… beetho…    28      4
##  9 "Beethoven, a character study\r\nTogether with Wagner's… beetho…    25      4
## 10 "Beethoven, a character study\r\nTogether with Wagner's… beetho…    24      4
## # … with 67,274 more rows

# combine assignments with true book titles to find incorrect classification
assignments <- assignments %>% 
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE)%>%
  inner_join(book_topics, by = c(".topic" = "topic"))
assignments

## # A tibble: 67,284 x 6
##    title                   chapter term   count .topic consensus                
##    <chr>                     <int> <chr>  <dbl>  <dbl> <chr>                    
##  1 "Beethoven, a characte…      19 beeth…    87      4 "Beethoven, a character …
##  2 "Beethoven, a characte…      13 beeth…    45      4 "Beethoven, a character …
##  3 "Beethoven, a characte…       8 beeth…    44      4 "Beethoven, a character …
##  4 "Beethoven, a characte…       1 beeth…    43      4 "Beethoven, a character …
##  5 "Beethoven, a characte…       6 beeth…    43      4 "Beethoven, a character …
##  6 "Beethoven, a characte…       2 beeth…    39      4 "Beethoven, a character …
##  7 "Beethoven, a characte…      10 beeth…    38      4 "Beethoven, a character …
##  8 "Beethoven, a characte…       5 beeth…    28      4 "Beethoven, a character …
##  9 "Beethoven, a characte…      11 beeth…    25      4 "Beethoven, a character …
## 10 "Beethoven, a characte…      14 beeth…    24      4 "Beethoven, a character …
## # … with 67,274 more rows

#visualize a confusion matrix
assignments %>%
count(title, consensus, wt = count) %>%
group_by(title) %>%
mutate(percent = n / sum(n)) %>%
ggplot(aes(consensus, title, fill = percent)) +
geom_tile() + 
scale_fill_gradient2(high = "blue", label = percent_format()) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.grid = element_blank()) +
labs(x = "book words were assigned to",
y = "book words came from",
fill = "% of assignments") + 
labs(caption = "Fig: Confusion matrix showing where LDA assigned the words from each book") +
theme(plot.caption = element_text( size = 10),legend.key.size = unit(0.5, "cm"))

Almost all words were correctly assigned to each topic.

# most commonly mistaken words
wrong_words <- assignments %>% filter(title!= consensus)
wrong_words

## # A tibble: 45 x 6
##    title                   chapter term   count .topic consensus                
##    <chr>                     <int> <chr>  <dbl>  <dbl> <chr>                    
##  1 "Buccaneers and Pirate…      32 moon       3      3 "Astronomy for Amateurs" 
##  2 "Buccaneers and Pirate…      32 stars      2      3 "Astronomy for Amateurs" 
##  3 "Beethoven, a characte…       4 revol…     1      2 "The Psychology of Revol…
##  4 "The Psychology of Rev…      45 capta…     1      1 "Buccaneers and Pirates …
##  5 "The Psychology of Rev…      45 town       2      1 "Buccaneers and Pirates …
##  6 "Buccaneers and Pirate…      32 heave…     1      3 "Astronomy for Amateurs" 
##  7 "The Psychology of Rev…      45 vessel     1      1 "Buccaneers and Pirates …
##  8 "Beethoven, a characte…       4 armies     2      2 "The Psychology of Revol…
##  9 "Buccaneers and Pirate…       1 1          1      4 "Beethoven, a character …
## 10 "Beethoven, a characte…       4 bonap…     3      2 "The Psychology of Revol…
## # … with 35 more rows

wrong_words %>% count(title, consensus, term , wt = count) %>% 
  ungroup() %>% arrange(-n)

## # A tibble: 45 x 4
##    title                                    consensus               term       n
##    <chr>                                    <chr>                   <chr>  <dbl>
##  1 "Beethoven, a character study\r\nTogeth… The Psychology of Revo… france     6
##  2 "Beethoven, a character study\r\nTogeth… The Psychology of Revo… bonap…     3
##  3 "Buccaneers and Pirates of Our Coasts"   Astronomy for Amateurs  moon       3
##  4 "Beethoven, a character study\r\nTogeth… The Psychology of Revo… armies     2
##  5 "Beethoven, a character study\r\nTogeth… The Psychology of Revo… louis      2
##  6 "Buccaneers and Pirates of Our Coasts"   Astronomy for Amateurs  stars      2
##  7 "The Psychology of Revolution"           Buccaneers and Pirates… town       2
##  8 "Beethoven, a character study\r\nTogeth… The Psychology of Revo… equal…     1
##  9 "Beethoven, a character study\r\nTogeth… The Psychology of Revo… gover…     1
## 10 "Beethoven, a character study\r\nTogeth… The Psychology of Revo… illus…     1
## # … with 35 more rows

The word “moon” and “stars” appear in “Buccaneers and Pirates of Our Coasts” but they are wrongly classified into “Astronomy for Amateurs” becuase they are more common in the later.

# wrongly classified word, eg. "moon"
word_counts %>% filter(word == "moon")

## # A tibble: 12 x 3
##    document                                word      n
##    <chr>                                   <chr> <int>
##  1 Astronomy for Amateurs_9                moon     64
##  2 Astronomy for Amateurs_10               moon     49
##  3 Astronomy for Amateurs_11               moon     35
##  4 Astronomy for Amateurs_4                moon     10
##  5 Astronomy for Amateurs_12               moon      9
##  6 Astronomy for Amateurs_5                moon      8
##  7 Astronomy for Amateurs_8                moon      4
##  8 Astronomy for Amateurs_3                moon      3
##  9 Astronomy for Amateurs_6                moon      3
## 10 Buccaneers and Pirates of Our Coasts_32 moon      3
## 11 Astronomy for Amateurs_1                moon      2
## 12 Astronomy for Amateurs_7                moon      2

Although the words are presumably different for each topic since books are selected from different genre, LDA algorithm performed really well on identifying topics to the document and words to the topic. Really great for unsupervised clustering!

Topic Modeling on Gutenberg books

Min Tamang

Statistics | Data Science | QA Engineer