Sentiment analysis in large volumes of text data

The current "text-to-everything" trend driven by generative AI models shows that writing is the principal form of communication in the digital era. As a result, most of the data we create is unstructured, which makes it difficult to analyze and to extract meaningful insights from. This is where text mining and NLP come in: these tools let us extract valuable insights and patterns from large volumes of text data. Companies can use them to analyze customer feedback, social media data, news articles, form responses, and other kinds of unstructured data to obtain valuable information about their customers' preferences, opinions, and behaviors.

Here we show an example of sentiment analysis, one of the main NLP techniques for determining the emotional tone of a text: it lets us identify the polarity of a text and how it is perceived. Specifically, we analyze the differences in language and style across three works of fantasy literature available on Project Gutenberg.
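As intuition for polarity: lexicon-based sentiment analysis simply looks each word up in a table of word polarities. A toy base-R sketch with a hand-made two-word lexicon (illustrative only, not the Bing lexicon used below):

```r
# A named vector acting as a tiny polarity lexicon
lexicon <- c(happy = "positive", dreadful = "negative")

words <- c("happy", "dreadful", "shore")
unname(lexicon[words])  # words missing from the lexicon come back as NA
# "positive" "negative" NA
```

Neutral words are simply absent from the lexicon, which is why the real pipeline below uses an inner join: unmatched words are dropped rather than scored.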

First, we prepare the data: creating line numbers, tokenizing the text, and removing stopwords from each book:

library(gutenbergr)  # gutenberg_download()
library(tidytext)    # unnest_tokens(), stop_words, get_sentiments()
library(tidyverse)   # dplyr, stringr, tidyr, ggplot2

fantasy <- gutenberg_download(c(8800, 829, 28885), meta_fields = "author")

tidy_fantasy <- fantasy %>%
  mutate(author = str_replace(author, ", | ", "_")) %>%  # "Swift, Jonathan" -> "Swift_Jonathan"
  group_by(gutenberg_id) %>%
  mutate(linenumber = row_number(),          # line position within each book
         chapter = cumsum(str_detect(text,   # running chapter/canto counter
                                     regex("^chapter|^CANTO [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%  # one lowercase word per row
  anti_join(stop_words) %>%      # drop common stopwords
  mutate(word = str_extract(word, "[a-z']+"))  # keep only the alphabetic part

tidy_fantasy
gutenberg_id author linenumber chapter word
829 Swift_Jonathan 229 1 shore
829 Swift_Jonathan 229 1 country
829 Swift_Jonathan 229 1 lilliput
829 Swift_Jonathan 229 1 prisoner
829 Swift_Jonathan 230 1 carried
829 Swift_Jonathan 230 1 country
829 Swift_Jonathan 233 1 father
829 Swift_Jonathan 233 1 estate
829 Swift_Jonathan 233 1 nottinghamshire
829 Swift_Jonathan 234 1 sons
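As intuition for what unnest_tokens() does in the pipeline above: it lowercases each line and splits it into one word per row. A rough base-R approximation (a sketch, not tidytext's actual implementation), applied to the opening line of Gulliver's Travels:

```r
line <- "My father had a small estate in Nottinghamshire"
# Lowercase, then split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(line), "[^a-z']+")[[1]]
tokens
# "my" "father" "had" "a" "small" "estate" "in" "nottinghamshire"
```

After the anti_join() with stop_words, only the content words ("father", "estate", "nottinghamshire", ...) survive, which is exactly what the rows in the table above show.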

Then we use the Bing lexicon to assign a sentiment to each word by joining it with our data frame. We count positive and negative words in 80-line chunks, pivot to wide format so that each sentiment has its own column, and finally create a column that gives an overall sentiment score to each chunk:

fantasy_sentiment <- tidy_fantasy %>%
  inner_join(get_sentiments("bing")) %>%                   # attach positive/negative labels
  count(author, index = linenumber %/% 80, sentiment) %>%  # counts per 80-line chunk
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)                  # net score per chunk
  
fantasy_sentiment
author index negative positive sentiment
Carroll_Lewis 0 16 13 -3
Carroll_Lewis 1 13 4 -9
Carroll_Lewis 2 20 8 -12
Carroll_Lewis 3 22 14 -8
Carroll_Lewis 4 18 9 -9
Carroll_Lewis 5 22 10 -12
Carroll_Lewis 6 23 10 -13
Carroll_Lewis 7 22 11 -11
Carroll_Lewis 8 14 9 -5
Carroll_Lewis 9 13 3 -10
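Two small details of the pipeline above, illustrated in base R: integer division (%/%) buckets line numbers into 80-line chunks, and the score is simply positive minus negative counts (e.g. Carroll's chunk 0 above: 13 − 16 = −3):

```r
# Integer division maps each line number to its 80-line chunk index
linenumbers <- c(1, 79, 80, 159, 160, 400)
linenumbers %/% 80
# 0 0 1 1 2 5

# The net sentiment score of a chunk is positive minus negative counts
positive <- 13
negative <- 16
positive - negative
# -3
```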

Then we plot the sentiment scores across each text:

ggplot(fantasy_sentiment, aes(index, sentiment, fill = author)) +
  geom_col(show.legend = FALSE) +  # facets already identify the author
  facet_wrap(~author, ncol = 1, scales = "free_x") +
  labs(title = "Sentiment score by Author", y = "Sentiment Score", x = "Index") +
  scale_fill_manual(values = c("#df539b", "#b249ae", "#5d3195")) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    panel.background = element_rect(fill = "white"),
    plot.background = element_rect(fill = "white")
  )

[Figure: sentiment score by author, one panel per book]

We can identify trends and patterns in sentiment that reflect the content of each book; for instance, the early chunks of Carroll's text score consistently negative under the Bing lexicon.
