pacman::p_load(
    here, taylor,
    magrittr, janitor,
    ggpubr, 
    gt, gtExtras,
    countdown, 
    quanteda, # quanteda text processing
    quanteda.textplots, 
    easystats, tidyverse
)

Twitch Chat Analysis
Session 07 - 🔨 Text as data in R
Aim of the session: basics of corpus exploration in R
- Review basic knowledge of working with R, tidyverse, and ggplot2
- Get to know the typical steps of tidy text analysis with quanteda, from tokenisation and summarisation to visualisation.
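The typical steps listed above can be sketched end to end. This is a minimal illustration, not part of the session's data: it uses quanteda's built-in `data_char_ukimmig2010` sample texts (UK party manifesto extracts) instead of the Twitch data introduced below.

```r
# Minimal sketch of the tidy quanteda pipeline covered in this session,
# using built-in sample texts as stand-in data
library(quanteda)
library(quanteda.textstats)

corp  <- corpus(data_char_ukimmig2010)      # raw texts -> corpus
toks  <- tokens(corp, remove_punct = TRUE)  # corpus -> tokens
dfmat <- dfm(toks)                          # tokens -> document-feature matrix

# Summarise: the most frequent features across all documents
textstat_frequency(dfmat, n = 5)
```

Each step returns an object that feeds directly into the next, which is why the exercises below build the chat data up in exactly this order.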
Background
Today’s data basis: Twitch Chat & Transcripts
Transcripts and chats of the live streams from hasanabi, zackrawrr, and TheMajorityReport for the 2024 presidential (Harris vs. Trump) and vice-presidential (Vance vs. Walz) debates
- The best way to learn R is by trying. This document walks through a version of the typical data-processing workflow.
- It uses the Twitch debate data as an example to showcase the potential.
Preparation
Packages
The pacman::p_load() function from the pacman package is used to load the packages, which has several advantages over the conventional method with library():
- Concise syntax
- Automatic installation (if the package is not already installed)
- Loading multiple packages at once
- Automatic search for dependencies
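For comparison, a sketch of the conventional base-R equivalent of `pacman::p_load()`: check for missing packages, install them, then attach each one. The `pkgs` vector here is an illustrative subset of the packages loaded above.

```r
# Conventional equivalent of pacman::p_load() in base R
pkgs <- c("quanteda", "tidyverse")  # illustrative subset

# Install any packages that are not yet available
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)

# Attach them one by one (character.only = TRUE lets library() take strings)
invisible(lapply(pkgs, library, character.only = TRUE))
```

`p_load()` wraps all of this in a single call, which is why it is used here.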
Import and preprocessing of the data
chats <- qs::qread(here("local_data/chat-debates_full.qs"))$correct
transcripts <- qs::qread(here("local_data/transcripts-debates_full.qs"))$correct

🛠️ Practical exercise
Attention, please read!
- Before you start working on the following 📋 Exercises, please make sure that you have run all chunks of the Preparation section. You can do this by using the “Run all chunks above” button of the next chunk.
- If you have questions about the code, it is worth taking a look at the tutorial (.qmd or .html). The tutorial is a compact presentation of the R code used in the presentation. You can therefore use it to inspect the code building blocks behind the R outputs on the slides.
🔎 Getting to know the chat dataset
📋 Exercise 1: Create corpus
- Create a new dataset corp_chats
- Based on the dataset chats, create a corpus object with the quanteda package.
- Use the corpus() function with the docid_field argument set to “message_id” and the text_field argument set to “message_content”.
- Check if the transformation was successful by using the summary() function.
# Create new dataset corp_chats
corp_chats <- chats %>% 
  quanteda::corpus(
    docid_field = "message_id", 
    text_field = "message_content"
  )
# Check
summary(corp_chats)

📋 Exercise 2: Tokenization & DFM conversion
- Create new datasets toks_chats & dfm_chats
- Based on the dataset corp_chats, create tokens using the tokens() function from the quanteda package.
- Convert the tokens to a document-feature matrix (DFM) using the dfm() function from the quanteda package.
- Check if the transformations were successful (e.g. by using the print()function).
# Create tokens
toks_chats <- corp_chats %>%
    quanteda::tokens() 
 
# Create DFM
dfm_chats <- toks_chats %>%
    quanteda::dfm()
# Check
toks_chats %>% print()
dfm_chats %>% print()

📋 Exercise 3: Analyse DFM
- Based on dfm_chats
- Use the textstat_frequency() function from the quanteda.textstats package to get the top 50 tokens.
- Display the results.
- Based on the results, what preprocessing steps could be useful?
# Top 50 Tokens
dfm_chats %>% 
  quanteda.textstats::textstat_frequency(n = 50)

📋 Exercise 4: Preprocessing
- Create a new dataset dfm_chats_preprocessed
- Based on corp_chats, preprocess the data according to the steps you think are necessary (e.g. removing punctuation, symbols, numbers, URLs, and stopwords).
- Depending on the steps you choose, you might need to use the tokens_remove() function from the quanteda package.
- Create a new DFM object dfm_chats_preprocessed.
- Use the textstat_frequency() function from the quanteda.textstats package on the newly created dataset to get the top 50 tokens and compare the result with the results of Exercise 3.
# Preprocessing
dfm_chats_preprocessed <- corp_chats %>% 
  quanteda::tokens(
    remove_punct = TRUE, 
    remove_symbols = TRUE,
    remove_numbers = TRUE,
    remove_url = TRUE, 
    split_hyphens = FALSE,
    split_tags = FALSE
  ) %>% 
  quanteda::tokens_remove(
    pattern = quanteda::stopwords("en")
  ) %>% 
  quanteda::dfm(
    tolower = TRUE
  )
# Check
dfm_chats_preprocessed %>% 
  quanteda.textstats::textstat_frequency(n = 50)
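As a possible follow-up, the preprocessed DFM can be visualised with the packages already loaded in the Preparation section. This sketch assumes the `dfm_chats_preprocessed` object from Exercise 4; the plot choices (word cloud, bar chart of the top 20 tokens) are illustrative.

```r
# Word cloud of the 100 most frequent features (assumes dfm_chats_preprocessed
# from Exercise 4 exists in the session)
dfm_chats_preprocessed %>%
  quanteda.textplots::textplot_wordcloud(max_words = 100)

# Bar chart of the top 20 tokens; textstat_frequency() returns a data frame
# with "feature" and "frequency" columns that plugs straight into ggplot2
dfm_chats_preprocessed %>%
  quanteda.textstats::textstat_frequency(n = 20) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency")
```

This closes the loop on the session goal stated at the top: from tokenisation and summarisation to visualisation.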