Twitch Chat Analysis

Session 07 - 🔨 Text as data in R

Published

11.12.2024

Goal of the exercise: Basics of corpus exploration in R
  • Review basic knowledge of working with R, tidyverse, and ggplot2
  • Get to know the typical steps of tidy text analysis with quanteda, from tokenisation and summarisation to visualisation.

Background

Today's data basis: Twitch Chat & Transcripts

Transcripts and chats of the live streams by hasanabi, zackrawrr, and TheMajorityReport covering the 2024 Presidential (Harris vs. Trump) and Vice-Presidential (Vance vs. Walz) debates

  • The best way to learn R is by trying it out. This document walks through one version of a “normal” data processing procedure.
  • Use tidytuesday data as an example to showcase the potential

Preparation

Packages

The packages are loaded with the p_load() function from the pacman package, which has several advantages over the conventional approach with library():

  • Concise syntax
  • Automatic installation (if the package is not already installed)
  • Loading multiple packages at once
  • Automatic search for dependencies
pacman::p_load(
    here, taylor,
    magrittr, janitor,
    ggpubr, 
    gt, gtExtras,
    countdown, 
    qs, # reading the .qs data files via qs::qread()
    quanteda, # quanteda text processing
    quanteda.textstats, # textstat_frequency()
    quanteda.textplots, 
    easystats, tidyverse
)
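
For comparison, the steps that p_load() bundles can be sketched in base R roughly as follows. This is only an illustration of the idea, not how pacman is implemented. The package names are stand-ins (an assumption): stats and utils ship with R, so the sketch runs without installing anything.

```r
# Stand-in package names (assumption): both ship with every R installation
pkgs <- c("stats", "utils")

# 1. Find the packages that are not installed yet
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# 2. Install them only if necessary
if (length(not_installed) > 0) {
  install.packages(not_installed)
}

# 3. Attach every package; require() returns TRUE on success
loaded <- vapply(pkgs, require, logical(1), character.only = TRUE, quietly = TRUE)
loaded
```

With p_load(), these three steps collapse into a single call that takes unquoted package names.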

Data import and preprocessing

chats <- qs::qread(here("local_data/chat-debates_full.qs"))$correct
transcripts <- qs::qread(here("local_data/transcripts-debates_full.qs"))$correct

🛠️ Practical exercise

Attention, please read!
  • Before you start working on the following 📋 Exercises, please make sure that you have run all chunks of the Preparation section. You can do this with the “Run all chunks above” button of the next chunk.
  • If you have questions about the code, take a look at the tutorial (.qmd or .html). The tutorial is a compact version of the R code used in the presentation, so you can use it to look up the code building blocks behind the R outputs on the slides.

🔎 Getting to know the chat dataset

📋 Exercise 1: Create corpus

  • Create new dataset corp_chats
    1. Based on the dataset chats, create a corpus object with the quanteda package.
    2. Use the corpus() function with the docid_field argument set to “message_id” and the text_field argument set to “message_content”.
    3. Check if the transformation was successful by using the summary() function.
# Create corpus object corp_chats
corp_chats <- chats %>% 
  quanteda::corpus(
    docid_field = "message_id", 
    text_field = "message_content"
  )

# Check
summary(corp_chats)
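
To see what docid_field and text_field do, it can help to try corpus() on a tiny, made-up data frame first. The toy data below is an assumption for illustration only: the messages are invented, and the streamer column is hypothetical and merely stands in for whatever additional columns chats contains.

```r
library(quanteda)

# Made-up stand-in for `chats` (assumption: invented values, hypothetical `streamer` column)
toy_chats <- data.frame(
  message_id = c("m1", "m2", "m3"),
  message_content = c("LUL he said that again", "fact check please", "LUL LUL"),
  streamer = c("hasanabi", "zackrawrr", "hasanabi"),
  stringsAsFactors = FALSE
)

# message_id becomes the document name, message_content the document text
toy_corp <- corpus(
  toy_chats,
  docid_field = "message_id",
  text_field = "message_content"
)

ndoc(toy_corp)      # number of documents
docnames(toy_corp)  # document names, taken from message_id
docvars(toy_corp)   # all remaining columns are kept as document variables
```

Columns that are neither the ID nor the text field are not lost: they travel with the corpus as docvars and can be used later for grouping or filtering.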

📋 Exercise 2: Tokenization & DFM conversion

  • Create new datasets toks_chats & dfm_chats
    1. Based on the dataset corp_chats, create tokens using the tokens() function from the quanteda package.
    2. Convert the tokens to a document-feature matrix (DFM) using the dfm() function from the quanteda package.
    3. Check if the transformations were successful (e.g. by using the print() function).
# Create tokens
toks_chats <- corp_chats %>%
    quanteda::tokens() 
 
# Create DFM
dfm_chats <- toks_chats %>%
    quanteda::dfm()

# Check
toks_chats %>% print()
dfm_chats %>% print()
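
The difference between the two objects is easier to see on a toy example. The sketch below uses two invented messages (an assumption, not part of the course data): tokens() keeps case and punctuation by default, while dfm() lowercases features by default, so "Fact" and "fact" end up in the same column.

```r
library(quanteda)

# Two invented chat messages (assumption: illustration only)
toy_corp <- corpus(c(m1 = "Fact check, please!", m2 = "fact check LUL"))

# Tokens: one ordered sequence of tokens per document, punctuation included
toy_toks <- tokens(toy_corp)

# DFM: documents in rows, features in columns, counts in the cells
toy_dfm <- dfm(toy_toks)

ntoken(toy_toks)    # tokens per document
nfeat(toy_dfm)      # number of distinct features
featnames(toy_dfm)  # note the lowercased features and the punctuation marks
```

The punctuation marks that show up as features here are a first hint at why preprocessing matters for frequency counts.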

📋 Exercise 3: Analyse DFM

  • Based on dfm_chats
    1. Use the textstat_frequency() function from the quanteda.textstats package to get the top 50 tokens.
    2. Display the results.
  • Based on the results, what preprocessing steps could be useful?
# Top 50 Tokens
dfm_chats %>% 
  quanteda.textstats::textstat_frequency(n = 50) 
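
When reading the output, note that frequency and docfreq answer different questions: frequency counts every occurrence of a feature, docfreq counts the documents that contain it. A minimal sketch with three invented messages (an assumption for illustration) shows the difference:

```r
library(quanteda)
library(quanteda.textstats)

# Three invented chat messages (assumption: illustration only)
toy_corp <- corpus(c(
  m1 = "LUL LUL LUL",
  m2 = "the debate is wild",
  m3 = "the LUL debate"
))
toy_dfm <- dfm(tokens(toy_corp))

# Top 3 features: `frequency` counts occurrences, `docfreq` counts documents
freq <- textstat_frequency(toy_dfm, n = 3)
freq[, c("feature", "frequency", "docfreq")]
```

In chat data, a large gap between the two columns often points to spam-like repetition of an emote within single messages.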

📋 Exercise 4: Preprocessing

  • Create a new dataset dfm_chats_preprocessed
    1. Based on corp_chats, preprocess the data according to the steps you think are necessary (e.g. removing punctuation, symbols, numbers, URLs, and stopwords).
    2. Depending on the steps you choose, you might need to use the tokens_remove() function from the quanteda package.
    3. Create a new DFM object dfm_chats_preprocessed.
    4. Use the textstat_frequency() function from the quanteda.textstats package on the newly created dataset to get the top 50 tokens and compare the result with the results of Exercise 3.
# Preprocessing
dfm_chats_preprocessed <- corp_chats %>% 
  quanteda::tokens(
    remove_punct = TRUE, 
    remove_symbols = TRUE,
    remove_numbers = TRUE,
    remove_url = TRUE, 
    split_hyphens = FALSE,
    split_tags = FALSE
  ) %>% 
  quanteda::tokens_remove(
    pattern = quanteda::stopwords("en")
  ) %>% 
  quanteda::dfm(
    tolower = TRUE
  )

# Check
dfm_chats_preprocessed %>% 
  quanteda.textstats::textstat_frequency(n = 50)
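
To check which features a preprocessing pipeline actually drops, the feature sets before and after can be compared directly. The following is a minimal sketch on two invented messages (an assumption, not the course data); the same comparison should work with dfm_chats and dfm_chats_preprocessed.

```r
library(quanteda)

# Two invented chat messages (assumption: illustration only)
toy_corp <- corpus(c(
  m1 = "The debate is wild!!!",
  m2 = "Watch it at https://example.com in 2024"
))

# DFM without any preprocessing
raw_dfm <- dfm(tokens(toy_corp))

# DFM with the same kind of preprocessing as in the solution above
clean_toks <- tokens(
  toy_corp,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE
)
clean_toks <- tokens_remove(clean_toks, pattern = stopwords("en"))
clean_dfm <- dfm(clean_toks)

# Features that were removed by the preprocessing steps
setdiff(featnames(raw_dfm), featnames(clean_dfm))

# Features that remain
featnames(clean_dfm)
```

Inspecting the removed features is a quick sanity check that nothing substantively interesting (for example an emote that happens to look like a stopword) was dropped by accident.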