pacman::p_load(
here, taylor,
magrittr, janitor,
ggpubr,
gt, gtExtras,
countdown,
quanteda, # quanteda text processing
quanteda.textplots,
easystats, tidyverse
)

Twitch Chat Analysis
Session 07 - 🔨 Text as data in R
Goal of this application: basics of corpus exploration in R
- Review basic knowledge of working with R, tidyverse, and ggplot2
- Get to know the typical steps of tidy text analysis with quanteda, from tokenisation and summarisation to visualisation.
Background
Today’s data basis: Twitch Chat & Transcripts
Transcripts & chats of the live streams from hasanabi, zackrawrr, and TheMajorityReport for the 2024 presidential (Harris vs. Trump) and vice-presidential (Vance vs. Walz) debates
- The best way to learn R is by trying. This document walks through a version of the typical data processing procedure.
- Use tidytuesday data as an example to showcase the potential.
Preparation
Packages
The pacman::p_load() function from the pacman package is used to load the packages; it has several advantages over the conventional method with library():
- Concise syntax
- Automatic installation (if the package is not already installed)
- Loading multiple packages at once
- Automatic search for dependencies
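With base R alone, the same behaviour requires a helper loop; the sketch below (a hypothetical `load_packages()` helper, not part of this course's code) shows what p_load() condenses into one call:

```r
# Conventional approach: install each package if missing, then load it
load_packages <- function(pkgs) {
  for (p in pkgs) {
    if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
    library(p, character.only = TRUE)
  }
}

# load_packages(c("quanteda", "tidyverse"))
# is roughly equivalent to:
# pacman::p_load(quanteda, tidyverse)
```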
Import and preprocessing of the data
chats <- qs::qread(here("local_data/chat-debates_full.qs"))$correct
transcripts <- qs::qread(here("local_data/transcripts-debates_full.qs"))$correct

🛠️ Practical exercise
Attention, please read!
- Before you start working on the following 📋 Exercises, please make sure that you have run all chunks of the Preparation section. You can do this by using the “Run all chunks above” button of the next chunk.
- If you have questions about the code, it is worth taking a look at the tutorial (.qmd or .html). The tutorial is a compact presentation of the R code used in the presentation. You can therefore use the tutorial to look up the code building blocks that were used for the R outputs on the slides.
🔎 Getting to know the chat dataset
📋 Exercise 1: Create corpus
- Create a new dataset corp_chats.
- Based on the dataset chats, create a corpus object with the quanteda package.
- Use the corpus() function with the docid_field argument set to “message_id” and the text_field argument set to “message_content”.
- Check whether the transformation was successful by using the summary() function.
# Create new dataset corp_chats
corp_chats <- chats %>%
quanteda::corpus(
docid_field = "message_id",
text_field = "message_content"
)
# Check
summary(corp_chats)

📋 Exercise 2: Tokenization & DFM conversion
- Create the new datasets toks_chats & dfm_chats.
- Based on the dataset corp_chats, create tokens using the tokens() function from the quanteda package.
- Convert the tokens to a document-feature matrix (DFM) using the dfm() function from the quanteda package.
- Check whether the transformations were successful (e.g. by using the print() function).
# Create tokens
toks_chats <- corp_chats %>%
quanteda::tokens()
# Create DFM
dfm_chats <- toks_chats %>%
quanteda::dfm()
# Check
toks_chats %>% print()
dfm_chats %>% print()

📋 Exercise 3: Analyse DFM
- Based on dfm_chats:
- Use the textstat_frequency() function from the quanteda.textstats package to get the top 50 tokens.
- Display the results.
- Based on the results, what preprocessing steps could be useful?
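To see why this question matters, the same frequency check can be run on quanteda’s bundled data_char_ukimmig2010 texts (a self-contained illustration, independent of the chat data): without preprocessing, the top features of a DFM are typically dominated by stopwords and punctuation.

```r
library(quanteda)

# Build a raw DFM from the bundled UK manifesto texts: no punctuation
# removal, no stopword removal
dfm_raw <- data_char_ukimmig2010 |>
  corpus() |>
  tokens() |>
  dfm()

# The most frequent features are function words and punctuation,
# not substantive vocabulary
topfeatures(dfm_raw, 10)
```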
# Top 50 Tokens
dfm_chats %>%
quanteda.textstats::textstat_frequency(n = 50)

📋 Exercise 4: Preprocessing
- Create a new dataset dfm_chats_preprocessed.
- Based on corp_chats, preprocess the data according to the steps you think are necessary (e.g. removing punctuation, symbols, numbers, URLs, and stopwords).
- Depending on the steps you choose, you might need to use the tokens_remove() function from the quanteda package.
- Create a new DFM object dfm_chats_preprocessed.
- Use the textstat_frequency() function from the quanteda.textstats package on the newly created dataset to get the top 50 tokens and compare the result with the results of Exercise 3.
# Preprocessing
dfm_chats_preprocessed <- corp_chats %>%
quanteda::tokens(
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
split_hyphens = FALSE,
split_tags = FALSE
) %>%
quanteda::tokens_remove(
pattern = quanteda::stopwords("en")
) %>%
quanteda::dfm(
tolower = TRUE
)
# Check
dfm_chats_preprocessed %>%
quanteda.textstats::textstat_frequency(n = 50)
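The same preprocessing pipeline can be reproduced end-to-end without the local chat files, using quanteda’s built-in inaugural-address corpus; this is a sketch of the identical steps, not part of the exercise solution:

```r
library(quanteda)
library(quanteda.textstats)

# Tokenise the built-in inaugural corpus, drop noise tokens and
# English stopwords, then build a lowercased DFM
dfm_inaug <- data_corpus_inaugural |>
  tokens(
    remove_punct   = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE,
    remove_url     = TRUE
  ) |>
  tokens_remove(pattern = stopwords("en")) |>
  dfm(tolower = TRUE)

# Top 10 features after preprocessing
textstat_frequency(dfm_inaug, n = 10)
```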