```r
pacman::p_load(
  here, taylor,
  magrittr, janitor,
  ggpubr,
  gt, gtExtras,
  countdown,
  quanteda, # quanteda text processing
  quanteda.textplots,
  easystats, tidyverse
)
```

Advanced Twitch Chat Analysis
Session 08 - 🔨 Advanced Methods in R
Goal of the session: advanced corpus analysis in R
- Review advanced methods of working with R, tidyverse, and ggplot2
- Get to know the typical steps of advanced text analysis with
quanteda, from tokenisation and summarisation to visualisation.
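The typical quanteda workflow named above can be sketched on a toy example before working with the real data. The two messages below are invented placeholders, not the workshop corpus:

```r
# Minimal sketch of the quanteda pipeline: corpus -> tokens -> DFM.
# The two messages are invented examples, not the workshop data.
library(quanteda)

toy_corpus <- corpus(c(
  msg1 = "Chat moves fast, so fast!",
  msg2 = "100 messages about the debate."
))

toy_tokens <- tokens(
  toy_corpus,
  remove_punct = TRUE,
  remove_numbers = TRUE
)

toy_dfm <- dfm(toy_tokens)

# Most frequent features across both documents
topfeatures(toy_dfm)
```

The same three steps (corpus → tokens → DFM) recur in the exercises below, only with the chat data and a few more cleaning arguments.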
Background
Today’s data basis: Advanced Text Analysis
Transcripts & chats of the live streams from hasanabi, zackrawrr, and TheMajorityReport for the presidential (Harris vs. Trump) and vice-presidential (Vance vs. Walz) debates 2024
- The best way to learn R is by trying. This document walks through a version of the “normal” data-processing procedure.
- Use tidytuesday data as an example to showcase the potential.
Preparation
Packages
The pacman::p_load() function from the pacman package is used to load the packages, which has several advantages over the conventional method with library():
- Concise syntax
- Automatic installation (if the package is not already installed)
- Loading multiple packages at once
- Automatic search for dependencies
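The install-if-missing behaviour that makes p_load() convenient can be approximated in base R. A minimal sketch (the helper name load_or_install is invented for illustration, not part of pacman):

```r
# Base-R approximation of what pacman::p_load() automates:
# install a package if it is missing, then attach it.
# (load_or_install is an invented helper name, not a pacman function.)
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)              # automatic installation
    }
    library(pkg, character.only = TRUE)  # attach the package
  }
}

# Base packages, so this runs without any downloads
load_or_install(c("stats", "utils"))
```

pacman additionally resolves dependencies and accepts unquoted package names, which is why the session uses p_load() instead of a hand-rolled helper like this.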
Import and preprocessing of the data
```r
chats <- qs::qread(here("local_data/chats.qs"))
transcripts <- qs::qread(here("local_data/transcripts.qs"))
chats_spacyr <- qs::qread(here("local_data/chat-corpus_spacyr.qs"))
```

🛠️ Practical Exercise
Attention, please read!
- Before you start working on the following 📋 Exercises, please make sure that you have run all chunks of the Preparation section. You can do this by using the “Run all chunks above” button of the next chunk.
- If you have questions about the code, it is worth taking a look at the tutorial (.qmd or .html). The tutorial is a compact presentation of the R code used in the slides. You can therefore use it to look up the code building blocks behind the R outputs shown in the presentation.
🔎 Getting to know the chat dataset
📋 Exercise 1: Create corpus, tokens & DFM
- Create a new dataset corp_chats
  - Based on the dataset chats, create a corpus object with the quanteda package.
  - Use the corpus() function with the docid_field argument set to “message_id” and the text_field argument set to “message_content”.
- Create a new dataset toks_chats
  - Based on the dataset corp_chats, create tokens using the tokens() function from the quanteda package, including the removal of punctuation, symbols, numbers, URLs, and stopwords.
  - Use the tokens_remove() function to remove stopwords (en).
- Create a new dataset dfm_chats
  - Convert the tokens to a document-feature matrix (DFM) using the dfm() function from the quanteda package.
```r
# Create corpus
corp_chats <- chats %>%
  quanteda::corpus(
    docid_field = "message_id",
    text_field = "message_content"
  )

# Create tokens
toks_chats <- corp_chats %>%
  quanteda::tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE,
    remove_url = TRUE,
    split_hyphens = FALSE,
    split_tags = FALSE
  ) %>%
  quanteda::tokens_remove(
    pattern = quanteda::stopwords("en")
  )

# Create DFM
dfm_chats <- toks_chats %>%
  quanteda::dfm()
```

📋 Exercise 2: Semantic Network
- Create a semantic network based on the top 50 tokens from dfm_chats.
  - Based on dfm_chats, create an object called top50_tokens by using the topfeatures() function from the quanteda package together with names() to get the top 50 tokens.
  - Based on dfm_chats, create a feature co-occurrence matrix (FCM) using the fcm() function from the quanteda package.
  - Select the top 50 tokens from the FCM using the fcm_select() function.
  - Create a network plot using the textplot_network() function from the quanteda package.
```r
top50_tokens <- dfm_chats %>%
  topfeatures(n = 50) %>%
  names()

dfm_chats %>%
  fcm() %>%
  fcm_select(pattern = top50_tokens) %>%
  textplot_network()
```

📋 Exercise 3: Analysis based on POS tagging
- Based on chats_spacyr, analyse the adjectives associated with Trump.
  - Filter the dataset by using filter() and the arguments pos == "NOUN" and lemma == "trump".
  - Join the dataset with itself by using inner_join() with the join columns doc_id and sentence_id and the argument relationship = "many-to-many".
  - Filter the dataset again for adjectives whose head token id equals the token id of the noun. To do that, use filter() and the arguments pos.y == "ADJ" and head_token_id.y == token_id.x.
  - Rename the columns and select the relevant columns.
  - Display the results using the sjmisc::frq() function.
```r
chats_spacyr %>%
  filter(
    pos == "NOUN" &
    lemma == "trump") %>%
  inner_join(
    chats_spacyr,
    by = c(
      "doc_id",
      "sentence_id"),
    relationship =
      "many-to-many") %>%
  filter(
    pos.y == "ADJ" &
    head_token_id.y == token_id.x) %>%
  rename(
    token_id = token_id.y,
    token = token.y) %>%
  select(
    doc_id, sentence_id,
    token_id, token) %>%
  sjmisc::frq(token, sort.frq = "desc")
```

📋 Exercise 6: Named Entity Recognition (NER)
- Analyse the named entities in the chat data.
  - Based on chats_spacyr, use the frq() function from the sjmisc package to get the frequency of named entities.
  - Again based on chats_spacyr, filter the dataset for named entities that indicate a person is mentioned (by using filter() and the variable entity). Use the output of the previous step to identify the correct entity. Additionally, base all further analysis only on nouns, by using filter() and the variable pos == "NOUN".
  - Use the frq() function from the sjmisc package to get the frequency. To avoid display errors, use the min.frq = 10 argument to only display tokens with a frequency of at least 10.
```r
# Identify named entities
chats_spacyr %>%
  sjmisc::frq(entity, sort.frq = "desc")

# Analyse named entities
chats_spacyr %>%
  filter(entity == "PERSON_B") %>%
  filter(pos == "NOUN") %>%
  sjmisc::frq(token, sort.frq = "desc", min.frq = 10)
```
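As a cross-check, the same frequency logic can be reproduced with dplyr's count() (attached via tidyverse in the Preparation section). The tiny tibble below is invented stand-in data, not the real chat corpus:

```r
library(dplyr)
library(tibble)

# Invented stand-in for the filtered spacyr output (not the real chat data)
tokens_df <- tibble(
  token = c(rep("guy", 12), rep("man", 11), "vote")
)

tokens_df %>%
  count(token, sort = TRUE) %>%  # frequency table, descending
  filter(n >= 10)                # mirrors min.frq = 10 above
```

Unlike sjmisc::frq(), count() returns a plain tibble, which is convenient when the frequencies feed into a later ggplot2 or gt step.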