Corpus: Transcripts

Information

Based on the transcript data, this script:

Creates a corpus, tokens, and a document-feature matrix with the quanteda package (v4.1.0, Benoit et al. 2018).
Utilizes udpipe (v0.8.11, Wijffels 2023) and spacyr (v1.3.0, Benoit and Matsuo 2023) packages for additional linguistic processing, adding lemmatization, part-of-speech tagging, and named entity recognition.

Preparation

# Load packages
source(file = here::here(
  "data_collection/00_02-setup-session.R"
))

transcripts <- qs::qread(here("local_data/transcripts-debates_full.qs"))

Process data

transcripts_corpora <- list()

# Create corpus
transcripts_corpora$corp <- transcripts$hashed %>% 
    quanteda::corpus(
        docid_field = "id_sequence", 
        text_field = "dialogue"
  )

# Create tokens
transcripts_corpora$toks <- transcripts_corpora$corp %>% 
    quanteda::tokens(
        remove_punct = TRUE, 
        remove_symbols = TRUE,
        remove_numbers = TRUE,
        remove_url = TRUE, 
        split_hyphens = FALSE,
        split_tags = FALSE
        ) %>% 
    quanteda::tokens_remove(
        pattern = quanteda::stopwords("en")
    )

# Create Document Feature Matrix (DFM)
transcripts_corpora$dfm <- transcripts_corpora$toks %>% 
    quanteda::dfm()

# Execute on first run, to download the model 
# udmodel <- udpipe::udpipe_download_model(
#     language = "english",
#     model_dir = here("models"))

# Load udpipe model
udmodel_english <- udpipe::udpipe_load_model(file = here("models/english-ewt-ud-2.5-191206.udpipe"))

transcripts_corpora$udpipe <- transcripts$correct %>% 
  rename(
    doc_id = id_sequence,
    text = dialogue
  ) %>% 
  udpipe::udpipe(udmodel_english)

# Define environment
reticulate::use_virtualenv("r-spacyr")

# Initialize
# spacyr::spacy_download_langmodel("en_core_web_sm", force = TRUE)
spacyr::spacy_initialize("en_core_web_sm")

# Parse text
transcripts_corpora$spacyr <- transcripts_corpora$corp %>% 
    spacyr::spacy_parse(.,
        tag = TRUE,
        pos = TRUE,
        lemma = TRUE,
        entity = TRUE,
        dependency = TRUE,
        nounphrase = TRUE,
        multithread = TRUE,
        additional_attributes = c(
          "is_punct"
        )
    )

Save data

# Save complete data
qs::qsave(
    transcripts_corpora,
    file = here("local_data/transcripts-corpora_full.qs")
)

# Save udpipe corpus
qs::qsave(
    transcripts_corpora$udpipe, 
    file = here("local_data/transcripts-corpus_udpipe.qs")
)

# Save spacyr corpus
qs::qsave(
    transcripts_corpora$spacyr, 
    file = here("local_data/transcripts-corpus_spacyr.qs")
)

References

Benoit, Kenneth, and Akitaka Matsuo. 2023. Spacyr: Wrapper to the ’spaCy’ ’NLP’ Library. https://spacyr.quanteda.io.

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An r Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.

Wijffels, Jan. 2023. Udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ’UDPipe’ ’NLP’ Toolkit. https://CRAN.R-project.org/package=udpipe.