Session 4

📚 From Pretest to Analysis

Overview

In this session, we move from fieldwork and pretesting to analysis planning, focusing on digital behavioral data and on text, image, and geospatial data as extensions of classic survey data.

Participate

🖥️ Session 04

Useful resources

about content analysis in general

  • Krippendorff, K. (2019). Content analysis: An introduction to its methodology. SAGE Publications, Inc. https://doi.org/10.4135/9781071878781

  • Neuendorf, K. A. (2017). The content analysis guidebook. SAGE Publications, Inc. https://doi.org/10.4135/9781071802878

  • Rössler, P. (2017). Inhaltsanalyse (3., völlig überarbeitete Auflage). UVK Verlagsgesellschaft mbH mit UVK/Lucius.

about text as data & automated text analysis (with relation to surveys)

  • Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as data: A new framework for machine learning and the social sciences. Princeton University Press.

  • Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., Häussler, T., Schmid-Petri, H., & Adam, S. (2018). Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology. Communication Methods and Measures, 12(2-3), 93–118. https://doi.org/10.1080/19312458.2018.1430754

  • Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064–1082. https://doi.org/10.1111/ajps.12103

about embeddings

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding (J. Burstein, C. Doran, & T. Solorio, Eds.; pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. https://doi.org/10.48550/ARXIV.1301.3781

about LLMs as coders (assistance)

  • Ashwin, J., Chhabra, A., & Rao, V. (2025). Using Large Language Models for Qualitative Analysis can Introduce Serious Bias. Sociological Methods & Research. https://doi.org/10.1177/00491241251338246

  • Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30). https://doi.org/10.1073/pnas.2305016120

about image analysis (OCR, object detection, multimodal)

  • Smith, R. (2007). An overview of the tesseract OCR engine. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2, 629–633. https://doi.org/10.1109/icdar.2007.4376991

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2015). You only look once: Unified, real-time object detection. https://doi.org/10.48550/ARXIV.1506.02640

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. https://doi.org/10.48550/ARXIV.2103.00020

about geospatial data & ESM (geofencing & GEMA)

  • Haas, G.-C., Trappmann, M., Keusch, F., & Bähr, S. (2020). Using Geofences to Collect Survey Data: Lessons Learned From the IAB-SMART Study. Survey Methods: Insights from the Field (SMIF). https://doi.org/10.13094/SMIF-2020-00023

  • Kingsbury, C., Buzzi, M., Chaix, B., Kanning, M., Khezri, S., Kiani, B., Kirchner, T. R., Maurel, A., Thierry, B., & Kestens, Y. (2024). STROBE-GEMA: a STROBE extension for reporting of geographically explicit ecological momentary assessment studies. Archives of Public Health, 82(1). https://doi.org/10.1186/s13690-024-01310-8

  • Zhang, Y., Li, D., Li, X., Zhou, X., & Newman, G. (2024). The integration of geographic methods and ecological momentary assessment in public health research: A systematic review of methods and applications. Social Science & Medicine, 354, 117075. https://doi.org/10.1016/j.socscimed.2024.117075

Useful tools

R packages

  • tidyverse, stringr, tidyr – cleaning, reshaping, string ops
  • quanteda, quanteda.textstats, quanteda.textmodels – text as data workflows (DFM, dictionaries, stats)
  • tidytext – tidy text mining, lexicons, token workflows
  • stm – structural topic models (STM), incl. open-ended responses
  • udpipe, spacyr – tokenization, POS-tagging, lemmatization
  • text2vec – embeddings + similarity/clustering pipelines
  • sf, lwgeom, geosphere – spatial data + distances/geometry
  • tmap, leaflet – interactive/static maps for exploration

Python packages

  • pandas, numpy – data wrangling
  • scikit-learn – vectorization, clustering, evaluation
  • spacy, nltk – NLP preprocessing
  • gensim – topic models + word embeddings
  • sentence-transformers, transformers – embeddings + LLM/Transformer tooling
  • bertopic – topic modeling on embeddings
  • pytesseract + opencv-python – OCR pipelines + image preprocessing
  • torch, torchvision – deep learning base stack (if needed)
  • ultralytics (YOLO) – object detection workflows
  • geopandas, shapely, pyproj – geodata + projections
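
A minimal, dependency-free sketch of the geofencing idea that the geodata packages above support: it checks whether a GPS fix lies within a circular fence using the haversine great-circle distance. The function names, coordinates, and radius are hypothetical and chosen only for illustration; a real project would more likely use geopandas/shapely with a projected coordinate system and polygon-shaped fences.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def in_geofence(lat, lon, center_lat, center_lon, radius_m):
    """True if the point lies within radius_m of the (hypothetical) fence centre."""
    return haversine_m(lat, lon, center_lat, center_lon) <= radius_m


# Hypothetical fence centre and GPS fixes, invented for illustration only.
fence_center = (49.4875, 8.4660)
gps_fixes = [(49.4876, 8.4662), (49.5000, 8.5000)]

for lat, lon in gps_fixes:
    if in_geofence(lat, lon, *fence_center, radius_m=200):
        print(f"({lat}, {lon}) is inside the fence: a survey prompt could be triggered")
    else:
        print(f"({lat}, {lon}) is outside the fence")
```

A circular fence keeps the sketch short. In practice, GPS accuracy and sampling intervals affect whether a visit is detected at all, so the radius is a design decision rather than a fixed constant.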
