Session 4

📚 From Pretest to Analysis

Overview

In this session, we move from fieldwork and pretesting to analysis planning, focusing on digital behavioral data and on text, image, and geo data as extensions of classic survey data.

Participate

🖥️ Session 04

Useful resources

about content analysis in general

Krippendorff, K. (2019). Content analysis: An introduction to its methodology. SAGE Publications, Inc. https://doi.org/10.4135/9781071878781
Neuendorf, K. A. (2017). The content analysis guidebook. SAGE Publications, Inc. https://doi.org/10.4135/9781071802878
Rössler, P. (2017). Inhaltsanalyse (3., völlig überarbeitete Auflage). UVK Verlagsgesellschaft mbH mit UVK/Lucius.

about text as data & automated text analysis (with relation to surveys)

Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as data: A new framework for machine learning and the social sciences. Princeton University Press.
Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., Häussler, T., Schmid-Petri, H., & Adam, S. (2018). Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology. Communication Methods and Measures, 12(2-3), 93–118. https://doi.org/10.1080/19312458.2018.1430754
Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064–1082. https://doi.org/10.1111/ajps.12103

about embeddings

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding (J. Burstein, C. Doran, & T. Solorio, Eds.; pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. https://doi.org/10.48550/ARXIV.1301.3781

about LLMs as coders (assistance)

Ashwin, J., Chhabra, A., & Rao, V. (2025). Using Large Language Models for Qualitative Analysis can Introduce Serious Bias. Sociological Methods & Research. https://doi.org/10.1177/00491241251338246
Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30). https://doi.org/10.1073/pnas.2305016120

about image analysis (OCR, object detection, multimodal)

Smith, R. (2007). An overview of the tesseract OCR engine. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2, 629–633. https://doi.org/10.1109/icdar.2007.4376991
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2015). You only look once: Unified, real-time object detection. https://doi.org/10.48550/ARXIV.1506.02640
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. https://doi.org/10.48550/ARXIV.2103.00020

about geospatial data & ESM (geofencing & GEMA)

Haas, G.-C., Trappmann, M., Keusch, F., & Bähr, S. (2020). Using Geofences to Collect Survey Data: Lessons Learned From the IAB-SMART Study. Survey Methods: Insights from the Field (SMIF). https://doi.org/10.13094/SMIF-2020-00023
Kingsbury, C., Buzzi, M., Chaix, B., Kanning, M., Khezri, S., Kiani, B., Kirchner, T. R., Maurel, A., Thierry, B., & Kestens, Y. (2024). STROBE-GEMA: a STROBE extension for reporting of geographically explicit ecological momentary assessment studies. Archives of Public Health, 82(1). https://doi.org/10.1186/s13690-024-01310-8
Zhang, Y., Li, D., Li, X., Zhou, X., & Newman, G. (2024). The integration of geographic methods and ecological momentary assessment in public health research: A systematic review of methods and applications. Social Science & Medicine, 354, 117075. https://doi.org/10.1016/j.socscimed.2024.117075

Useful tools

R packages

tidyverse, stringr, tidyr – cleaning, reshaping, string ops
quanteda, quanteda.textstats, quanteda.textmodels – text as data workflows (DFM, dictionaries, stats)
tidytext – tidy text mining, lexicons, token workflows
stm – structural topic models (STM), incl. open-ended responses
udpipe, spacyr – tokenization, POS-tagging, lemmatization
text2vec – embeddings + similarity/clustering pipelines
sf, lwgeom, geosphere – spatial data + distances/geometry
tmap, leaflet – interactive/static maps for exploration

Python packages

pandas, numpy – data wrangling
scikit-learn – vectorization, clustering, evaluation
spacy, nltk – NLP preprocessing
gensim – topic models + word embeddings
sentence-transformers, transformers – embeddings + LLM/Transformer tooling
bertopic – topic modeling on embeddings
pytesseract + opencv-python – OCR pipelines + image preprocessing
torch, torchvision – deep learning base stack (if needed)
ultralytics (YOLO) – object detection workflows
geopandas, shapely, pyproj – geodata + projections
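
To make the geodata entry above more concrete, here is a minimal sketch of a circular geofence check, the kind of rule a geofencing or GEMA study might use to decide whether a GPS fix should trigger a survey. It uses only the Python standard library and the haversine formula. The coordinates, the 200 m radius, and the function names `haversine_m` and `inside_geofence` are illustrative assumptions, not taken from any of the studies listed. In a real project, geopandas and shapely with a projected coordinate system would likely be the more robust choice.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def inside_geofence(point, center, radius_m):
    """Return True if `point` (lat, lon) lies within `radius_m` metres of `center`."""
    return haversine_m(*point, *center) <= radius_m


# Hypothetical fence and GPS fixes (assumed values, for illustration only)
fence_center = (49.4875, 8.4660)
fence_radius_m = 200
fixes = {
    "near": (49.4880, 8.4665),  # a few dozen metres from the centre
    "far": (49.5000, 8.5000),   # well over a kilometre away
}

# Flag each fix as inside or outside the fence
flags = {name: inside_geofence(p, fence_center, fence_radius_m) for name, p in fixes.items()}
print(flags)
```

A fixed-radius circle is the simplest fence shape. Polygon fences (for example, building footprints) would need a point-in-polygon test instead, which is what shapely provides.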
