Session 4
📚 From Pretest to Analysis
Overview
In this session, we move from fieldwork and pretesting to analysis planning, with a focus on digital behavioral data and on text, image, and geo data as extensions of classic survey data.
Participate
🖥️ Session 04
Useful resources
about content analysis in general
Krippendorff, K. (2019). Content analysis: An introduction to its methodology. SAGE Publications, Inc. https://doi.org/10.4135/9781071878781
Neuendorf, K. A. (2017). The content analysis guidebook. SAGE Publications, Inc. https://doi.org/10.4135/9781071802878
Rössler, P. (2017). Inhaltsanalyse (3., völlig überarbeitete Auflage). UVK Verlagsgesellschaft mbH mit UVK/Lucius.
about text as data & automated text analysis (with relation to surveys)
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as data: A new framework for machine learning and the social sciences. Princeton University Press.
Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., Häussler, T., Schmid-Petri, H., & Adam, S. (2018). Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology. Communication Methods and Measures, 12(2-3), 93–118. https://doi.org/10.1080/19312458.2018.1430754
Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064–1082. https://doi.org/10.1111/ajps.12103
about embeddings
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding (J. Burstein, C. Doran, & T. Solorio, Eds.; pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. https://doi.org/10.48550/ARXIV.1301.3781
about LLMs as coders (assistance)
Ashwin, J., Chhabra, A., & Rao, V. (2025). Using Large Language Models for Qualitative Analysis can Introduce Serious Bias. Sociological Methods & Research. https://doi.org/10.1177/00491241251338246
Gilardi, F., Alizadeh, M., & Kubli, M. (2023). ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30). https://doi.org/10.1073/pnas.2305016120
about image analysis (OCR, object detection, multimodal)
Smith, R. (2007). An overview of the tesseract OCR engine. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2, 629–633. https://doi.org/10.1109/icdar.2007.4376991
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2015). You only look once: Unified, real-time object detection. https://doi.org/10.48550/ARXIV.1506.02640
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. https://doi.org/10.48550/ARXIV.2103.00020
about geospatial data & ESM (geofencing & GEMA)
Haas, G.-C., Trappmann, M., Keusch, F., & Bähr, S. (2020). Using Geofences to Collect Survey Data: Lessons Learned From the IAB-SMART Study. SMIF. https://doi.org/10.13094/SMIF-2020-00023
Kingsbury, C., Buzzi, M., Chaix, B., Kanning, M., Khezri, S., Kiani, B., Kirchner, T. R., Maurel, A., Thierry, B., & Kestens, Y. (2024). STROBE-GEMA: a STROBE extension for reporting of geographically explicit ecological momentary assessment studies. Archives of Public Health, 82(1). https://doi.org/10.1186/s13690-024-01310-8
Zhang, Y., Li, D., Li, X., Zhou, X., & Newman, G. (2024). The integration of geographic methods and ecological momentary assessment in public health research: A systematic review of methods and applications. Social Science & Medicine, 354, 117075. https://doi.org/10.1016/j.socscimed.2024.117075
Useful tools
R packages
tidyverse, stringr, tidyr – cleaning, reshaping, string ops
quanteda, quanteda.textstats, quanteda.textmodels – text as data workflows (DFM, dictionaries, stats)
tidytext – tidy text mining, lexicons, token workflows
stm – structural topic models (STM), incl. open-ended responses
udpipe, spacyr – tokenization, POS-tagging, lemmatization
text2vec – embeddings + similarity/clustering pipelines
sf, lwgeom, geosphere – spatial data + distances/geometry
tmap, leaflet – interactive/static maps for exploration
Python packages
pandas, numpy – data wrangling
scikit-learn – vectorization, clustering, evaluation
spacy, nltk – NLP preprocessing
gensim – topic models + word embeddings
sentence-transformers, transformers – embeddings + LLM/Transformer tooling
bertopic – topic modeling on embeddings
pytesseract + opencv-python – OCR pipelines + image preprocessing
torch, torchvision – deep learning base stack (if needed)
ultralytics (YOLO) – object detection workflows
geopandas, shapely, pyproj – geodata + projections
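To make the "vectorization" step named above concrete, the following is a minimal sketch, using only the Python standard library, of what a document-feature matrix (DFM) and a cosine similarity between documents look like. The example responses, the whitespace tokenizer, and the helper names (`tokenize`, `to_vector`, `cosine`) are illustrative assumptions and not part of the course materials. In practice, packages such as scikit-learn or quanteda would handle tokenization, weighting, and sparse storage.

```python
import math
from collections import Counter

# Hypothetical open-ended survey responses (illustrative data only).
responses = [
    "I mostly read news on my phone",
    "I read the news in the local paper",
    "My phone tracks my location all day",
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


# Vocabulary: every distinct token across all responses, in sorted order.
vocab = sorted({tok for doc in responses for tok in tokenize(doc)})


def to_vector(text):
    """Count how often each vocabulary term occurs in one response."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocab]


# Document-feature matrix: one row per response, one column per term.
dfm = [to_vector(doc) for doc in responses]


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Pairwise similarity of the first response with the other two.
sim_01 = cosine(dfm[0], dfm[1])
sim_02 = cosine(dfm[0], dfm[2])
print(round(sim_01, 3), round(sim_02, 3))
```

Raw counts treat frequent function words ("I", "my", "the") like any other term, which is one reason real workflows usually add stopword removal, TF-IDF weighting, or embeddings before comparing documents.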