B.Sc. Thesis: Collection of a German Biomedical Text Corpus from Public Sources

Recent successes in Natural Language Processing (NLP) are based on pre-training language models on large datasets of unlabelled text. In the medical domain, however, such large datasets are hard to acquire. Especially in the German medical domain, very few public text datasets are available, which limits the availability of pre-trained language models and therefore the success of NLP in this domain. Instead of relying on clinical texts, which are typically hard to acquire due to privacy issues, we will therefore create a large German medical corpus based on public sources. These public sources include academic publications, books, and dissertations (e.g. from the university library), as well as online sources like Wikipedia. Additionally, we will create a smaller paired German-English corpus from bilingual thesauri and ontologies like the Unified Medical Language System (UMLS).

Your tasks include

  • Selection of sources for the corpus
  • Automatic extraction from the sources (e.g. web crawling, text/section extraction from PDF files, extraction from structured files like CSV or XML) and data cleaning (if required)
  • Definition of the target structure for the corpus and integration of all selected sources into this structure
  • Exploratory data analysis, i.e. computing dataset statistics and visualising properties of the dataset
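
To give a rough idea of what a target structure and basic dataset statistics could look like, here is a minimal Python sketch. It is an illustration under assumptions, not a prescribed design: the class `CorpusDocument`, its field names, the helpers `clean_text`, `to_jsonl` and `corpus_statistics`, and the choice of JSON Lines as storage format are all hypothetical, and whitespace splitting is only a crude stand-in for proper tokenisation.

```python
import json
import re
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class CorpusDocument:
    """One corpus entry. All field names are illustrative assumptions."""
    doc_id: str
    source: str    # hypothetical source label, e.g. "wikipedia" or "dissertation"
    language: str  # language code, e.g. "de"
    title: str
    text: str


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def to_jsonl(documents: list[CorpusDocument]) -> str:
    """Serialise documents as JSON Lines, one JSON object per line."""
    # ensure_ascii=False keeps German umlauts readable in the output.
    return "\n".join(json.dumps(asdict(d), ensure_ascii=False) for d in documents)


def corpus_statistics(documents: list[CorpusDocument]) -> dict:
    """Count documents and whitespace-separated tokens, overall and per source."""
    docs_per_source = Counter()
    tokens_per_source = Counter()
    for d in documents:
        docs_per_source[d.source] += 1
        tokens_per_source[d.source] += len(d.text.split())
    return {
        "documents": len(documents),
        "tokens": sum(tokens_per_source.values()),
        "documents_per_source": dict(docs_per_source),
        "tokens_per_source": dict(tokens_per_source),
    }


# Toy example with made-up documents (not real corpus data).
example_docs = [
    CorpusDocument("1", "wikipedia", "de", "Herz",
                   clean_text("Das  Herz\n ist ein Organ.")),
    CorpusDocument("2", "dissertation", "de", "Atmung",
                   clean_text("Die Lunge dient der Atmung.")),
]
print(to_jsonl(example_docs))
print(corpus_statistics(example_docs))
```

A single flat record type like this makes it easy to merge heterogeneous sources into one file, and the per-source counters are a natural starting point for the plots in the data analysis step. The actual structure is part of the thesis work and may look quite different.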


Requirements

  • Advanced programming skills in Python
  • Experience in web crawling or data collection is preferable but not required

Note: Experience with machine learning is NOT required.

Philip Müller
PhD Student

My research interests include applications of multi-modal learning in radiology, with a focus on image and text modalities.