B.Sc. Thesis: Collection of a German Biomedical Text Corpus from Public Sources

May 31, 2022

Recent successes in Natural Language Processing (NLP) are based on pre-training language models on large datasets of unlabelled text. In the medical domain, however, such large datasets are hard to acquire. In the German medical domain in particular, very few public text datasets are available, which limits the availability of pre-trained language models and therefore the success of NLP in this domain. Instead of relying on clinical texts, which are typically hard to acquire due to privacy issues, we will therefore create a large German medical corpus based on public sources. These public sources include academic publications, books, and dissertations (e.g. from the university library), as well as online sources like Wikipedia. Additionally, we will create a smaller paired German-English corpus from bilingual thesauri and ontologies like the Unified Medical Language System (UMLS).

Your tasks include

Selection of sources for the corpus
Automatic extraction from the sources (e.g. web crawling, text/section extraction from PDF files, extraction from structured files like CSV or XML) and data cleaning (if required)
Definition of the target structure for the corpus and integration of all selected sources into this structure
Exploratory data analysis, i.e. computing dataset statistics and visualising properties of the dataset
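
To give a rough idea of the last two tasks, the following minimal sketch shows one possible target structure and a few basic dataset statistics. Everything in it is an illustrative assumption and not a prescribed design: the `CorpusDocument` fields, the JSON Lines output format, the whitespace tokenisation, and the two sample documents are all hypothetical, and the actual structure is to be defined as part of the thesis.

```python
import json
import statistics
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class CorpusDocument:
    """One document in a hypothetical unified corpus structure."""
    doc_id: str
    source: str  # assumed source label, e.g. "wikipedia" or "dissertation"
    title: str
    text: str


def to_jsonl(docs):
    """Serialise documents as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(asdict(d), ensure_ascii=False) for d in docs)


def corpus_statistics(docs):
    """Compute simple statistics using naive whitespace tokenisation."""
    token_counts = [len(d.text.split()) for d in docs]
    return {
        "documents": len(docs),
        "tokens": sum(token_counts),
        "mean_tokens_per_document": statistics.mean(token_counts) if token_counts else 0,
        "documents_per_source": dict(Counter(d.source for d in docs)),
    }


# Made-up sample documents, only for demonstration.
docs = [
    CorpusDocument("wiki-1", "wikipedia", "Diabetes mellitus",
                   "Diabetes mellitus ist eine Stoffwechselerkrankung."),
    CorpusDocument("diss-1", "dissertation", "Beispiel",
                   "Dies ist ein kurzer Beispieltext."),
]

stats = corpus_statistics(docs)
print(to_jsonl(docs))
print(stats)
```

Storing one document per line keeps the corpus easy to stream and to extend with further sources. A real analysis would likely replace the whitespace split with a proper German tokeniser.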

Requirements

Advanced programming skills in Python
Experience in web crawling or data collection is preferable but not required

Note: Experience with machine learning is NOT required.

Philip Müller

PhD Student