Description
Natural language processing tools that recognize names from the biomedical domain (e.g. of diseases, symptoms, drugs, genes, proteins, etc.) and link them to established entity identifiers (i.e. those of the Unified Medical Language System, UMLS [1]) are an important resource for handling the large volumes of text that are typical of this domain.
However, while several such biomedical named entity recognition (NER) and named entity linking (NEL) tools exist for English [2, 3], there are no good off-the-shelf solutions for German. The main reason is the lack of German corpora annotated with biomedical information. Such annotations are time-consuming and expensive to produce because they require expert knowledge.
The goal of this thesis is to create a German counterpart of MedMentions [4], the largest English NER/NEL corpus. The MedMentions annotations focus on the biomedical entities mentioned in PubMed abstracts; for example, the entity "cystic fibrosis" appears in the example below from position 67 to position 82. MedMentions is annotated with Concept Unique Identifiers (CUIs, e.g. C0854135 for "chronic Pseudomonas aeruginosa infection") and semantic types (TUIs, e.g. T047 for "Disease or Syndrome") from the UMLS.
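The offset convention can be made concrete with a minimal Python sketch. It assumes that the example sentence is the MedMentions title quoted in the code (the sentence itself is not reproduced in this description), that offsets are 0-based with an exclusive end, and that C0010674 is the UMLS CUI for cystic fibrosis. The `Mention` class and `is_consistent` function are hypothetical helpers, not part of MedMentions.

```python
from dataclasses import dataclass

# Assumed example text: the title of a MedMentions abstract.
TITLE = ("DCTN4 as a modifier of chronic Pseudomonas aeruginosa "
         "infection in cystic fibrosis")


@dataclass(frozen=True)
class Mention:
    """Hypothetical container for one annotated mention."""
    start: int  # 0-based offset of the first character
    end: int    # offset one past the last character
    text: str   # surface string of the mention
    tui: str    # UMLS semantic type identifier
    cui: str    # UMLS Concept Unique Identifier


MENTIONS = [
    Mention(23, 63, "chronic Pseudomonas aeruginosa infection", "T047", "C0854135"),
    Mention(67, 82, "cystic fibrosis", "T047", "C0010674"),
]


def is_consistent(doc: str, mention: Mention) -> bool:
    """Check that the offsets really select the mention text."""
    return doc[mention.start:mention.end] == mention.text


for m in MENTIONS:
    print(m.cui, m.text, is_consistent(TITLE, m))
```

A check of this kind is cheap to run over a whole corpus and catches off-by-one errors in the offsets early.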
The idea is to use LLMs to translate the English text into German and to match the two parallel texts, thus transferring the annotations from English to German.
While such efforts have been undertaken before (for other corpora and language pairs) using translation and alignment software [5, 6], LLMs have the potential to drastically reduce the time required to create such a resource, thus helping to level the playing field for low-resource languages. Here is a proof-of-concept example of a sentence from MedMentions, translated and aligned by ChatGPT 5 Thinking.
The resulting annotations contain all the information needed to create the German counterpart of MedMentions: the title and abstract text translated into German, the mentions in the text together with their start and end index, as well as the corresponding CUI.
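One possible way to derive the German start and end indices is sketched below in Python. It assumes that the LLM returns, for each English mention, the corresponding German surface string together with its CUI, and that these pairs are listed in order of appearance in the German text. The German sentence is a hand-written illustration rather than actual model output, and `Span` and `project_mentions` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple

# Hand-written illustrative translation, not the output of any model.
GERMAN_TITLE = ("DCTN4 als Modifikator der chronischen "
                "Pseudomonas-aeruginosa-Infektion bei zystischer Fibrose")

# Assumed alignment output: (German mention, CUI), in order of appearance.
ALIGNED = [
    ("chronischen Pseudomonas-aeruginosa-Infektion", "C0854135"),
    ("zystischer Fibrose", "C0010674"),
]


class Span(NamedTuple):
    start: int
    end: int
    text: str
    cui: str


def project_mentions(
    german_text: str, aligned: List[Tuple[str, str]]
) -> Tuple[List[Span], List[Tuple[str, str]]]:
    """Locate each aligned German mention in the translated text.

    Returns the spans that were found and the pairs that could not be
    located (e.g. because the LLM paraphrased the mention).
    """
    spans, unresolved = [], []
    cursor = 0  # search left to right so repeated mentions resolve in order
    for mention, cui in aligned:
        start = german_text.find(mention, cursor)
        if start == -1:
            unresolved.append((mention, cui))
            continue
        end = start + len(mention)
        spans.append(Span(start, end, mention, cui))
        cursor = end
    return spans, unresolved


spans, unresolved = project_mentions(GERMAN_TITLE, ALIGNED)
for s in spans:
    print(s.start, s.end, s.text, s.cui)
```

Pairs that end up in `unresolved` are natural candidates for the manual check, since they indicate that the translation and the alignment disagree.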
The resulting annotations should be evaluated in two ways:
- A sample should be manually checked for correctness.
- The correctness of the entity names and the generated links should be cross-checked and improved upon using existing German data in the UMLS and Wikidata (e.g. the UMLS CUI C0854135 has the English label 'chronic Pseudomonas aeruginosa infection' and the German label 'Infektion infolge von Pseudomonas aeruginosa').
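The automatic cross-check could look roughly like the following Python sketch. It assumes that the German labels have already been collected into a dictionary keyed by CUI (for instance from the German entries of the UMLS Metathesaurus); only the single label below is taken from this description. The token-overlap measure, the threshold of 0.5, and all function names are illustrative assumptions rather than a prescribed method.

```python
import unicodedata
from typing import Dict, List

# German labels per CUI; the one entry is the example given in the text.
GERMAN_LABELS: Dict[str, List[str]] = {
    "C0854135": ["Infektion infolge von Pseudomonas aeruginosa"],
}


def normalize(text: str) -> str:
    """Case-fold, unify Unicode forms, and treat hyphens as spaces."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.replace("-", " ").split())


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the normalized token sets of two strings."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def check_mention(mention: str, cui: str,
                  labels: Dict[str, List[str]],
                  threshold: float = 0.5) -> str:
    """Classify a German mention against the known German labels of its CUI."""
    candidates = labels.get(cui, [])
    if not candidates:
        return "no_german_label"
    best = max(token_overlap(mention, c) for c in candidates)
    if best == 1.0:
        return "match"
    return "partial" if best >= threshold else "mismatch"


print(check_mention("chronische Pseudomonas-aeruginosa-Infektion",
                    "C0854135", GERMAN_LABELS))
```

Mentions classified as `mismatch` or `no_german_label` could be prioritized when drawing the sample for manual checking.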
The annotated corpus should be used for training named entity recognition/named entity linking models, starting from existing pre-trained language models for German such as medBERT.de [7]. This evaluation should report results for the dev/test splits of the dataset (which should match the English splits).
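As an illustration of how such results might be computed, the following Python sketch calculates strict precision, recall, and F1 over (document, start, end, CUI) tuples. Strict matching, where a prediction counts only if both the span boundaries and the CUI agree with the gold annotation, is one common convention and is assumed here; the thesis may adopt a different matching scheme. The document identifier and the function name are hypothetical.

```python
from typing import Iterable, Tuple

Annotation = Tuple[str, int, int, str]  # (doc_id, start, end, cui)


def strict_prf(gold: Iterable[Annotation],
               predicted: Iterable[Annotation]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with exact span and CUI matching."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with a hypothetical document identifier.
gold = [("doc1", 26, 70, "C0854135"), ("doc1", 75, 93, "C0010674")]
predicted = [("doc1", 26, 70, "C0854135"),  # correct span and CUI
             ("doc1", 86, 93, "C0010674")]  # right CUI, wrong boundaries

print(strict_prf(gold, predicted))  # (0.5, 0.5, 0.5)
```

Running the same function separately on the dev and test annotations yields the per-split figures.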
The publication of a scientific article based on the work done in this thesis is encouraged.
Requirements
- good command of English and German
- good Python programming skills
References
- UMLS (Unified Medical Language System), https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html
- ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing, BioNLP 2019, Mark Neumann, Daniel King, Iz Beltagy and Waleed Ammar, http://aclanthology.lst.uni-saarland.de/W19-5034.pdf
- MedLinker: Medical Entity Linking with Neural Representations and Dictionary Matching, ECIR 2020, Daniel Loureiro and Alípio Mário Jorge, https://doi.org/10.1007/978-3-030-45442-5_29
- MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts, AKBC 2019, Sunil Mohan and Donghui Li, https://openreview.net/pdf?id=SylxCx5pTQ
- Schäfer, H., Idrissi-Yaghir, A., Horn, P., & Friedrich, C. (2022). Cross-Language Transfer of High-Quality Annotations: Combining Neural Machine Translation with Cross-Linguistic Span Alignment to Apply NER to Clinical Texts in a Low-Resource Language. _Proceedings of the 4th Clinical Natural Language Processing Workshop_, 53–62. https://doi.org/10.18653/v1/2022.clinicalnlp-1.6
- Jalili Sabet, M., Dufter, P., Yvon, F., & Schütze, H. (2020). SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings. _Findings of the Association for Computational Linguistics: EMNLP 2020_, 1627–1643. https://doi.org/10.18653/v1/2020.findings-emnlp.147
- Bressem, Keno K. et al. (2024). medBERT.de: A Comprehensive German BERT Model for the Medical Domain. _Expert Systems with Applications_, 237, 121598. doi: 10.1016/j.eswa.2023.121598. https://www.sciencedirect.com/science/article/pii/S0957417423021000
- https://github.com/allenai/scispacy
- https://github.com/medspacy/medspacy