Completed Bachelor Thesis
Named entity recognition (NER) is an important prerequisite when searching for information in large repositories of text. Texts from the biomedical domain come with additional challenges: new medical terms and drug names are coined on a daily basis, so a robust NER system must be able to recognize mentions of entities that it has never seen before.
The goal of this bachelor thesis project is to build a Python-based NER system for the biomedical domain, starting from an existing Java implementation that uses LSTMs and letter-trigrams [1] and extending it to use pre-trained letter n-gram representations [2].
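To illustrate why letter n-grams help with previously unseen terms, the following minimal sketch splits a word into letter trigrams. It is not the implementation from [1] or [2]: the function name `char_ngrams` is hypothetical, and the `<` and `>` boundary markers are an assumption borrowed from the convention described in [2].

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Return the letter n-grams of `word` for every length in [n_min, n_max].

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams distinct from word-internal ones (assumed convention).
    """
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


# A drug name seen during training and one that is new at test time.
seen = set(char_ngrams("ibuprofen"))
unseen = char_ngrams("ketoprofen")

# Trigrams of the new word that were already observed in the known word.
shared = [g for g in unseen if g in seen]
print(shared)
```

Although "ketoprofen" as a whole word is new, several of its trigrams (including the suffix) were already observed in "ibuprofen", so a model operating on n-grams still has some signal to work with.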
The quality of the system will be evaluated against existing implementations on publicly available benchmarks such as MedMentions [3], the largest publicly available biomedical dataset, in which 42% of the entities in the test set are not seen at training time.
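A statistic like the share of unseen test entities can be computed along the following lines. This is only a sketch under stated assumptions: the function name `unseen_fraction` is hypothetical, the identifiers are toy values rather than MedMentions data, and the fraction is assumed to be counted over distinct entity identifiers, which may differ from how the figure in [3] was obtained.

```python
def unseen_fraction(train_entities, test_entities):
    """Fraction of distinct test entities that never occur in training."""
    train = set(train_entities)
    test = set(test_entities)
    if not test:
        return 0.0
    return len(test - train) / len(test)


# Toy identifiers (hypothetical, not taken from any real dataset).
train_ids = ["C1", "C2", "C3", "C2"]
test_ids = ["C2", "C3", "C4", "C5"]

ratio = unseen_fraction(train_ids, test_ids)
print(f"{ratio:.0%} of test entities are unseen")
```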
[1] Sebastian Arnold, Felix A. Gers, Torsten Kilias, Alexander Löser: Robust Named Entity Recognition in Idiosyncratic Domains. arXiv 2016. https://arxiv.org/abs/1608.06757
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov: Enriching Word Vectors with Subword Information. TACL 2017. https://www.aclweb.org/anthology/Q17-1010/
[3] Sunil Mohan, Donghui Li: MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts. AKBC 2019. https://openreview.net/pdf?id=SylxCx5pTQ