Running Master Thesis
Description
Consider the sentence "After the accident, the doctor recommended a CT scan." The text span "CT" could stand for Computed Tomography, Cognitive Therapy, or Clinical Trial, and it is difficult for a machine to determine which meaning is intended. Biomedical entity linking is the task of associating spans of text, called mentions, with entities in biomedical knowledge bases (KBs) such as the UMLS (Mohan and Li, 2019). Linking resolves the ambiguities inherent in human language by connecting each mention to a unique identifier in a knowledge base. In the example above, linking the mention "CT" to its UMLS Concept Unique Identifier (CUI), C0040405, establishes that it refers to Computed Tomography.
Modern entity linking systems typically have two main components: a candidate generation module and a mention disambiguation module. The former retrieves entities from a knowledge base that could match a text mention, while the latter scores these candidates and selects the best match. The current state-of-the-art entity linker BLINK (Wu et al., 2020) uses a bi-encoder consisting of two independent Transformers (Vaswani et al., 2017) for the candidate generation step. For the mention disambiguation step, BLINK uses another Transformer that encodes the mention concatenated with an entity description, that is, the textual description of a candidate entity for that mention. This Transformer is called a cross-encoder because it allows cross-attention between the mention and the entity description.
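The two-stage design described above can be illustrated with a minimal sketch. It is not BLINK's implementation: the Transformer encoders are replaced by toy bag-of-words functions so that the example runs without any model, and the knowledge base entries, their identifiers (`KB:1` to `KB:3`), their descriptions, and all function names are hypothetical and chosen only for illustration. What the sketch keeps is the structure: the first stage encodes mention and entities independently and ranks by dot product, and the second stage scores each mention-candidate pair jointly.

```python
import math
import re
from collections import Counter

# Hypothetical toy knowledge base: IDs and descriptions are invented for illustration.
TOY_KB = {
    "KB:1": "computed tomography: an imaging scan that uses x-rays, "
            "often ordered by a doctor after an accident",
    "KB:2": "cognitive therapy: a form of psychotherapy that targets "
            "negative thought patterns",
    "KB:3": "clinical trial: a research study that tests a treatment "
            "in human participants",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for a Transformer encoder: L2-normalized bag-of-words vector."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def generate_candidates(mention_context, kb, k):
    """Stage 1 (bi-encoder style): encode both sides independently, rank by dot product."""
    mention_vec = embed(mention_context)
    # Entity vectors do not depend on the mention, so they could be precomputed once.
    entity_vecs = {eid: embed(desc) for eid, desc in kb.items()}
    ranked = sorted(kb, key=lambda eid: dot(mention_vec, entity_vecs[eid]), reverse=True)
    return ranked[:k]


def cross_score(mention_context, description):
    """Stage 2 stand-in (cross-encoder style): score the pair jointly.

    Here: the fraction of distinct description tokens that occur in the mention context.
    """
    context_tokens = set(tokenize(mention_context))
    description_tokens = set(tokenize(description))
    if not description_tokens:
        return 0.0
    return len(context_tokens & description_tokens) / len(description_tokens)


def link(mention_context, kb, k=2):
    """Retrieve k candidates, then rerank them and return the best entity ID."""
    candidates = generate_candidates(mention_context, kb, k)
    return max(candidates, key=lambda eid: cross_score(mention_context, kb[eid]))


sentence = "After the accident, the doctor recommended a CT scan."
print(generate_candidates(sentence, TOY_KB, k=2))
print(link(sentence, TOY_KB))
```

The split reflects a cost trade-off: independent encoding keeps the first stage cheap enough to search the whole knowledge base, while the more expensive joint scoring is applied only to the few retrieved candidates.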
One goal of the thesis is to retrain the BLINK model for the biomedical domain. Experiments will be conducted in parallel on two datasets: MedMentions (Mohan and Li, 2019) and WikiMed (Vashishth et al., 2021). You will also compare your model with biomedical entity linkers built on other model architectures, such as ScispaCy (Neumann et al., 2019) and GenBioEL (Yuan et al., 2022). As an optional challenge, you can train a biomedical entity linker for German. This task is harder than for English because there is a lack of German entity descriptions in UMLS.
References
- Mohan, S. and Li, D. (2019). MedMentions: A large biomedical corpus annotated with UMLS concepts. In 1st Conference on Automated Knowledge Base Construction, AKBC 2019, Amherst, MA, USA, May 20-22, 2019.
- Neumann, M., King, D., Beltagy, I., and Ammar, W. (2019). ScispaCy: Fast and robust models for biomedical natural language processing. In Demner-Fushman, D., Cohen, K. B., Ananiadou, S., and Tsujii, J., editors, Proceedings of the 18th BioNLP Workshop and Shared Task, BioNLP@ACL 2019, Florence, Italy, August 1, 2019, pages 319–327. Association for Computational Linguistics.
- Vashishth, S., Newman-Griffis, D., Joshi, R., Dutt, R., and Rosé, C. P. (2021). Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets. J. Biomed. Informatics, 121:103880.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R., editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
- Wu, L., Petroni, F., Josifoski, M., Riedel, S., and Zettlemoyer, L. (2020). Scalable zero-shot entity linking with dense entity retrieval. In Webber, B., Cohn, T., He, Y., and Liu, Y., editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6397–6407. Association for Computational Linguistics.
- Yuan, H., Yuan, Z., and Yu, S. (2022). Generative biomedical entity linking via knowledge base-guided pre-training and synonyms-aware fine-tuning. In Carpuat, M., de Marneffe, M.-C., and Meza Ruiz, I. V., editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4038–4048, Seattle, United States. Association for Computational Linguistics.