Most annotated biomedical datasets and resources are available in English, leaving a significant gap for languages such as German. This scarcity hampers the development of biomedical AI models for languages other than English. Our study, "The Aluminum Standard: Using Generative Artificial Intelligence Tools to Synthesize and Annotate Non-Structured Patient Data," recently published in BMC Medical Informatics and Decision Making, explores the use of generative AI to produce synthetic German medical text that closely emulates authentic medical narratives. The narratives are crafted to reflect realistic medical conditions while protecting sensitive patient data. We also take the relationships between diseases and comorbidities into account, which makes the synthetic narratives more realistic and clinically relevant. The resulting data gives a more nuanced representation of patient cases, which is crucial for training robust biomedical AI models.
To evaluate the quality of the synthetic data, we trained Named Entity Recognition (NER) models on it. The models reached a precision of up to 0.8, which points to the potential of synthetic data in medical AI applications.
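For readers unfamiliar with the metric, the sketch below shows one common way to compute precision for NER: the share of predicted entity spans that exactly match a gold span in both boundaries and label. It is a minimal illustration, not the evaluation code from the paper. The BIO tagging scheme, the exact-match criterion, the function names, and the labels `DIAGNOSIS` and `MEDICATION` are all assumptions made for this example.

```python
from typing import List, Set, Tuple

# (start_token, end_token_exclusive, label); hypothetical representation
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes a span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def precision(gold: List[str], pred: List[str]) -> float:
    """Exact-match, span-level precision: correct predictions / all predictions."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    if not pred_spans:
        return 0.0  # convention assumed here when nothing is predicted
    return len(gold_spans & pred_spans) / len(pred_spans)


# Illustrative tags only; the labels are made up for this example.
gold = ["B-DIAGNOSIS", "I-DIAGNOSIS", "O", "B-MEDICATION", "O"]
pred = ["B-DIAGNOSIS", "I-DIAGNOSIS", "O", "B-MEDICATION", "B-MEDICATION"]
score = precision(gold, pred)
```

In this toy case the model predicts three entities, two of which match the gold annotation, so the precision is two thirds. A precision of 0.8 would mean that four out of five predicted entities are correct under whichever matching criterion the evaluation uses.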
For more details, refer to the paper "The aluminum standard: using generative Artificial Intelligence tools to synthesize and annotate non-structured patient data" by Juan Diaz, Faizan Mustafa, Felix Weil, Yi Wang, Kudret Kama, and Markus Knott, published in BMC Med Inform Decis Mak 24, 409 (2024). https://doi.org/10.1186/s12911-024-02825-4