New NAACL paper proposes a principled way to estimate the uncertainty of knowledge graph link predictions

January 29, 2025

Researchers from the Institute for Artificial Intelligence will present their work at the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025) in Albuquerque, New Mexico, April 29–May 4, 2025.

A knowledge graph is a set of relationships between entities. For example, the relationship "Alice is diagnosed with common cold" relates the entities "Alice" and "common cold" through a diagnosis. The link prediction task consists of predicting a missing entity in such a relationship: the x in "Alice is diagnosed with x" could be filled with "common cold" or "lung cancer."
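
For illustration, a knowledge graph can be represented as a set of (head, relation, tail) triples, and a link prediction query as a triple with one position left open. The sketch below uses made-up entities and relations in the spirit of the example above; it is not code or data from the paper.

    # A knowledge graph as a set of (head, relation, tail) triples.
    # Entities and relations are illustrative only.
    knowledge_graph = {
        ("Alice", "diagnosed_with", "common_cold"),
        ("Bob", "diagnosed_with", "lung_cancer"),
        ("common_cold", "treated_with", "rest"),
    }

    # A link prediction query: which entity x completes ("Alice", "diagnosed_with", x)?
    query = ("Alice", "diagnosed_with", None)

    # The candidate answers are all entities that appear in the graph.
    entities = {e for head, _, tail in knowledge_graph for e in (head, tail)}
    print(sorted(entities))  # ['Alice', 'Bob', 'common_cold', 'lung_cancer', 'rest']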

Existing methods assign a score to every candidate entity, so that one can select the k entities with the highest scores. However, these scores carry no uncertainty information: they are not calibrated and lack a probabilistic interpretation.
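
As a sketch of this ranking-based approach, the snippet below scores each candidate with invented numbers (a real system would derive them from learned embeddings) and keeps the top k.

    # Hypothetical KGE scores for the query "Alice is diagnosed with x".
    # The numbers are invented for illustration; a real model computes them
    # from learned entity and relation embeddings.
    scores = {"common_cold": 4.2, "flu": 3.9, "lung_cancer": 1.1, "rest": -0.5}

    k = 2
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    print(top_k)  # ['common_cold', 'flu']

    # Note: the raw scores are not calibrated probabilities, so the gap between
    # two answers says nothing reliable about how much likelier one is to be true.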

In the paper accepted at NAACL 2025, "Conformalized Answer Set Prediction for Knowledge Graph Embedding," Yuqicheng Zhu, Nico Potyka, Jiarong Pan, Bo Xiong, Yunjie He, Evgeny Kharlamov, and Steffen Staab study how many entities are needed to guarantee that the predicted answer set contains the actual answer at a pre-defined confidence level (e.g., 90%). The more entities the set needs to contain, the more uncertain the model is about its prediction.
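
The paper builds on conformal prediction. Below is a minimal sketch of the standard split-conformal recipe applied to link prediction; the nonconformity score (the negated model score of an answer) and all numbers are assumptions made for illustration, not the paper's exact construction.

    import math

    def conformal_threshold(calibration_scores, alpha=0.1):
        # calibration_scores: nonconformity scores of the true answers on held-out
        # calibration queries. Returns a threshold that gives 1 - alpha coverage.
        n = len(calibration_scores)
        rank = math.ceil((n + 1) * (1 - alpha))
        if rank > n:
            return float("inf")  # too little calibration data: include everything
        return sorted(calibration_scores)[rank - 1]

    def answer_set(candidate_scores, threshold):
        # Include every entity whose nonconformity (negated KGE score) falls
        # below the calibrated threshold.
        return {e for e, s in candidate_scores.items() if -s <= threshold}

    # Toy calibration data: negated model scores of the true answers.
    calibration = [-3.1, -2.7, -4.0, -1.9, -3.5, -2.2, -2.9, -3.8, -2.5, -3.0]
    q_hat = conformal_threshold(calibration, alpha=0.1)

    test_scores = {"common_cold": 4.2, "flu": 3.9, "lung_cancer": 1.1}
    print(answer_set(test_scores, q_hat))  # {'common_cold', 'flu'}

Queries for which the set has to contain many entities are exactly the queries the model is uncertain about.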

Title: Conformalized Answer Set Prediction for Knowledge Graph Embedding
Authors: Yuqicheng Zhu, Nico Potyka, Jiarong Pan, Bo Xiong, Yunjie He, Evgeny Kharlamov, and Steffen Staab
Preprint: https://arxiv.org/pdf/2408.08248v3
Abstract: Knowledge graph embeddings (KGE) apply machine learning methods on knowledge graphs (KGs) to provide non-classical reasoning capabilities based on similarities and analogies. The learned KG embeddings are typically used to answer queries by ranking all potential answers, but rankings often lack a meaningful probabilistic interpretation: lower-ranked answers do not necessarily have a lower probability of being true. This limitation makes it difficult to quantify the uncertainty of a model's predictions, posing challenges for the application of KGE methods in high-stakes domains like medicine. We address this issue by applying the theory of conformal prediction, which allows generating answer sets that contain the correct answer with probabilistic guarantees. We explain how conformal prediction can be used to generate such answer sets for link prediction tasks. Our empirical evaluation on four benchmark datasets using six representative KGE methods validates that the generated answer sets satisfy the probabilistic guarantees given by the theory of conformal prediction. We also demonstrate that the generated answer sets often have a sensible size and that the size adapts well with respect to the difficulty of the query.
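
As a sketch of how such a guarantee can be checked empirically (the answer sets and ground-truth answers below are invented, not taken from the paper's benchmarks), one counts how often the true answer falls inside the predicted set and tracks the average set size:

    # Hypothetical answer sets and ground-truth answers for three test queries.
    answer_sets = [{"common_cold", "flu"}, {"lung_cancer"}, {"flu", "bronchitis"}]
    true_answers = ["common_cold", "lung_cancer", "pneumonia"]

    hits = sum(truth in s for s, truth in zip(answer_sets, true_answers))
    coverage = hits / len(true_answers)  # fraction of sets containing the truth
    avg_size = sum(len(s) for s in answer_sets) / len(answer_sets)

    print(coverage, avg_size)  # 0.666... 1.666...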
