Ensemble approaches for Link Prediction

This thesis explores the application of ensemble learning techniques, specifically bagging and random forests, to improve link prediction with Knowledge Graph Embeddings (KGE). By subsampling entities and relations from the original graph and aggregating the scores of multiple KGE models, it aims to improve the performance of TransE, yielding TransE_rand and TransE_bagg, which are evaluated on the standard benchmarks WN18RR, FB15k-237, and YAGO3-10.

A knowledge graph G is a triple (V, R, E), where V is a set of vertices representing entities, R is a set of relations, and E ⊆ V × R × V is a set of labelled edges representing relationships between entities. Knowledge graphs are typically stored using the W3C standard RDF (Resource Description Framework) [1], which models a graph as a set of triples (h, r, t), where h, t ∈ V are the head and tail entities and r ∈ R is a relation.
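
To make the definitions concrete, the following minimal Python sketch (with illustrative triples, not taken from any benchmark) represents a knowledge graph as a set of RDF-style triples and recovers V, R, and E from it:

    # Illustrative toy triples (h, r, t); not from any benchmark.
    triples = {
        ("Berlin", "capital_of", "Germany"),
        ("Germany", "member_of", "EU"),
    }

    V = {h for h, _, _ in triples} | {t for _, _, t in triples}  # entities
    R = {r for _, r, _ in triples}                               # relations
    E = triples                                                  # labelled edges, E ⊆ V × R × V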

A central problem for knowledge graphs is predicting unknown triples from known ones. Knowledge graph embedding (KGE) is a prominent approach to link prediction that scores the plausibility of links in a vector space. KGE models [3, 8, 9] map the entities h and t and the relation r to elements h, r, and t of a low-dimensional vector space, and score the plausibility of a link (h, r, t) using a score function over h, r, and t.
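
For instance, TransE [3] models a relation as a translation in the embedding space: a triple (h, r, t) is plausible when h + r lies close to t. A minimal NumPy sketch of its score function, with toy random vectors standing in for trained embeddings:

    import numpy as np

    def transe_score(h, r, t, p=1):
        # Negative L_p distance between h + r and t; higher means more plausible.
        return -np.linalg.norm(h + r - t, ord=p)

    # Toy embeddings; real ones would come from training, e.g. with KGEmb.
    rng = np.random.default_rng(0)
    h, r, t = (rng.normal(size=50) for _ in range(3))
    print(transe_score(h, r, t))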
Combining the scores of different KGE models [6, 11, 10], or of several runs of the same KGE model [12], has been shown to improve the performance of link prediction with KGE. The individual models used in such approaches are, however, trained on the entire graph G. On the other hand, ensemble methods like bagging and random forests have demonstrated that combining weak learners, each trained on a subset of the original data, results in a stronger one.

Bagging [4] fits different base models to different randomly sampled versions of the data to encourage the models to make diverse predictions. Random forests [5] extend this method by randomly subsampling features in addition to data points, differentiating the base models even further. In both cases, the final prediction is given by the aggregation of the predictions made by the base models. However, to the best of our knowledge, the idea underlying these ensemble models, deriving scores from subsets of the data, has never been applied to link prediction with KGE. In this thesis, you will reuse the idea behind bagging and random forests and translate it into an ensemble of KGE models. In the link prediction setting, the vertices V constitute the data points, while the relations R constitute the features. Thus, the bagging setting can be reproduced by subsampling the vertices of the graph G, while the random forest setting can be reproduced by subsampling both the vertices and the relations. Using the KGEmb framework, you will (see the sketch after this list):
1. Subsample the entities and/or the relations of the original graph G;
2. Run a KGE model (e.g. TransE) on each subsampled graph;
3. Aggregate the scores provided by each model to compute the final score.
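
The following Python sketch outlines the three steps under these assumptions: training triples are given as a list, model training itself is delegated to KGEmb, and `train_transe` is only a hypothetical placeholder for such a training call:

    import random
    from statistics import mean

    # Toy training triples; in the thesis these would be a benchmark's training split.
    triples = [
        ("a", "r1", "b"), ("b", "r1", "c"), ("a", "r2", "c"),
        ("c", "r2", "d"), ("d", "r1", "a"),
    ]

    def subsample_graph(triples, entity_frac=0.8, relation_frac=None, seed=0):
        # Step 1: keep a random subset of entities (bagging setting); optionally
        # also keep only a random subset of relations (random forest setting).
        rnd = random.Random(seed)
        entities = sorted({e for h, _, t in triples for e in (h, t)})
        kept_e = set(rnd.sample(entities, max(1, int(entity_frac * len(entities)))))
        kept = [(h, r, t) for h, r, t in triples if h in kept_e and t in kept_e]
        if relation_frac is not None:
            relations = sorted({r for _, r, _ in kept})
            kept_r = set(rnd.sample(relations, max(1, int(relation_frac * len(relations)))))
            kept = [(h, r, t) for h, r, t in kept if r in kept_r]
        return kept

    # Step 2: train one KGE model per subgraph. `train_transe` is a hypothetical
    # stand-in for a KGEmb training call returning a model with a score function.
    # score_fns = [train_transe(subsample_graph(triples, 0.8, 0.8, seed=k)).score
    #              for k in range(10)]

    def ensemble_score(score_fns, h, r, t):
        # Step 3: aggregate the base models' scores; averaging, as in bagging,
        # is one natural choice.
        return mean(f(h, r, t) for f in score_fns)

Averaging is only one possible aggregation; other choices, such as combining ranks or learning aggregation weights, fit the same pipeline.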

Finally, you will obtain the random forest and the bagging versions of TransE, respectively TransE_rand and TransE_bagg. Given the advantages of ensemble models, we conjecture that TransE_rand and TransE_bagg will outperform the original version of TransE.
You will evaluate the methods on link prediction tasks over the standard benchmarks WN18RR [3], FB15k-237 [2], and YAGO3-10 [7].
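
Link prediction on these benchmarks is typically evaluated by ranking the true entity of each test triple against all candidate entities and reporting mean reciprocal rank (MRR) and Hits@k. A minimal sketch under raw ranking (the filtered setting of [3] would additionally drop other known true triples from the candidates); `score_fn` is any (h, r, t) -> float scorer, such as `ensemble_score` above:

    import numpy as np

    def rank_tail(score_fn, h, r, true_t, all_entities):
        # Rank the true tail among all candidates by descending score.
        ordered = sorted(all_entities, key=lambda t: score_fn(h, r, t), reverse=True)
        return ordered.index(true_t) + 1

    def mrr_and_hits(ranks, k=10):
        ranks = np.asarray(ranks, dtype=float)
        return (1.0 / ranks).mean(), float((ranks <= k).mean())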

References

[1] D. Beckett. RDF/XML Syntax Specification (Revised). W3C Recommendation, February 2004.
[2] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250, 2008.
[3] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26, 2013.
[4] L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
[5] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[6] D. Krompaß and V. Tresp. Ensemble solutions for link-prediction in knowledge graphs. In Proceedings of the 2nd Workshop on Linked Data for Knowledge Discovery, Porto, Portugal, pages 1–10, 2015.
[7] F. Mahdisoltani, J. Biega, and F. Suchanek. YAGO3: A knowledge base from multilingual Wikipedias. In 7th Biennial Conference on Innovative Data Systems Research (CIDR), 2014.
[8] Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197, 2019.
[9] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080. PMLR, 2016.
[10] Y. Wang, Y. Chen, Z. Zhang, and T. Wang. A probabilistic ensemble approach for knowledge graph embedding. Neurocomputing, 2022.
[11] Y. Wang, R. Gemulla, and H. Li. On multi-relational link prediction with bilinear models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[12] C. Xu, M. Nayyeri, S. Vahdati, and J. Lehmann. Multiple run ensemble learning with low-dimensional knowledge graph embeddings. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
