Byte pair encodings for knowledge graph embeddings

This thesis aims to use byte pair encodings in knowledge graph embeddings. We anticipate that this will improve embedding quality by reducing the dimensionality of the learning problem, drawing on the tokenization used in modern large language models.


Modern large language models are trained with subword vocabularies of around 30,000 tokens. Current knowledge graph embedding methods, in contrast, treat every entity as its own token and therefore work with vocabularies of millions of tokens. We expect that a clever use of the tokenization found in large language models will improve knowledge graph embeddings, because it reduces the dimensionality of the problem.
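To make the dimensionality argument concrete, the following sketch compares the size of an embedding lookup table for the two vocabulary sizes. The embedding dimension of 200, the entity count of 5 million, and the helper name `embedding_table_parameters` are illustrative assumptions and are not taken from any specific system.

```python
def embedding_table_parameters(vocabulary_size: int, dimensions: int) -> int:
    """Number of weights in a lookup table that stores one vector per token."""
    return vocabulary_size * dimensions


# Illustrative assumptions, not measurements from a real knowledge graph.
DIMENSIONS = 200
ENTITY_VOCABULARY = 5_000_000   # one token per entity
SUBWORD_VOCABULARY = 30_000     # byte pair encoding vocabulary

entity_level = embedding_table_parameters(ENTITY_VOCABULARY, DIMENSIONS)
subword_level = embedding_table_parameters(SUBWORD_VOCABULARY, DIMENSIONS)
reduction_factor = entity_level / subword_level

print(f"entity-level table:  {entity_level:,} weights")
print(f"subword-level table: {subword_level:,} weights")
print(f"reduction factor:    {reduction_factor:.1f}")
```

Under these assumptions the lookup table shrinks by the ratio of the two vocabulary sizes. Whether the smaller table also yields better embeddings is the question the thesis is meant to answer.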

In this thesis, you will build on a simple knowledge graph embedding method, RDF2vec (Paulheim et al. 2023), which samples sequences from a knowledge graph and learns embeddings from them. You will modify this method to use byte pair encodings and evaluate both the original and the modified method on node clustering and link prediction.
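The two building blocks can be sketched as follows: sampling walks from a graph, then splitting the walk tokens into byte pair encoded subwords. This is a minimal illustration and not the RDF2vec implementation. The toy graph `GRAPH` and the helpers `sample_walk`, `learn_bpe`, `apply_merge`, and `encode` are hypothetical names introduced only for this example.

```python
import random
from collections import Counter

# Toy graph (hypothetical data): subject -> list of (predicate, object).
GRAPH = {
    "Berlin": [("capitalOf", "Germany")],
    "Germany": [("memberOf", "EuropeanUnion")],
    "Hamburg": [("locatedIn", "Germany")],
    "EuropeanUnion": [],
}


def sample_walk(graph, start, depth, rng):
    """Random walk of at most `depth` hops, returned as a token sequence."""
    walk, node = [start], start
    for _ in range(depth):
        edges = graph.get(node, [])
        if not edges:  # dead end: stop early
            break
        predicate, node = rng.choice(edges)
        walk += [predicate, node]
    return walk


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words, num_merges):
    """Learn merges by repeatedly joining the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)  # characters -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        updated = Counter()
        for symbols, freq in corpus.items():
            updated[tuple(apply_merge(list(symbols), best))] += freq
        corpus = updated
    return merges


def encode(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


rng = random.Random(0)
walks = [sample_walk(GRAPH, node, depth=2, rng=rng) for node in GRAPH]
merges = learn_bpe([token for walk in walks for token in walk], num_merges=20)
bpe_walks = [[piece for token in walk for piece in encode(token, merges)]
             for walk in walks]
```

The subword sequences in `bpe_walks` could then be passed to a word2vec-style model in place of the entity-level walks. How the resulting subword vectors are combined back into one embedding per entity is left open here and is part of the design work of the thesis.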

We expect that the use of byte pair encodings can substantially improve the quality of knowledge graph embeddings.


  1. H. Paulheim, P. Ristoski, J. Portisch. Embedding knowledge graphs with RDF2vec. Springer, 2023.

