Ongoing Master's Thesis
Description
A knowledge graph (KG) is a structured semantic knowledge base that describes real-world entities and the relationships between them. The basic unit of a KG is the triple (head entity, relation label, tail entity). Knowledge graph embedding maps the entities and relation labels into a continuous, low-dimensional vector space while preserving the inherent relational structure of the KG. Such embeddings can benefit various downstream tasks, such as link prediction. Training of knowledge graph embeddings usually relies on random sampling, which might not be optimal. To improve the efficiency and effectiveness of knowledge graph embedding training, this thesis investigates several biased sampling regimes. In the first regime, entities and relation labels are stratified by how frequently they occur in the KG, and the high-frequency group is sampled with a higher probability. In the second, segmented regime, sampling begins with the high-frequency group and then progresses to the low-frequency group. TransE, RotatE, and DistMult serve as the baseline models, with mean reciprocal rank (MRR) and Hits@N as the evaluation criteria. The results obtained with these biased sampling regimes are compared with those of random sampling to analyze whether the new regimes improve the training efficiency and effectiveness of knowledge graph embedding methods.
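The two sampling regimes and the evaluation criteria described above could be sketched roughly as follows. This is only an illustrative sketch, not the thesis implementation: the function names (`stratify_by_frequency`, `biased_batch`, `segmented_pool`, `mrr`, `hits_at_n`), the two-group split with a fixed `threshold`, the sampling probability `p_high`, and the rule that a triple's frequency is the smallest count among its head entity, tail entity, and relation label are all assumptions made for the example. Whether the second training phase uses only the low-frequency group is likewise an assumption.

```python
import random
from collections import Counter


def stratify_by_frequency(triples, threshold):
    """Split triples into a high-frequency and a low-frequency group.

    Assumption: a triple's frequency is the smallest occurrence count among
    its head entity, tail entity, and relation label.
    """
    entity_counts = Counter()
    relation_counts = Counter()
    for head, relation, tail in triples:
        entity_counts[head] += 1
        entity_counts[tail] += 1
        relation_counts[relation] += 1

    high, low = [], []
    for head, relation, tail in triples:
        freq = min(entity_counts[head], entity_counts[tail],
                   relation_counts[relation])
        (high if freq >= threshold else low).append((head, relation, tail))
    return high, low


def biased_batch(high, low, batch_size, p_high, rng):
    """Draw a batch in which each slot comes from `high` with probability p_high."""
    batch = []
    for _ in range(batch_size):
        # Fall back to the other group if one of the groups is empty.
        use_high = bool(high) and (not low or rng.random() < p_high)
        batch.append(rng.choice(high if use_high else low))
    return batch


def segmented_pool(epoch, switch_epoch, high, low):
    """Segmented training: high-frequency group first, low-frequency group afterwards."""
    return high if epoch < switch_epoch else low


def mrr(ranks):
    """Mean reciprocal rank over 1-based ranks of the correct entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_n(ranks, n):
    """Fraction of test triples whose correct entity is ranked within the top n."""
    return sum(r <= n for r in ranks) / len(ranks)


if __name__ == "__main__":
    # Toy KG; entity and relation names are made up for illustration.
    toy = [("a", "r1", "b"), ("a", "r1", "c"), ("b", "r1", "c"), ("d", "r2", "e")]
    high_group, low_group = stratify_by_frequency(toy, threshold=2)
    rng = random.Random(0)
    print(biased_batch(high_group, low_group, batch_size=4, p_high=0.8, rng=rng))
    print(mrr([1, 2, 4]), hits_at_n([1, 2, 4], n=3))
```

Setting `p_high` to the high-frequency group's share of all triples makes each triple equally likely to be drawn, which gives a random-sampling baseline to compare against. The resulting batches would then be fed to a TransE, RotatE, or DistMult training loop.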