Extracting and Segmenting High-Variance References from PDF Documents with BERT

This thesis focuses on using neural embeddings (BERT) to extract references from PDF documents
and to segment them into bibliographic components such as author, title, and other fields.

Completed Bachelor Thesis

Research builds on the knowledge in prior academic papers, so it is important to acknowledge the scientific contributions of previous work. Doing so supports the steady development of scientific fields, and it is usually done through citations and reference sections. However, in some scientific fields citation styles and reference locations vary highly from standard practice. In these cases, automatic reference recognition can be very helpful for detecting high-variance references. In this thesis we propose to use neural embeddings for classifying lines as "reference/not reference" and for segmenting reference string candidates. We use Bidirectional Encoder Representations from Transformers (BERT) [1] to accomplish this task. BERT is a natural language model with state-of-the-art performance on a wide variety of tasks such as question answering and natural language inference. The two main reasons for choosing BERT over other models for this task are that 1) BERT is trained on a large amount of data and 2) BERT word embeddings depend on the context of each word. On top of BERT we add a Conditional Random Field (CRF) [2] as an additional layer to classify text spans into predefined classes. The effectiveness of the proposed BERT system will mainly be evaluated on the proposed datasets PGS and PES of the EXgoldstandard, which contain high-variance references from social science publications.
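
To illustrate how a CRF layer on top of per-token scores can assign labels to the parts of a reference string, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF. It is not the implementation used in the thesis. The label set, the example tokens, and all emission and transition scores are invented for illustration; in the described system the emission scores would come from a BERT encoder and the transition scores would be learned during training.

```python
from typing import Dict, List

# Hypothetical label set: the abstract names author and title;
# "other" stands in for all remaining bibliographic fields.
LABELS = ["author", "title", "other"]


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring label sequence of a linear-chain CRF.

    emissions[t][label] is the per-token score for `label` at position t.
    transitions[prev][cur] is the score for moving from `prev` to `cur`.
    """
    if not emissions:
        return []
    # best[t][label]: best score of any label path ending in `label` at t.
    best = [{lab: emissions[0][lab] for lab in LABELS}]
    back = []  # back[t-1][label]: best previous label for `label` at t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: best[t - 1][p] + transitions[p][cur])
            scores[cur] = best[t - 1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Invented example: four tokens of a reference string with made-up scores.
tokens = ["Smith", "J.", "Deep", "parsing"]
emissions = [
    {"author": 2.0, "title": 0.5, "other": 0.0},
    {"author": 0.8, "title": 0.2, "other": 0.9},  # ambiguous on its own
    {"author": 0.2, "title": 2.0, "other": 0.1},
    {"author": 0.0, "title": 1.5, "other": 0.5},
]
# Made-up transition scores: staying in a field is rewarded,
# and going back from title to author is penalized.
transitions = {
    "author": {"author": 1.0, "title": 0.5, "other": 0.0},
    "title": {"author": -2.0, "title": 1.0, "other": 0.0},
    "other": {"author": 0.0, "title": 0.0, "other": 1.0},
}

greedy = [max(LABELS, key=lambda lab: e[lab]) for e in emissions]
predicted = viterbi_decode(emissions, transitions)
print(list(zip(tokens, greedy)))
print(list(zip(tokens, predicted)))
```

With these made-up scores, choosing the best label per token independently marks "J." as "other", while the CRF decoding labels it "author" because the transition scores favor staying within a field. This is the kind of sequence-level consistency a CRF layer is typically added for.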

University Library Record
