Extracting and Segmenting High-Variance References from PDF Documents with BERT

This thesis focuses on using neural embeddings (BERT) to extract references from PDF documents
and to classify them into bibliographic components such as author, title, and other fields.

Completed Bachelor Thesis

Research builds on the knowledge of prior academic papers, so it is important to acknowledge the scientific contributions of previous work; doing so ensures a smooth evolution of scientific fields. This acknowledgement usually takes the form of citations and reference sections. In some scientific fields, however, citation styles and reference locations vary widely from standard practice, and automatic reference recognition can be very helpful for detecting such high-variance references. In this thesis we propose to use neural-based embeddings for classifying lines as "reference/not reference" and for segmenting reference string candidates. We use Bidirectional Encoder Representations from Transformers (BERT) [1] to accomplish this task. BERT is a natural language model with state-of-the-art performance on a wide variety of tasks such as question answering and natural language inference. The two main reasons for choosing BERT over other models for this task are that 1) BERT is trained on a large amount of data and 2) BERT word embeddings depend on the context of a particular word. On top of BERT we add a Conditional Random Field [2] layer to classify text spans into predefined classes. The effectiveness of the proposed BERT system is evaluated mainly on the proposed PGS and PES datasets of the EXgoldstandard, which contain high-variance references from social science publications.
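To illustrate the role of the CRF layer described above, the following is a minimal sketch of Viterbi decoding over per-token scores. It is a hypothetical illustration, not the thesis implementation: the `emissions` matrix stands in for the scores a fine-tuned BERT would produce for each token, the `transitions` matrix and the toy tag set `["author", "title", "other"]` are invented for the example, and the scores are made up.

```python
import numpy as np

# Toy tag set for segmenting a reference string (assumption for this sketch;
# the thesis uses its own set of bibliographic classes).
TAGS = ["author", "title", "other"]

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence as a list of tag indices.

    emissions:   (n_tokens, n_tags) per-token tag scores (BERT stand-in).
    transitions: (n_tags, n_tags) score of moving from tag i to tag j (CRF).
    """
    n_tokens, _ = emissions.shape
    score = emissions[0].copy()          # best score of a path ending in each tag
    backpointers = []
    for t in range(1, n_tokens):
        # total[i, j] = best path ending in tag i, then moving to tag j at step t
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    # Follow back-pointers from the best final tag to recover the sequence
    best = [int(score.argmax())]
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    return best[::-1]

# Made-up scores for a four-token reference string candidate
emissions = np.array([
    [2.0, 0.1, 0.1],   # token 0: author-like
    [1.5, 0.2, 0.1],   # token 1: author-like
    [0.1, 2.2, 0.1],   # token 2: title-like
    [0.1, 1.8, 0.2],   # token 3: title-like
])
transitions = np.array([
    [1.0, 0.5, 0.0],   # author -> author favoured
    [0.0, 1.0, 0.3],   # title -> title favoured
    [0.2, 0.2, 0.5],
])
path = viterbi_decode(emissions, transitions)
print([TAGS[i] for i in path])  # ['author', 'author', 'title', 'title']
```

The point of the CRF on top of per-token scores is exactly what the transition matrix encodes: labels of neighbouring tokens are not independent, so the decoder prefers coherent spans (e.g. consecutive author tokens) over token-by-token argmax decisions.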



