Optical Character Recognition on the Web

Running Master's Thesis

Content on the Web is still dominated by text. Information retrieval and accessibility tools need access to this vast amount of text. However, text on the Web can be implemented in various ways: it can be stored in the HTML document, embedded in images and videos, or rendered on the fly inside a canvas using JavaScript.

This variety makes it difficult to gather text from the Web. In this thesis, text will be extracted from renderings of Web pages. A rendering is passed to a to-be-developed algorithm, which returns the text fragments and their bounding boxes in that rendering. To this end, state-of-the-art optical character recognition approaches should be surveyed, and text rendering on modern computers must be understood. Then, a custom text recognition algorithm based on deep learning must be implemented and compared with other approaches such as Tesseract [1] or Windows OCR [2]. The goal of the algorithm is to detect English text (i) accurately and (ii) quickly. A benchmarking framework [3] can be used to evaluate the custom algorithm.
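To make the expected output and the accuracy goal more concrete, the following Python sketch shows one possible representation of a text fragment with its bounding box, together with two common scoring functions: intersection over union for the boxes and character error rate for the recognized text. All names here (`TextFragment`, `iou`, `edit_distance`, `character_error_rate`) are hypothetical and chosen for illustration only; the thesis does not prescribe this interface, and the benchmarking framework [3] may use different metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextFragment:
    """One recognized piece of text and its box on the rendering, in pixels.

    Hypothetical structure for illustration; not defined by the thesis topic.
    """
    text: str
    x: int        # left edge
    y: int        # top edge
    width: int
    height: int


def iou(a: TextFragment, b: TextFragment) -> float:
    """Intersection over union of two axis-aligned bounding boxes."""
    left = max(a.x, b.x)
    top = max(a.y, b.y)
    right = min(a.x + a.width, b.x + b.width)
    bottom = min(a.y + a.height, b.y + b.height)
    inter = max(0, right - left) * max(0, bottom - top)
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union > 0 else 0.0


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(predicted: str, reference: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        return 0.0 if not predicted else 1.0
    return edit_distance(predicted, reference) / len(reference)


if __name__ == "__main__":
    truth = TextFragment("Hello Web", x=10, y=10, width=100, height=20)
    guess = TextFragment("Hallo Web", x=10, y=10, width=100, height=10)
    print(iou(truth, guess))
    print(character_error_rate(guess.text, truth.text))
```

The second goal, speed, could be measured in the same spirit by timing the recognition call per rendering, for example with `time.perf_counter` from the Python standard library.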

Applicants should be proficient in programming (C++ or Python) and have basic knowledge of machine learning or deep learning approaches. If you are interested in this topic, please contact Raphael Menges.

[1] R. Smith, "An Overview of the Tesseract OCR Engine," Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Brazil, 2007, pp. 629-633, doi: 10.1109/ICDAR.2007.4376991.

[2] https://blogs.windows.com/windowsdeveloper/2016/02/08/optical-character-recognition-ocr-for-windows-10

[3] https://github.com/Drizzy3D/CGRE

Further resources to be considered:

- Jiangying Zhou, Daniel P. Lopresti, Zhibin Lei: OCR for World Wide Web images. Document Recognition 1997: 58-66

- Einsele, F., Hennebert, J., & Ingold, R. (2007). Towards identification of very low resolution, anti-aliased characters. 2007 9th International Symposium on Signal Processing and Its Applications, 1-4.

- Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2016. Reading Text in the Wild with Convolutional Neural Networks. Int. J. Comput. Vision 116, 1 (January 2016), 1–20. DOI:https://doi.org/10.1007/s11263-015-0823-z

- T. Chattopadhyay, R. Jain and B. B. Chaudhuri, "A novel low complexity TV video OCR system," Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan, 2012, pp. 665-668.

- Haojin Yang, Cheng Wang, Christian Bartz, and Christoph Meinel. 2016. SceneTextReg: A Real-Time Video OCR System. In Proceedings of the 24th ACM international conference on Multimedia (MM '16). Association for Computing Machinery, New York, NY, USA, 698–700. DOI:https://doi.org/10.1145/2964284.2973811

- Abdulkarim Khormi, Mohammad Alahmadi, and Sonia Haiduc. 2020. A Study on the Accuracy of OCR Engines for Source Code Transcription from Programming Screencasts. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR '20). Association for Computing Machinery, New York, NY, USA, 65–75. DOI:https://doi.org/10.1145/3379597.3387468