Semantic Recognition on Table Images from Visually Rich Documents
Loading...
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Université d'Ottawa / University of Ottawa
Abstract
Visually rich documents have been widely used in many scenarios because of their user-friendliness for human readers. However, with the surging number of these documents, it has been far beyond the capacity of humans to manage, extract critical information and mine useful knowledge from these documents efficiently, making it necessary to develop tools to manage and interpret these documents automatically. Typically, a document can contain different types of components, such as Regular Text areas, Tables, and Figures, and these components usually require different processing methods. Since tables are usually used to summarize vital information, and their two dimensions and complex structures make the semantic recognition challenging, this thesis focuses on the semantic recognition task on tables. With the development of Large Language Models (LLMs), applying LLMs to semantic recognition tasks has become a popular choice, as many studies have demonstrated the remarkable capacities of LLMs for semantic recognition tasks. Considering that even though multi-model LLMs can process images directly, their performance on semantic recognition tasks with images containing dense text is still far behind text-only LLMs, this thesis proposes to apply text-only LLMs on semantic recognition task on the table images from visually rich documents. Since the tables from visually rich documents are usually images, this thesis introduces a complete solution to fill the modality gap between table images and text-only LLMs, including Table Detection (TD) and Table Structures Recognition (TSR) models, and further use Table Question Answering (Table-QA) problem as an example of semantic recognition tasks. Comprehensive experiments are conducted on various datasets for models in the TD, TSR and Table-QA, and the experimental results demonstrate the superiority of the proposed solution.
Description
Keywords
Table Detection, Table Structure Recognition, Table Question Answering, Large Language Model
