QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION
Nugumanova A.B., Apayev K.S., Baiburin Y.M., Mansurova M., Ospan A.
24 June 2022
al-Farabi Kazakh State National University
KazNU Bulletin. Mathematics, Mechanics, Computer Science Series
2022, #114, Issue 2, pp. 91-100
This paper proposes a pipeline for automatically extracting tables from heterogeneous Web sources, such as HTML pages, PDF files and images. Table extraction is an actively developing area of Information Extraction, for which many applications, libraries and frameworks are currently being developed. Nevertheless, most of these tools focus on a single specific task, for example, recognizing tables presented as images. We propose to combine these tasks into a single pipeline that supports the full cycle of table processing, from search, recognition and extraction to semantic analysis and interpretation, that is, table understanding. The ultimate goal of our design is to understand tables and to populate knowledge bases (knowledge graphs) with the meaningful information they contain. The first part of the work presents methods for detecting tables on web pages and in PDF documents, as well as methods for automatically detecting attributes and values of objects. The second part presents the conceptual architecture of the Qurma system, designed to extract tables from heterogeneous sources on the Internet. The results section provides an example of a parser that determines the type of the input resource and passes control to one of the table lookup modules. Next, the main column is determined and the entities it contains are linked to the corresponding categories in an external knowledge base.
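The two steps named at the end of the abstract, routing an input resource to a table lookup module and choosing the main column, can be illustrated with a minimal Python sketch. It is not the Qurma implementation. The extension mapping, the function names `detect_resource_type`, `dispatch` and `guess_main_column`, and the heuristic itself (the leftmost column with the most distinct non-numeric cells) are all assumptions made for illustration only.

```python
from urllib.parse import urlparse

# Hypothetical mapping from file extension to resource type (an assumption,
# not taken from the paper).
EXTENSION_TYPES = {
    ".pdf": "pdf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".html": "html",
    ".htm": "html",
}


def detect_resource_type(url: str) -> str:
    """Guess the resource type from the URL path; fall back to HTML."""
    path = urlparse(url).path.lower()
    for extension, resource_type in EXTENSION_TYPES.items():
        if path.endswith(extension):
            return resource_type
    return "html"


def dispatch(url: str, handlers: dict):
    """Pass control to the handler registered for the detected resource type."""
    return handlers[detect_resource_type(url)](url)


def is_numeric(cell: str) -> bool:
    """Return True if the cell parses as a number once commas are removed."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def guess_main_column(rows: list) -> int:
    """Assumed heuristic: the leftmost column with the most distinct,
    non-empty, non-numeric cells is treated as the main (entity) column."""
    best_index, best_score = 0, -1
    for index in range(len(rows[0])):
        distinct_cells = {row[index].strip() for row in rows}
        score = sum(1 for cell in distinct_cells if cell and not is_numeric(cell))
        if score > best_score:  # strict comparison keeps the leftmost on ties
            best_index, best_score = index, score
    return best_index
```

In this sketch, `dispatch` would be called with a dictionary that maps `"html"`, `"pdf"` and `"image"` to the corresponding table lookup functions. The cells of the column returned by `guess_main_column` would then be the candidates for linking to categories in the external knowledge base.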
knowledge base population, table extraction, table recognition, table understanding, web tables
Sarsen Amanzholov East Kazakhstan University, Ust-Kamenogorsk, Kazakhstan
D. Serikbayev East Kazakhstan Technical University, Ust-Kamenogorsk, Kazakhstan
al-Farabi Kazakh National University, Almaty, Kazakhstan