Information Extraction from Multi-Domain Scientific Documents: Methods and Insights
Batura T., Yerimbetova A., Mukazhanov N., Shvarts N., Sakenov B., Turdalyuly M.
August 2025, Multidisciplinary Digital Publishing Institute (MDPI)
Applied Sciences (Switzerland)
2025, Volume 15, Issue 16
The rapid growth of scientific literature necessitates effective information extraction. However, existing methods face significant challenges, particularly when applied to multi-domain documents and low-resource languages. For Kazakh and Russian, there is a notable lack of annotated corpora and dedicated tools for scientific information extraction. To address this gap, we introduce SciMDIX (Scientific Multi-Domain Information Extraction), a novel multi-domain dataset of scientific documents in Russian and Kazakh, annotated with entities and relations. Our study includes a comprehensive evaluation of entity recognition performance, comparing state-of-the-art models such as BERT, LLaMA, GLiNER, and spaCy across four diverse domains (IT, Linguistics, Medicine, and Psychology) in both languages. The findings highlight the promise of spaCy and GLiNER for practical deployment in under-resourced language settings. Furthermore, we propose a new zero-shot relation extraction model that leverages a multimodal representation by integrating sentence context, entity mentions, and textual definitions of relation classes. Our model can predict semantic relations between entities in new documents, even for a language not encountered during training. This capability is especially valuable for low-resource language scenarios.
information extraction, language models, named entity recognition, natural language processing, relation extraction, term extraction
Institute of Information and Computational Technologies, Almaty, 050010, Kazakhstan
A.P. Ershov Institute of Informatics Systems, Novosibirsk, 630090, Russian Federation
Global Education and Training, University of Illinois Urbana-Champaign, Champaign, 61801, IL, United States
Department of Software Engineering, Satbayev University, Almaty, 050013, Kazakhstan