NMF-based approach to automatic term extraction

Nugumanova A., Akhmed-Zaki D., Mansurova M., Baiburin Y., Maulit A.
1 August 2022 Elsevier Ltd

Expert Systems with Applications
2022 #199

This work describes an automatic term extraction approach based on a combination of probabilistic topic modeling (PTM) and non-negative matrix factorization (NMF). Topic modeling algorithms, including NMF-based ones, do not require expensive and time-consuming manual annotation of domain terms; they need only a corpus of domain documents. The topics emerge from the corpus documents without any supervision as sets of most probable words. This work aims to investigate how fully and precisely these most probable words can reflect domain terminology. We run a series of experiments on the novel, qualitatively annotated dataset ACTER, which was first used in the TermEval 2020 Shared Task. We compare five NMF algorithms and four NMF initializations while varying the number of topics extracted from the documents and the number of most probable words extracted from each topic, in order to determine the combinations that give the best term extraction performance. Finally, we compare the best NMF combinations with the competing methods in TermEval 2020 and show that our approach is second only to two much more sophisticated, domain-dependent supervised methods.
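The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain Lee-Seung multiplicative-update NMF with random initialization (only one of many possible algorithm and initialization choices), and the vocabulary, count matrix, and gold term set are toy values invented for illustration, not ACTER data. The function names `nmf`, `top_words`, and `precision_recall_f1` are likewise hypothetical. The sketch factorizes a document-word matrix, takes the highest-weighted words of each topic as candidate terms, and scores them against the gold set by precision, recall, and F1.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, the reconstruction error."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (docs x words) into non-negative W (docs x topics) and
    H (topics x words) with Lee-Seung multiplicative updates (assumed variant)."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_words)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_words)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def top_words(h, vocab, n_top):
    """Union of the n_top highest-weighted words of every topic."""
    candidates = set()
    for topic in h:
        ranked = sorted(range(len(vocab)), key=lambda j: topic[j], reverse=True)
        candidates.update(vocab[j] for j in ranked[:n_top])
    return candidates


def precision_recall_f1(candidates, gold):
    """Score extracted candidates against a gold term set."""
    tp = len(candidates & gold)
    precision = tp / len(candidates) if candidates else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (invented): rows are documents, columns are word counts.
vocab = ["turbine", "rotor", "wind", "heart", "cardiac", "failure"]
v = [
    [3, 2, 4, 0, 0, 0],
    [2, 3, 3, 0, 0, 1],
    [0, 0, 0, 4, 3, 2],
    [0, 0, 1, 3, 4, 3],
]
gold = {"turbine", "rotor", "heart", "cardiac"}  # hypothetical gold terms

w, h = nmf(v, n_topics=2)
candidates = top_words(h, vocab, n_top=2)
print(sorted(candidates), precision_recall_f1(candidates, gold))
```

Varying `n_topics` and `n_top` in such a loop mirrors, in miniature, the kind of parameter sweep the abstract describes; a real experiment would more likely use a library NMF implementation and a weighted document-term matrix.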

ACTER dataset, Automatic term extraction, NMF, Probabilistic topic modeling, TermEval shared task, Unsupervised term extraction


Sarsen Amanzholov East Kazakhstan University, Ust-Kamenogorsk, Kazakhstan
Astana IT University, Nur-Sultan, Kazakhstan
Al-Farabi Kazakh National University, Almaty, Kazakhstan

