TopicBank: Collection of coherent topics using multiple model training with their further use for topic model validation


Alekseev V., Egorov E., Vorontsov K., Goncharov A., Nurumov K., Buldybayev T.
September 2021, Elsevier B.V.

Data and Knowledge Engineering
2021, vol. 135

Probabilistic topic modeling is a tool for unsupervised learning of the inherent thematic structure of a text collection. Given only the text of documents as input, a topic model aims to reveal latent topics as probability distributions over words. Topic models have two shortcomings: they are unstable, in the sense that the topics may depend on the random initialization, and incomplete, in the sense that each new run of the model on the same collection may discover some new topics. As a result, data exploration with topic modeling usually requires many experiments: examining many topic models and tuning their parameters in search of the model that describes the data best. To deal with the instability and incompleteness of topic models, we propose to gradually accumulate interpretable topics in a “topic bank” using multiple model training. To add topics to the bank, we learn a child level in a hierarchical topic model, then analyze the coherence of the child subtopics and their relationships with parent bank topics, so that irrelevant and duplicate subtopics are excluded rather than added to the bank. We then introduce a new approach to topic model evaluation: comparing the topics found by a model with those collected beforehand in the bank. Our experiments with several datasets and topic models show that the proposed method does help in finding a model with more interpretable topics.
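
The accumulate-then-evaluate idea in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract relates candidate subtopics to bank topics through a child level of a hierarchical topic model, whereas the sketch below substitutes a plain Hellinger-distance threshold. The function names (`update_bank`, `bank_score`), the thresholds, and the assumption that coherence scores are supplied from outside are all hypothetical choices made for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two word distributions over the same vocabulary."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def update_bank(bank, candidates, coherences, min_coherence, min_distance):
    """Add candidate topics that are coherent and not near-duplicates of bank topics.

    Assumption: a simple distance threshold stands in for the hierarchical
    parent/child analysis described in the abstract.
    """
    for topic, coherence in zip(candidates, coherences):
        if coherence < min_coherence:
            continue  # skip topics that are not coherent enough
        if any(hellinger(topic, kept) < min_distance for kept in bank):
            continue  # skip near-duplicates of topics already in the bank
        bank.append(topic)
    return bank


def bank_score(bank, model_topics, max_distance):
    """Fraction of bank topics that have a close match among a model's topics."""
    if not bank:
        return 0.0
    matched = sum(
        1 for b in bank if any(hellinger(b, t) <= max_distance for t in model_topics)
    )
    return matched / len(bank)


# Toy usage: two training runs over a three-word vocabulary (made-up numbers).
bank = []
run_1 = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
update_bank(bank, run_1, coherences=[0.9, 0.8], min_coherence=0.5, min_distance=0.2)
run_2 = [[0.68, 0.22, 0.10], [0.2, 0.6, 0.2], [0.34, 0.33, 0.33]]
update_bank(bank, run_2, coherences=[0.9, 0.7, 0.1], min_coherence=0.5, min_distance=0.2)

# Score a new model that recovered only one topic.
score = bank_score(bank, [[0.7, 0.2, 0.1]], max_distance=0.2)
```

In this toy run the second training pass contributes only one new topic: its first candidate nearly repeats a topic already in the bank, and its last candidate falls below the coherence threshold. A model is then scored by how many of the banked topics it recovers.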

Multiple model training, Regularization, Stability, Topic coherence, Topic modeling


Moscow Institute of Physics and Technology (National Research University), 9 Institutskiy per., Dolgoprudny, Russian Federation
Information-Analytical Center, JSC, Nur-Sultan, Kazakhstan

