Greedy Texts Similarity Mapping


Jangabylova A. Krassovitskiy A. Mussabayev R. Ualiyeva I.
8 November 2022, MDPI

Computation
2022, Volume 10, Issue 11

Document similarity metrics are an important tool in tasks such as determining the topic of documents, plagiarism detection, and other problems that require capturing the semantic, syntactic, or structural similarity of texts. The results of a similarity measure depend on the type of word representation and on the problem statement, and computing them can be time-consuming. In this paper, we present greedy texts similarity mapping (GTSM), a problem-independent similarity metric algorithm that is computationally efficient enough to be applied to large datasets with any preferred word vectorization model. GTSM maps words in two texts based on a decision rule that evaluates word similarity and the words' importance to the texts. We compare it with the well-known word mover's distance (WMD) algorithm on the k-nearest neighbors text classification problem and find that it leads to similar or better results. In the task of evaluating the correlation of similarity measures with human-judged scores, GTSM achieves higher correlation scores than WMD and sentence mover's similarity (SMS), showing that it is a decent alternative for both word-level and sentence-level tasks.
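The abstract describes GTSM only at a high level, so the following is a minimal illustrative sketch of the general idea of greedy word mapping, not the authors' algorithm. Everything specific in it is an assumption made for illustration: the function names `cosine` and `greedy_text_similarity`, the decision rule (cosine similarity multiplied by the smaller of the two words' importance weights), the one-to-one matching constraint, and the normalization by the larger total importance. The word vectors and importance weights are assumed to come from any embedding model and weighting scheme the reader prefers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_text_similarity(words_a, words_b, vectors, importance=None):
    """Hypothetical greedy mapping score between two tokenized texts.

    vectors:    dict mapping each word to its embedding (list of floats).
    importance: optional dict mapping words to a weight (e.g. TF-IDF);
                words missing from it default to 1.0.
    """
    importance = importance or {}

    def weight(word):
        return importance.get(word, 1.0)

    # Assumed decision rule: score each candidate pair by similarity
    # times the smaller of the two importance weights.
    candidates = []
    for i, wa in enumerate(words_a):
        for j, wb in enumerate(words_b):
            sim = cosine(vectors[wa], vectors[wb])
            candidates.append((sim * min(weight(wa), weight(wb)), i, j))
    candidates.sort(key=lambda c: c[0], reverse=True)

    # Greedy pass: take the best remaining pair, using each word at most once.
    used_a, used_b = set(), set()
    matched = 0.0
    for score, i, j in candidates:
        if score <= 0.0:
            break  # sorted descending, so no useful pairs remain
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        matched += score

    # Normalize by the larger total importance so unmatched words lower the score.
    total_a = sum(weight(w) for w in words_a)
    total_b = sum(weight(w) for w in words_b)
    denom = max(total_a, total_b)
    return matched / denom if denom > 0 else 0.0
```

This sketch scores every word pair and sorts the candidates once, so for texts of n and m words its cost is roughly O(nm log(nm)). The toy two-dimensional vectors that one might pass in (for example `{"cat": [1.0, 0.0], "dog": [0.8, 0.6]}`) are placeholders; in practice the vectors would come from a trained word embedding model.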

k-nearest neighbors, text classification, text similarity, word embedding, word mover distance


Institute of Information and Computational Technologies, Pushkin Str., 125, Almaty, 050010, Kazakhstan
Faculty of Information Technology, Al-Farabi Kazakh National University, 71 Al-Farabi Ave., Almaty, 050040, Kazakhstan


