Text Similarity Detection in Agglutinative Languages: A Case Study of Kazakh Using Hybrid N-Gram and Semantic Models

Biloshchytska S., Tleubayeva A., Kuchanskyi O., Biloshchytskyi A., Andrashko Y., Toxanov S., Mukhatayev A., Sharipova S.
June 2025 Multidisciplinary Digital Publishing Institute (MDPI)

Applied Sciences (Switzerland)
2025, Volume 15, Issue 12

This study presents a hybrid approach for detecting near-duplicate texts in the Kazakh language, addressing the specific challenges posed by its agglutinative morphology. The proposed method combines statistical and semantic techniques, including N-gram analysis, TF-IDF, LSH, LSA, and LDA, and is benchmarked against the bert-base-multilingual-cased model. Experiments were conducted on the purpose-built Arailym-aitu/KazakhTextDuplicates corpus, which contains over 25,000 text fragments manually modified using typical techniques such as paraphrasing, word order changes, synonym substitution, and morphological transformations. The results show that the hybrid model achieves a precision of 1.00, a recall of 0.73, and an F1-score of 0.84, significantly outperforming traditional N-gram and TF-IDF approaches and reaching accuracy comparable to that of the BERT model while requiring substantially lower computational resources. The hybrid model proved highly effective in detecting various types of near-duplicate texts, including paraphrased and structurally modified content, making it suitable for practical applications in academic integrity verification, plagiarism detection, and intelligent text analysis. Moreover, this study highlights the potential of lightweight hybrid architectures as a practical alternative to large transformer-based models, particularly for languages with limited annotated corpora and linguistic resources. It lays the foundation for future research in cross-lingual duplicate detection and deep model adaptation for the Kazakh language.
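To make the idea of combining lexical and weighted-term signals concrete, the following is a minimal Python sketch, not the authors' implementation. It blends character trigram Jaccard overlap with TF-IDF cosine similarity and omits the LSH, LSA, and LDA components named in the abstract. All function names, the trigram size, the whitespace tokenizer, the smoothed IDF formula, and the `alpha` weight are assumptions chosen for illustration; the Kazakh sentences in the usage note are illustrative and are not drawn from the corpus. The `f1` helper only restates the standard harmonic-mean formula, which is consistent with the reported precision, recall, and F1-score.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) using a smoothed IDF (an assumption)."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_score(a, b, corpus=(), alpha=0.5):
    """Weighted blend of trigram Jaccard and TF-IDF cosine (hypothetical)."""
    vectors = tfidf_vectors(list(corpus) + [a, b])
    lexical = jaccard(char_ngrams(a), char_ngrams(b))
    weighted = cosine(vectors[-2], vectors[-1])
    return alpha * lexical + (1 - alpha) * weighted


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

As a usage example, `hybrid_score("Мен кітап оқыдым", "Мен кітапты оқыдым")` compares a sentence with a variant that differs only by a case suffix, and it scores higher than a comparison against an unrelated sentence such as "Бүгін ауа райы жақсы". Character n-grams are used here because a suffix change leaves most of a word's trigrams intact, whereas whole-word matching treats the two forms as different tokens.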

academic integrity, anti-plagiarism, combined models, intelligent analysis system, Kazakh language, near duplicates, semantic analysis, text data analysis


Department of Computational and Data Science, Astana IT University, Astana, 010000, Kazakhstan
Department of Information Technology, Kyiv National University of Construction and Architecture, Kyiv, 03037, Ukraine
Department of Computer Engineering, Astana IT University, Astana, 010000, Kazakhstan
Department of Information Control Systems and Technologies, Uzhhorod National University, Uzhhorod, 88000, Ukraine
Department of Biomedical Cybernetics, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, 03056, Ukraine
Department of System Analysis and Optimization Theory, Uzhhorod National University, Uzhhorod, 88000, Ukraine
Department of Administration, Astana IT University, Astana, 010000, Kazakhstan
Department of General Education Disciplines, Astana IT University, Astana, 010000, Kazakhstan
Center of Competence and Excellence, Astana IT University, Astana, 010000, Kazakhstan

