Morphology-Aware Segmentation and Tokenization for Turkic Languages: A CSE-Guided Framework (The Kazakh Case)


Tukeyev U., Rysbek B.
February 2026, Multidisciplinary Digital Publishing Institute (MDPI)

Information (Switzerland)
2026, Volume 17, Issue 2

This paper addresses the main challenge of resource-poor languages, namely the lack of sufficiently large and linguistically informed datasets for training neural models, by developing a dataset generation technology based on the Complete Set of Endings (CSE) morphological model for Turkic languages. Building on this technology, we propose a CSE-Guided Framework for morphology-aware statistical tokenization and neural model segmentation, with Kazakh as a case study. Applying the CSE-guided approach to adapt well-known tokenizers for Kazakh reduces neural model training time by up to approximately 33% in our experimental setting, primarily because the tokenized sentences are shorter. In addition, we extend the state-of-the-art FEMSeg-CRF architecture by incorporating Kazakh vowel–consonant harmony rules at the embedding generation stage. Within the proposed framework, training on a corpus of CSE-generated wordforms yields the FEMSeg_kaz_v2 model, which is evaluated using intrinsic segmentation metrics. Training on a CSE-segmented sentence corpus yields FEMSeg_kaz_v3, which is further assessed using intrinsic, extrinsic, and external evaluation on a manually prepared gold-standard dataset. Overall, the paper presents a CSE-guided framework for morphology-aware tokenization and segmentation for Turkic languages, supported by corpus construction, model extensions, and multi-level evaluation. The framework can potentially be adapted to other Turkic languages.

complete set of endings (CSE) model, extrinsic evaluation, intrinsic evaluation, Kazakh case, morphology-aware segmentation and tokenization, Turkic languages


Faculty of Information Technology and Artificial Intelligence, Al-Farabi Kazakh National University, Almaty, 050040, Kazakhstan

