When Data Is Scarce: Training a Kazakh Speech Language Model from Discrete Units


Kairatuly B., Mansurova M.
February 2026, Multidisciplinary Digital Publishing Institute (MDPI)

Applied Sciences (Switzerland)
2026, Volume 16, Issue 4

This study develops a decoder-only speech language model (SLM) for Kazakh, a low-resource language. Our approach relies on discrete acoustic units derived from self-supervised speech representations. Specifically, we use a pretrained Wav2Vec 2.0 model to extract continuous latent features, which are then quantized into discrete semantic tokens by k-means clustering. These tokens are used to train a generative model that maximizes the likelihood of speech-unit sequences. To support the study, we built a Kazakh speech corpus by combining and cleaning several publicly available audio datasets. Working within constrained hardware resources, we carried out large-scale feature extraction and tokenization to train the unit-based model. We evaluated the model using negative log-likelihood and perplexity on independent test sets. The model captures Kazakh vowel harmony but struggles with long-range agglutinative chains. Key observations include the model's high sensitivity to data quality, tokenization technique, and training hyperparameters. Although limited in data volume and training time relative to global benchmarks, the model captures underlying structural patterns in Kazakh speech. This work establishes an empirical baseline and points to future improvements through refined unit discovery and integrated speech-text modeling.
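The two computational steps named in the abstract, quantizing continuous features into discrete units and scoring a model by perplexity, can be illustrated with a minimal sketch. It is not the authors' implementation: the frames and centroids below are toy two-dimensional values standing in for Wav2Vec 2.0 features and learned k-means centroids, the function names `assign_units` and `perplexity` are hypothetical, and the number of clusters used in the paper is not stated here.

```python
import math


def assign_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    This is the assignment step of k-means (squared Euclidean distance);
    the centroids are assumed to have been learned beforehand.
    """
    units = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(distances.index(min(distances)))
    return units


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    Expects natural-log probabilities that a model assigned to each token.
    """
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Toy stand-ins for continuous speech features and learned centroids.
frames = [[0.1, 0.2], [0.9, 1.1], [0.0, -0.1], [1.2, 0.8]]
centroids = [[0.0, 0.0], [1.0, 1.0]]

units = assign_units(frames, centroids)

# A model that guesses uniformly over the unit vocabulary has a
# perplexity equal to the vocabulary size.
uniform_log_probs = [math.log(1.0 / len(centroids))] * len(units)
ppl = perplexity(uniform_log_probs)
```

Under these assumptions, a trained unit language model is useful to the extent that its perplexity on held-out unit sequences falls below the uniform baseline, which equals the number of clusters.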

speech language modeling, unit-based language models, Wav2Vec 2.0


Faculty of Information Technology and Artificial Intelligence, Al-Farabi Kazakh National University, Almaty, 050040, Kazakhstan


