The Development of Small-Scale Language Models for Low-Resource Languages, with a Focus on Kazakh and Direct Preference Optimization

Kadyrbek N., Tuimebayev Z., Mansurova M., Viegas V.
May 2025 Multidisciplinary Digital Publishing Institute (MDPI)

Big Data and Cognitive Computing
2025, Volume 9, Issue 5

Low-resource languages remain underserved by contemporary large language models (LLMs) because they lack sizable corpora, bespoke preprocessing tools, and the computing budgets assumed by mainstream alignment pipelines. Focusing on Kazakh, we present a 1.94B-parameter LLaMA-based model that demonstrates how strong, culturally aligned performance can be achieved without massive infrastructure. The contribution is threefold. (i) Data and tokenization: we compile a rigorously cleaned, mixed-domain Kazakh corpus and design a tokenizer that respects the language's agglutinative morphology, mixed-script usage, and diacritics. (ii) Training recipe: the model is built in two stages, causal language modeling from scratch followed by instruction tuning. Alignment is then refined with Direct Preference Optimization (DPO), extended by contrastive and entropy-based regularization to stabilize training under sparse, noisy preference signals. Two complementary resources support this step: ChatTune-DPO, a crowd-sourced set of human preference pairs, and Pseudo-DPO, an automatically generated alternative that repurposes instruction data to reduce annotation cost. (iii) Evaluation and impact: qualitative and task-specific assessments show that targeted monolingual training and the proposed DPO variant markedly improve factuality, coherence, and cultural fidelity over instruction-only baselines and multilingual counterparts. The model and datasets are released under open licenses, offering a reproducible blueprint for extending state-of-the-art language modeling to other under-represented languages and domains.
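
As a point of reference for the alignment step, the standard DPO objective for a single preference pair can be sketched as below. This is only the baseline loss as commonly formulated; the contrastive and entropy-based regularization terms mentioned in the abstract are not specified here and are not shown. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence-level log-probabilities of each response under the
    trainable policy and under a frozen reference model (hypothetical names).
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2; it falls as the policy shifts probability toward the chosen response and rises when it shifts toward the rejected one.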

AI for Kazakh, causal language modeling, DPO, fine-tuning, Kazakh language model, LLaMA, low-resource languages, natural language processing (NLP), RL


Department of AI & Big Data, Faculty of Information Technologies, Al-Farabi Kazakh National University, Almaty, 050040, Kazakhstan
Al-Farabi Kazakh National University, Almaty, 050040, Kazakhstan
Instituto de Telecomunicações, Universidade de Aveiro, Lisbon, 1049-001, Portugal

