An Integrated Approach to Adapting Open-Source AI Models for Machine Translation of Low-Resource Turkic Languages
Tukeyev U., Shormakova A., Karibayeva A., Rakhimova D., Abduali B., Amirova D., Rakhmanberdi N., Aliyev R.
February 2026, Multidisciplinary Digital Publishing Institute (MDPI)
Computers
2026, Volume 15, Issue 2
This study applies free, open-source artificial intelligence (AI) techniques to advance machine translation for low-resource Turkic languages such as Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek. The translation task is part of a project to generate meeting minutes from speech transcripts. Because parallel corpora are limited and linguistic tools for these languages are underdeveloped, traditional machine translation approaches often underperform. The goal is to reduce digital inequality for these languages and to support scalability. We investigate the effectiveness of free, open-source, pre-trained specialized and general-purpose AI models for morphologically rich Turkic state languages. The research covers the development of parallel corpora for six Turkic languages, fine-tuning, and performance evaluation using the BLEU, WER, TER, and chrF metrics. Parallel corpora of 300,000 and 500,000 sentences were generated and cleaned for each of five language pairs. For the 500,000-sentence corpora, cleaning and fine-tuning yield significant average improvements over the baseline NLLB-200 1.3B model: BLEU increased by 23.81 points, chrF increased by 26.05 points, and WER and TER decreased by 0.36 and 33.95, respectively. Multilingual parallel corpora for the six Turkic languages, comprising 3,885,542 sentences, were also developed. Compared with the results for the cleaned 500,000-sentence corpora, fine-tuning NLLB-200 1.3B on them increased BLEU by 4.3 points and chrF by 1.7 points, and decreased WER and TER by 0.1 and 4.75, respectively. These results demonstrate that corpus cleaning and synthetic data generation are highly effective at improving machine translation quality for low-resource Turkic languages using AI models. The results were confirmed by external evaluation on the FLORES-200 dataset and by human evaluation.
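To make one of the reported metrics concrete, the following is a minimal sketch of word error rate (WER) as a ratio: the word-level edit distance between a hypothesis and a reference, divided by the reference length. The function name `word_error_rate` and whitespace tokenization are assumptions made for illustration; the abstract does not state which tooling or tokenization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: whitespace tokenization is an assumption, not
    necessarily the setup used in the study.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition a lower value is better, and a perfect match gives 0.0, which is consistent with the abstract reporting WER as a decrease.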
The scientific contribution of this article is threefold: (1) a methodology for generating parallel corpora using a specialized machine translation AI model and for fine-tuning that model on the created corpora; (2) new multilingual parallel corpora for the Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkish–Kazakh, Turkmen–Kazakh, and Uzbek–Kazakh pairs, created with the proposed methodology and then cleaned; and (3) fine-tuning experiments on these corpora.
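The cleaning step is not specified in this abstract. As a rough sketch of what filtering a synthetic parallel corpus can involve, the example below applies a few common heuristics: whitespace normalization, dropping empty or untranslated pairs, a length-ratio check, and deduplication. The function `clean_parallel_corpus`, the `max_length_ratio` threshold, and the choice of heuristics are all hypothetical and are not taken from the authors' pipeline.

```python
def clean_parallel_corpus(pairs, max_length_ratio=3.0):
    """Filter (source, target) sentence pairs with simple heuristics.

    Hypothetical rules for illustration; the article's actual cleaning
    criteria are not described in the abstract.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Collapse repeated whitespace and trim the ends.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())

        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the target is a verbatim copy of the source.
        if src == tgt:
            continue
        # Drop pairs whose word counts differ too much (assumed threshold).
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_length_ratio:
            continue
        # Drop exact duplicates of pairs already kept.
        if (src, tgt) in seen:
            continue

        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

A call such as `clean_parallel_corpus([("Salam  dünya", "Сәлем әлем"), ("", "x")])` keeps only the first pair, with its whitespace normalized.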
AI-driven solutions, artificial intelligence, digital inequality, low-resource, machine translation, Turkic languages
Faculty of Information Technology and Artificial Intelligence, Farabi University, Almaty, 050040, Kazakhstan