Fine-Tuning Methods and Dataset Structures for Multilingual Neural Machine Translation: A Kazakh–English–Russian Case Study in the IT Domain
Kozhirbayev Z., Yessenbayev Z.
August 2025, Multidisciplinary Digital Publishing Institute (MDPI)
Electronics (Switzerland)
2025, Volume 14, Issue 15
This study explores fine-tuning methods and dataset structures for multilingual neural machine translation using the No Language Left Behind (NLLB) model, with a case study on Kazakh, English, and Russian. We compare single-stage and two-stage fine-tuning approaches, as well as triplet and non-triplet dataset configurations, to improve translation quality. A high-quality dataset of 50,000 triplets in the information technology domain, manually translated and expert-validated, serves as the in-domain benchmark and is complemented by out-of-domain corpora such as KazParC. Evaluations using the BLEU, chrF, METEOR, and TER metrics show that single-stage fine-tuning excels for low-resource pairs (e.g., 0.48 BLEU and 0.77 chrF for Kazakh → Russian), while two-stage fine-tuning benefits high-resource pairs (Russian → English). Triplet datasets improve cross-linguistic consistency compared with non-triplet structures. Our reproducible framework offers practical guidance for adapting neural machine translation to technical domains and low-resource languages.
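To illustrate what a triplet dataset structure could look like in practice, the following is a minimal sketch, not the authors' implementation. It assumes a triplet is one sentence aligned across Kazakh, English, and Russian, and that fine-tuning consumes directed source/target pairs. The function name `expand_triplet`, the dictionary keys, and the sample sentences are hypothetical and are not taken from the paper or its dataset; the language tags are the FLORES-200 codes used by NLLB.

```python
from itertools import permutations

# FLORES-200 language codes used by NLLB for the three languages.
LANG_CODES = {"kk": "kaz_Cyrl", "en": "eng_Latn", "ru": "rus_Cyrl"}


def expand_triplet(triplet):
    """Expand one aligned triplet into all six directed translation pairs.

    `triplet` maps the short keys "kk", "en", "ru" to the same sentence
    in each language (an assumed layout, chosen for illustration).
    """
    pairs = []
    for src, tgt in permutations(LANG_CODES, 2):
        pairs.append(
            {
                "src_lang": LANG_CODES[src],
                "tgt_lang": LANG_CODES[tgt],
                "src_text": triplet[src],
                "tgt_text": triplet[tgt],
            }
        )
    return pairs


# Invented example sentence, not drawn from the paper's dataset.
example = {
    "kk": "Сервер қолжетімсіз.",
    "en": "The server is unavailable.",
    "ru": "Сервер недоступен.",
}

pairs = expand_triplet(example)
```

Under this assumption, every sentence appears in all six translation directions with identical content, which is one plausible reason a triplet structure would encourage cross-linguistic consistency. A non-triplet corpus would instead contain independent sentence pairs per direction.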
fine-tuning setup, IT domain, Kazakh–English–Russian, multilingual machine translation, NLLB, triplet dataset
National Laboratory Astana, Nazarbayev University, Astana, 010000, Kazakhstan
Computer Science Department, School of Computing and Creative Arts, Bina Nusantara University, Jakarta, 10110, Indonesia