Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets


Bekarystankyzy A., Mamyrbayev O., Mendes M., Fazylzhanova A., Assam M.
December 2024, Nature Research

Scientific Reports
2024, Volume 14, Issue 1

A reliable and accurate automatic speech recognition (ASR) model requires a sufficient amount of transcribed audio data for training. Many of the world's languages, especially the agglutinative languages of the Turkic family, lack this type of data. Numerous studies have sought better models for low-resource languages using different approaches, the most popular of which are multilingual training and transfer learning. In this study, we combined five agglutinative languages from the Turkic family (Kazakh, Bashkir, Kyrgyz, Sakha, and Tatar) for multilingual training with connectionist temporal classification (CTC) and an attention mechanism, together with a language model. These languages were chosen because they share cognate words, sentence formation rules, and an alphabet (Cyrillic). Data from the open-source Common Voice database was used to make the experiments reproducible. The experiments showed that multilingual training improved ASR performance for every language included except Bashkir. The most dramatic result was achieved for Kyrgyz: the word error rate decreased to nearly one-fifth and the character error rate to one-fourth of their previous values, which indicates that this approach can help critically low-resource languages.
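The two metrics reported above, word error rate (WER) and character error rate (CER), are conventionally defined as the edit distance between the reference transcript and the recognizer output, divided by the reference length. The following is a minimal sketch of that standard definition, not the evaluation code used in the study. The function names and the sample sentence are illustrative assumptions, and the CER here counts spaces as characters, which may differ from the authors' setup.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref prefix of length i to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.

    Assumption: whitespace is counted as a character.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative (made-up) example: one final character is dropped.
reference = "мен үйге бардым"
hypothesis = "мен үйге барды"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because agglutinative languages pack many morphemes into a single word, one wrong suffix makes the whole word count as an error, so WER tends to be much higher than CER for the same output, as in this toy example.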

Agglutinative languages, Attention-based, Conformer, Connectionist temporal classification, End-to-end ASR, Low-resource languages, Multilingual learning


Satbayev University, Almaty, Kazakhstan
Narxoz University, Almaty, Kazakhstan
Institute of Information and Computational Technologies, Almaty, Kazakhstan
Polytechnic Institute of Coimbra, ISEC, Coimbra, Portugal
University of Coimbra, ISR, Coimbra, Portugal
Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan, Institute of Linguistics named after Akhmet Baitursynuly, Almaty, Kazakhstan
University of Science and Technology, KP, Bannu, Pakistan

