A Multimodal Framework for Speech Emotion Recognition in Low-Resource Languages

Altaibek M., Zulkazhav A., Yergesh B., Bekmanova G., Aibol T.
30 September 2025 Intelligence Science and Technology Press Inc.

Journal of Artificial Intelligence and Technology
2025, #5, pp. 354-364

Speech emotion recognition (SER) plays a crucial role in enhancing human–computer interaction by identifying emotional states in speech. However, low-resource languages like Kazakh face challenges due to limited datasets and linguistic tools. To address this problem, we propose a novel multimodal framework, KEMO (Kazakh Emotion Multimodal Optimizer), which combines text-based semantic analysis and audio emotion recognition to leverage complementary features of linguistic and paralinguistic data. Using a Kazakh-translated version of the DAIR-AI (Contextualized Affect Representations for Emotion Recognition) dataset for text and the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset for audio, we have developed a system capable of classifying six emotions from text and eight emotions from audio. By integrating outputs from speech-to-text and audio-based recognition models with adaptive weighting, KEMO significantly improves the accuracy and robustness of emotion classification, providing an effective solution for SER in low-resource language scenarios.
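The abstract does not spell out how the adaptive weighting is computed, so the following is only a minimal sketch of one possible late-fusion scheme, not the authors' implementation. It assumes that each model outputs a probability vector, that each modality is weighted by a confidence score (here, one minus normalized entropy), and that the eight RAVDESS labels can be renamed onto the six DAIR-AI labels where they overlap. The function names, the `AUDIO_TO_SHARED` mapping, and the entropy-based weighting are all illustrative assumptions.

```python
import math

# Label sets of the two source datasets.
TEXT_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]
AUDIO_LABELS = ["neutral", "calm", "happy", "sad",
                "angry", "fearful", "disgust", "surprised"]

# Hypothetical renaming of audio labels onto the text label names.
AUDIO_TO_SHARED = {"happy": "joy", "sad": "sadness", "angry": "anger",
                   "fearful": "fear", "surprised": "surprise"}


def to_shared(probs, labels, rename=None):
    """Return a {shared_label: probability} dict for one modality."""
    rename = rename or {}
    return {rename.get(label, label): p for label, p in zip(labels, probs)}


def confidence(probs):
    """One minus normalized entropy: near 1 for a peaked distribution,
    near 0 for a uniform one (assumed confidence measure)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return max(0.0, 1.0 - entropy / math.log(len(probs)))


def fuse(text_probs, audio_probs):
    """Confidence-weighted average of the two distributions."""
    w_text, w_audio = confidence(text_probs), confidence(audio_probs)
    total = w_text + w_audio
    if total < 1e-12:
        # Neither model is informative: fall back to equal weights.
        w_text = w_audio = 0.5
    else:
        w_text, w_audio = w_text / total, w_audio / total

    text = to_shared(text_probs, TEXT_LABELS)
    audio = to_shared(audio_probs, AUDIO_LABELS, AUDIO_TO_SHARED)
    # A label missing from one modality contributes probability 0 there.
    fused = {
        label: w_text * text.get(label, 0.0) + w_audio * audio.get(label, 0.0)
        for label in sorted(set(text) | set(audio))
    }
    best = max(fused, key=fused.get)
    return best, fused, (w_text, w_audio)


if __name__ == "__main__":
    text_probs = [0.05, 0.80, 0.05, 0.04, 0.03, 0.03]   # text model favors "joy"
    audio_probs = [0.125] * 8                            # audio model is undecided
    label, fused, weights = fuse(text_probs, audio_probs)
    print(label, weights)
```

In this sketch an undecided modality receives a weight close to zero, so the more confident model dominates the fused prediction. Labels that exist in only one dataset (for example "love" or "disgust") are kept in the fused distribution rather than dropped, which is one of several reasonable design choices.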

deep learning, Kazakh language, KEMO, low-resource languages, multimodal learning, speech emotion recognition

L.N.Gumilyov Eurasian National University, Astana, Kazakhstan
Astana International University, Astana, Kazakhstan

