Unified Emotion-Dialect Speech Platform for Turkic Languages via Multi-Task High-Dimensional Modeling

Zulkhazhav A., Altaibek M., Bekmanova G., Omarbekova A., Yergesh B., Kabdylova D.
2025 Institute of Electrical and Electronics Engineers Inc.

International Conference on Computer Science and Engineering, UBMK
2025, Issue 2025, pp. 1738-1743

Controllable speech synthesis for low-resource languages remains difficult, especially when users require explicit emotion or dialect control without training data. We propose a practical pipeline that achieves zero-shot controllable generation by combining voice conversion (VC) with F0-contour transfer. The system uses a Diffusion-Transformer (DiT) VC backbone and a BigVGAN vocoder, while prosody is manipulated at inference time: RMVPE extracts frame-level F0; an Auto-F0-Adjust module normalizes pitch range across speakers; optional semitone pitch-shifting modulates expressivity; and dynamic time warping aligns the target and source contours. The approach preserves linguistic content and timbre from the source while injecting target prosody (emotion or dialectal intonation), and it requires no additional acoustic training. Evaluated on Kazakh speech, the system achieves mean opinion scores (MOS) of at least 4.1, ABX perceptual identification rates of 77-83% for target emotion/dialect cues, and Pearson correlations of 0.75-0.82 between converted and reference F0 contours. It operates with a real-time factor of approximately 0.4 for short utterances and approximately 0.7 for longer ones, which supports interactive use. Our contributions are: (1) a deployable, training-free VC+prosody transfer pipeline; (2) an explicit F0 control interface (Auto-F0-Adjust + semitone shifting) that generalizes across speakers; and (3) an evaluation protocol combining MOS, ABX, and F0 correlation to quantify controllability. This work demonstrates a realistic path to emotion/dialect-controlled synthesis in low-resource settings.
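
The inference-time prosody steps named in the abstract (pitch-range normalization, semitone shifting, dynamic time warping, and F0 correlation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every function name is hypothetical, the F0 contours are assumed to be already extracted (for example by RMVPE) as lists of Hz values with 0 marking unvoiced frames, and the Auto-F0-Adjust step is assumed here to be a simple median log-F0 match, which the abstract does not specify. The DiT voice-conversion backbone and the BigVGAN vocoder are out of scope.

```python
import math
from statistics import median


def semitone_shift(f0, semitones):
    """Scale voiced F0 values (Hz) by 2**(semitones/12); 0 stays unvoiced."""
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0 else 0.0 for f in f0]


def auto_f0_adjust(target_f0, source_f0):
    """Assumed normalization: match the target's median log-F0 to the source's."""
    log_t = [math.log(f) for f in target_f0 if f > 0]
    log_s = [math.log(f) for f in source_f0 if f > 0]
    if not log_t or not log_s:
        return list(target_f0)  # nothing voiced to normalize against
    ratio = math.exp(median(log_s) - median(log_t))
    return [f * ratio if f > 0 else 0.0 for f in target_f0]


def dtw_path(a, b):
    """Classic DTW with absolute-difference cost; returns aligned index pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def transfer_contour(source_f0, target_f0, semitones=0.0):
    """Warp the adjusted target contour onto the source's frame timeline."""
    adjusted = semitone_shift(auto_f0_adjust(target_f0, source_f0), semitones)
    buckets = [[] for _ in source_f0]
    for i, j in dtw_path(source_f0, adjusted):
        buckets[i].append(adjusted[j])
    out = []
    for b in buckets:
        voiced = [v for v in b if v > 0]
        # Average the voiced target frames aligned to this source frame.
        out.append(sum(voiced) / len(voiced) if voiced else 0.0)
    return out


def pearson(x, y):
    """Pearson correlation between two equal-length contours."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)
```

Calling `transfer_contour(source_f0, target_f0, semitones=2.0)` would return a contour with one value per source frame, which could then be compared against a reference contour of the same length with `pearson`. The quadratic-time DTW keeps the sketch short; a banded or library implementation would be the more likely choice for long utterances.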

BigVGAN, controllable speech synthesis, dialectal intonation, diffusion model, dynamic time warping, emotional speech, F0 contour transfer, low-resource languages, prosody control, RMVPE, voice conversion

L.N. Gumilyov Eurasian National University, Department of Digital Development, Astana, Kazakhstan
