Adaptive bottleneck transformer for multimodal EEG, audio, and vision fusion


Bralina S., Yazici A., Guan C., Lee M.-H.
25 May 2026, Elsevier Ltd

Expert Systems with Applications
2026#312

Facial and speech expressions are primary cues for emotion recognition, while EEG provides a complementary neural perspective when external signals are ambiguous or absent. Although each modality contributes unique affective information, integrating such heterogeneous signals remains a major challenge in multimodal fusion research. To address this, the Adaptive Multimodal Bottleneck Transformer (AMBT) is introduced as a novel architecture that enables efficient cross-modal interaction through adapters embedded within intermediate Transformer layers. These adapters 1) enhance stability by leveraging bottleneck tokens to prevent premature collapse, 2) enrich backbone representations while preserving unimodal capacity, 3) integrate seamlessly across heterogeneous Transformer architectures, and 4) support parameter-efficient training with fewer than 1% additional trainable parameters. AMBT was evaluated on three benchmark datasets, EAV (85.1%), CREMA-D (90.9%), and DEAP (98.7%), and achieved competitive performance on all of them. These results demonstrate the ability of AMBT to exploit complementary multimodal signals in a computationally efficient manner.
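The "fewer than 1% additional trainable parameters" claim can be made concrete with a rough parameter count. The sketch below is an illustration only: it assumes a frozen ViT-Base-sized backbone, a simple down/up-projection adapter, and a handful of learned bottleneck tokens. All sizes and function names are hypothetical and are not taken from the paper, whose adapter design may differ.

```python
# Back-of-the-envelope parameter count. Every size below is an assumption
# chosen for illustration; none of these numbers come from the paper.

def transformer_layer_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of one standard encoder layer."""
    attention = 4 * (d_model * d_model + d_model)   # Q, K, V, output projections
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * 2 * d_model                   # scale and shift, two norms
    return attention + mlp + layer_norms


def adapter_params(d_model: int, rank: int) -> int:
    """Down-projection to `rank`, then up-projection back, with biases."""
    return (d_model * rank + rank) + (rank * d_model + d_model)


def added_fraction(d_model: int, d_ff: int, n_layers: int,
                   rank: int, n_adapters: int, n_bottleneck_tokens: int) -> float:
    """Share of all parameters contributed by adapters and bottleneck tokens."""
    backbone = n_layers * transformer_layer_params(d_model, d_ff)
    added = (n_adapters * adapter_params(d_model, rank)
             + n_bottleneck_tokens * d_model)       # one learned vector per token
    return added / (backbone + added)


if __name__ == "__main__":
    # Hypothetical setup: 12-layer backbone, 4 adapters of rank 16, 4 tokens.
    frac = added_fraction(d_model=768, d_ff=3072, n_layers=12,
                          rank=16, n_adapters=4, n_bottleneck_tokens=4)
    print(f"added parameters: {100 * frac:.3f}% of the total")
```

Under these assumed sizes the added parameters stay well below the 1% mark, because each adapter scales with d_model times rank while each backbone layer scales with d_model squared.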

Brain-computer interface (BCI), Contrastive learning, EEG-Audio-Vision, Emotion recognition, Multimodal transformer


Nazarbayev University, Department of Computer Science, Astana, 010000, Kazakhstan
Nanyang Technological University, College of Computing and Data Science, Singapore, 639798, Singapore

