Federated Aerial Video Captioning With Effective Temporal Adaptation

Tu N.A., Makhanov N., Taniyev K., Duc Do T.
2026 Institute of Electrical and Electronics Engineers Inc.

IEEE Geoscience and Remote Sensing Letters
2026, vol. 23

Aerial video captioning (VC) facilitates the automatic interpretation of dynamic scenes in remote sensing (RS), supporting critical applications such as disaster response, traffic monitoring, and environmental surveillance. However, challenges such as extreme angles and continuous camera motion require adaptive modeling of complex temporal relationships. To tackle these challenges, we leverage an image-language model as the vision encoder and introduce a temporal adaptation module that combines convolution with self-attention layers to capture local semantics across neighboring frames and to model global temporal dependencies. This design allows our model to exploit the multimodal knowledge of the vision encoder while reasoning effectively over spatiotemporal dynamics. In addition, privacy concerns often restrict access to annotated aerial datasets, which further complicates model training. To address this, we develop a federated learning (FL) framework that enables collaborative model training across decentralized clients. Within this framework, we establish a unified benchmark for the systematic comparison of temporal adapters, text decoders, and FL strategies, thereby filling a gap in the existing literature. Extensive experiments validate the robustness of our approach and its potential for advancing aerial VC.
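The two ideas named in the abstract, a temporal adapter that pairs convolution with self-attention and collaborative training across clients, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give layer sizes, projections, or the aggregation rule, so the fixed smoothing kernel, the identity query/key/value projections, the residual connection, and the use of FedAvg-style weighted averaging are all assumptions made for illustration. The function names (`temporal_conv`, `self_attention`, `temporal_adapter`, `fedavg`) are hypothetical.

```python
import math

def temporal_conv(frames, kernel):
    """1D convolution over the time axis with zero padding.

    frames: list of T per-frame feature vectors (each a list of D floats).
    kernel: odd-length list of weights shared across feature dimensions
            (an assumed fixed kernel; a real adapter would learn it).
    """
    T, D = len(frames), len(frames[0])
    half = len(kernel) // 2
    out = []
    for t in range(T):
        row = [0.0] * D
        for k, w in enumerate(kernel):
            s = t + k - half              # index of the neighboring frame
            if 0 <= s < T:                # frames outside the clip count as zeros
                for d in range(D):
                    row[d] += w * frames[s][d]
        out.append(row)
    return out

def softmax(xs):
    m = max(xs)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def self_attention(frames):
    """Single-head scaled dot-product attention over frames.

    Uses identity Q/K/V projections (an assumption that keeps the sketch short).
    """
    D = len(frames[0])
    scale = 1.0 / math.sqrt(D)
    out = []
    for q in frames:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, frames)) for d in range(D)])
    return out

def temporal_adapter(frames, kernel=(0.25, 0.5, 0.25)):
    """Local mixing (convolution), then global mixing (attention), plus a residual."""
    local = temporal_conv(frames, list(kernel))
    global_ctx = self_attention(local)
    return [[f + g for f, g in zip(fr, gr)] for fr, gr in zip(frames, global_ctx)]

def fedavg(client_weights, client_sizes):
    """Average flat client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    n = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(n)
    ]
```

In this sketch a server would call `fedavg` once per communication round on the adapter and decoder parameters returned by each client, while the raw aerial videos and captions stay on the clients. Calling `temporal_adapter` on a list of per-frame features from a frozen image-language encoder returns a list of the same shape, in which each frame has been mixed first with its neighbors and then with the whole clip.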

Aerial videos, deep learning, federated learning (FL), language models, video captioning (VC)


Nazarbayev University, School of Engineering and Digital Sciences, Astana, 010000, Kazakhstan

Nazarbayev University
