Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition


Tu N.A., Aikyn N., Makhanov N., Abu A., Wong K.-S., Lee M.-H.
2024, Institute of Electrical and Electronics Engineers Inc.

IEEE Access
2024, vol. 12, pp. 193141-193164

Few-shot action recognition aims to train a model to classify actions in videos using only a few examples, known as shots, per action class. This learning approach is particularly useful but challenging due to the limited availability of labeled video data in practice. Although significant progress has been made in developing few-shot learners, existing methods still face several limitations. Firstly, current methods have not sufficiently explored the effectiveness of 3D feature extractors (e.g., 3D CNNs or Video Transformers), thereby failing to exploit spatiotemporal dynamics in videos. Secondly, the need for a large video dataset to train the model in a centralized manner raises privacy concerns and results in high storage costs and communication overheads. Thirdly, the existing solutions based on local deployment lack the capability to benefit from global prior knowledge drawn from a wide variety of real-world action samples. To address these limitations, we propose a federated learning (FL) framework named FedFSLAR++ to collaboratively train few-shot learners with 3D feature extractors. Specifically, we perform few-shot action recognition tasks under FL settings, enhancing privacy protection while maintaining efficient communication and storage. Moreover, FL allows us to effectively learn meta-knowledge from a large set of action videos among heterogeneous clients. Within our framework, we establish a unified benchmark to systematically and fairly compare different components, including feature extraction, meta-learning, and FL for model update and aggregation. This type of benchmark is still lacking in the literature. Notably, we thoroughly examine six 3D CNN and Transformer models for extracting spatiotemporal video features needed to adapt to new tasks quickly during the meta-learning process. We further propose a hybrid feature extractor that combines the advantages of 3D CNNs and Transformers to produce strong video representations.
Additionally, we explore three meta-learning paradigms and three FL algorithms to investigate their effectiveness and suggest the optimal choices for performance improvement. Results from extensive experiments on four action datasets verify the robustness of the FedFSLAR++ framework. Our comprehensive study provides a solid foundation for future research advancements in action recognition.
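
The abstract mentions FL algorithms for model update and aggregation without naming them. As an illustration only, the following is a minimal sketch of sample-weighted federated averaging (FedAvg-style), which is an assumption here and not necessarily one of the three algorithms the authors evaluate. The function name `federated_average` and the dictionary-of-lists weight layout are hypothetical choices made for this example.

```python
from typing import Dict, List

# Hypothetical layout: each client's model is a dict mapping a
# parameter name to a flat list of floats.
Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights],
                      client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its local sample count.

    This is a generic FedAvg-style aggregation step, assumed for illustration;
    it is not taken from the FedFSLAR++ paper.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        # Weighted mean of each scalar parameter across all clients.
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


if __name__ == "__main__":
    clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    sizes = [1, 3]
    print(federated_average(clients, sizes))
```

In a setting like the one described, each client would train its few-shot learner locally and send only the updated parameters to the server, so raw videos never leave the client; the server would then apply an aggregation step of this kind before broadcasting the new global model.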

federated learning, few-shot action recognition, few-shot learning, human action recognition, representation learning


Nazarbayev University, School of Engineering and Digital Sciences, Department of Computer Science, Astana, 010000, Kazakhstan
College of Engineering and Computer Science, VinUniversity, Hanoi, Viet Nam

