Development of Deep Learning Methods for Visual Document Classification Using Hybrid Vision Transformer–EfficientNet Architecture

Ashimgaliyev M., Raja Joseph E., Zhumadillayeva A., Baimakhanova A.
2026 Institute of Electrical and Electronics Engineers Inc.

IEEE Access
2026, #14, pp. 28041-28053

The rapid expansion of digital archives and scanned document collections has underscored the importance of reliable and efficient document classification techniques. Traditional methods that combine optical character recognition (OCR) with classical machine learning often fall short when processing diverse, low-quality, or unstructured archival documents. In response, this study introduces a hybrid deep learning framework, HybridViT, that merges a Vision Transformer (ViT) with EfficientNet for classifying visual documents. The EfficientNet component captures detailed local features, while the ViT component captures broader contextual information. These complementary representations are combined through a feature fusion mechanism, improving classification accuracy. Tested on a dataset of archival materials, the HybridViT model reached an overall accuracy of 98.2%, surpassing standard CNN (92.3%) and standalone ViT (94.1%) models. Precision and recall also improved by around 3–5%, and the model was more resilient to noise and distortions. A prototype information system was also built to integrate the classification engine with a user-friendly interface backed by a structured database. These findings highlight the promise of hybrid transformer-CNN models for automating document classification in digital repositories and improving access to large archival datasets. Unlike earlier YOLO-based methods that concentrated on natural imagery or artificial document datasets, this research specifically addresses manually scanned archival documents. It conducts a structured comparison of YOLOv4, YOLOv5, and YOLOv8 using a unified training setup, evaluating both detection metrics and deployment-relevant factors on real archival scan data.
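
The abstract does not say how the feature fusion mechanism is implemented. As a rough illustration only, the sketch below assumes the simplest common choice: the two feature vectors are concatenated and passed through one linear layer followed by softmax. The function names, toy dimensions, and random weights are hypothetical stand-ins; in a real system the vectors would come from pretrained EfficientNet and ViT backbones, and the fusion layer would be trained.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(local_feats, global_feats, weights, biases):
    """Concatenate two feature vectors, then apply one linear layer and softmax.

    local_feats  -- stand-in for pooled CNN (EfficientNet-style) features
    global_feats -- stand-in for a ViT-style global embedding
    weights      -- one row of len(local_feats) + len(global_feats) weights per class
    biases       -- one bias per class
    """
    # Fusion by concatenation: an assumption, not the paper's stated method.
    fused = list(local_feats) + list(global_feats)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Toy dimensions and random values, chosen only for illustration.
rng = random.Random(0)
LOCAL_DIM, GLOBAL_DIM, NUM_CLASSES = 4, 3, 5
local_feats = [rng.uniform(-1, 1) for _ in range(LOCAL_DIM)]
global_feats = [rng.uniform(-1, 1) for _ in range(GLOBAL_DIM)]
weights = [
    [rng.uniform(-1, 1) for _ in range(LOCAL_DIM + GLOBAL_DIM)]
    for _ in range(NUM_CLASSES)
]
biases = [0.0] * NUM_CLASSES

probs = fuse_and_classify(local_feats, global_feats, weights, biases)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Concatenation keeps both feature sets intact and lets the classifier weight them independently. Other fusion schemes, such as weighted sums or attention-based fusion, would fit the abstract's description equally well.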

complex document classification, EfficientNet, multi-headed neural network, optical character reader, Transformers, VGG19


L. N. Gumilyov Eurasian National University, Faculty of Information Technologies, Astana, Kazakhstan
Faculty of Engineering and Technology, Multimedia University, Melaka, Malaysia
Khoja Akhmet Yassawi International Kazakh-Turkish University, Department of Computer Engineering, Turkistan, Kazakhstan

