Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages


Ziyaden A., Yelenov A., Hajiyev F., Rustamov S., Pak A.
2024, PeerJ Inc.

PeerJ Computer Science
2024#10

Background: In natural language processing (NLP), the development and success of advanced language models depend largely on the richness of available linguistic resources. Languages such as Azerbaijani, which is classified as low-resource, often face challenges arising from limited labeled datasets, which hinders effective model training. Methodology: The primary objective of this study was to enhance the effectiveness and generalization capabilities of news text classification models using text augmentation techniques. We address the problem of working with low-resource languages through translation-based augmentation using the Facebook mBart50 model, the Google Translate API, and a combination of mBart50 and Google Translate, thus expanding the capabilities when working with text. Results: The experimental outcomes show improved classification performance when models are trained on the augmented dataset compared with their counterparts trained on the original data. This investigation underscores the potential of combined data augmentation strategies to bolster the NLP capabilities of underrepresented languages. As a result of our research, we have published our labeled text classification dataset and pre-trained RoBERTa model for the Azerbaijani language.
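
The translation-based augmentation described in the methodology can be illustrated with a minimal Python sketch. It assumes the augmentation is a round trip (Azerbaijani to a pivot language and back), which the abstract does not state explicitly. The names `Translator`, `round_trip`, and `augment`, the language codes `"az"` and `"en"`, and the choice of English as pivot are all hypothetical and chosen for illustration. In practice each translator callable would wrap mBart50 or the Google Translate API, and a "combined" pipeline could use one system for the forward pass and the other for the return pass.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
# A real implementation would wrap mBart50 or the Google Translate API here.
Translator = Callable[[str, str, str], str]
Sample = Tuple[str, str]  # (text, label)


def round_trip(text: str, forward: Translator, backward: Translator,
               src: str = "az", pivot: str = "en") -> str:
    """Translate src -> pivot with `forward`, then pivot -> src with `backward`."""
    return backward(forward(text, src, pivot), pivot, src)


def augment(samples: Sequence[Sample],
            pipelines: Sequence[Tuple[Translator, Translator]],
            src: str = "az", pivot: str = "en") -> List[Sample]:
    """Return the original samples plus label-preserving round-trip paraphrases.

    Each pipeline is a (forward, backward) pair, so a combined setup can
    mix two translation systems. Empty or duplicate outputs are skipped.
    """
    augmented = list(samples)
    seen = {text for text, _ in samples}
    for text, label in samples:
        for forward, backward in pipelines:
            candidate = round_trip(text, forward, backward, src, pivot).strip()
            if candidate and candidate not in seen:
                seen.add(candidate)
                augmented.append((candidate, label))
    return augmented
```

Passing the translators in as callables keeps the sketch independent of any particular translation backend; the deduplication step matters because a round trip often returns the input unchanged, which would add no new training signal.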

Azerbaijani language, Deep learning, Low-resource language, Machine learning, Natural language processing, Text augmentation, Text classification

Kazakh-British Technical University, Almaty, Kazakhstan
Institute of Information and Computational Technologies, Almaty, Kazakhstan
Nazarbayev University, Astana, Kazakhstan
School of Information Technologies and Engineering, ADA University, Baku, Azerbaijan
