Restoring Punctuation and Capitalization in Kazakh: A BERT-Based Approach for Text Normalization


Umbet S., Kozhirbayev Z.
2025, Institute of Electrical and Electronics Engineers Inc.

International Conference on Computer Science and Engineering, UBMK
2025, Issue 2025, pp. 326-330

This paper introduces a punctuation and capitalization (PC) restoration model for Kazakh, developed using the bert-base-multilingual-uncased model within NVIDIA's NeMo framework. The model was trained on a curated dataset of preprocessed Kazakh Wikipedia articles and digitized books, stripped of irrelevant symbols and data. It performs strongly at predicting punctuation marks and word casing, achieving micro-average F1 scores of 98.04 for capitalization and 96.75 for punctuation. However, it struggles with rare markers such as exclamation marks (F1: 32.85). While capitalization is restored consistently well, punctuation restoration remains challenging, particularly for less frequent markers. The model lays a solid foundation for improving automatic speech recognition (ASR) outputs and advancing natural language processing (NLP) applications in Kazakh, contributing to better post-processing and text normalization for low-resource languages.
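To illustrate the kind of per-word prediction target such a model works with, here is a minimal sketch of turning punctuated text into unpunctuated lowercase input plus per-word labels. It is not the authors' preprocessing pipeline. The two-character label scheme (punctuation mark that follows the word, then `U` for capitalized or `O` for not) is an assumption modeled on the format described for NeMo's punctuation and capitalization models, and the names `PUNCT_MARKS` and `make_labels` are hypothetical.

```python
# Hypothetical sketch: derive training labels from punctuated text.
# Assumed label scheme: first character = punctuation after the word
# ("O" if none), second character = "U" if the word is capitalized, else "O".

PUNCT_MARKS = ",.?!"  # assumed label set; the paper's exact set is not given here


def make_labels(sentence: str) -> tuple[list[str], list[str]]:
    """Return (lowercased words without punctuation, per-word labels)."""
    words: list[str] = []
    labels: list[str] = []
    for token in sentence.split():
        # Punctuation label: the trailing mark, if it is one we track.
        punct = token[-1] if token[-1] in PUNCT_MARKS else "O"
        core = token.rstrip(PUNCT_MARKS)
        if not core:
            # Token consisted only of punctuation; nothing to label.
            continue
        # Capitalization label: based on the first character of the word.
        cap = "U" if core[0].isupper() else "O"
        words.append(core.lower())
        labels.append(punct + cap)
    return words, labels


words, labels = make_labels("Сәлем, бұл Астана!")
print(words)   # ['сәлем', 'бұл', 'астана']
print(labels)  # [',U', 'OO', '!U']
```

Under this scheme a rare mark such as `!` contributes few training labels compared with `O` or `,`, which is consistent with the abstract's observation that infrequent markers are harder to restore.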

BERT-based model, capitalization, Kazakh language, low-resource languages, punctuation restoration


University of Trier, Trier, Germany
Nazarbayev University, National Laboratory Astana, Astana, Kazakhstan

