How to Use K-means for Big Data Clustering?
Mussabayev R., Mladenovic N., Jarboui B., Mussabayev R.
May 2023, Elsevier Ltd
Pattern Recognition
2023#137
K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data using as few of the following computational resources as possible: data, time, and algorithmic ingredients. We propose a new parallel scheme for using the K-means and K-means++ algorithms for big data clustering that satisfies the properties of a “true big data” algorithm and outperforms the classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem without using additional metaheuristics. This work shows that data decomposition is the basic approach to solving the big data clustering problem. The empirical success of the new algorithm allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.
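The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a decomposition-based use of K-means and K-means++ could look like, not the authors' method. It assumes that the data is decomposed into uniform random chunks, that each chunk is clustered independently (K-means++ seeding followed by Lloyd iterations), and that the centers with the lowest chunk-level MSSC objective are kept. All function names and parameters (`chunked_kmeans`, `chunk_size`, `n_chunks`) are hypothetical.

```python
import numpy as np


def sq_dists(X, centers):
    """Squared Euclidean distance from every point to every center."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def mssc_objective(X, centers):
    """MSSC objective: sum of squared distances to the nearest center."""
    return float(sq_dists(X, centers).min(axis=1).sum())


def kmeanspp_init(X, k, rng):
    """K-means++ seeding: sample centers proportionally to squared distance."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = sq_dists(X, np.array(centers)).min(axis=1)
        total = d2.sum()
        # Fall back to uniform sampling if all points coincide with a center.
        probs = d2 / total if total > 0 else np.full(len(X), 1.0 / len(X))
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)


def lloyd(X, centers, n_iter=20):
    """Plain K-means (Lloyd) iterations starting from the given centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        labels = sq_dists(X, centers).argmin(axis=1)
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    return centers


def chunked_kmeans(X, k, chunk_size=256, n_chunks=10, seed=0):
    """Hypothetical decomposition scheme: cluster random chunks, keep the best.

    Each chunk is processed independently, so the loop body could run in
    parallel; it is sequential here only to keep the sketch short.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    best_centers, best_obj = None, np.inf
    for _ in range(n_chunks):
        idx = rng.choice(len(X), size=min(chunk_size, len(X)), replace=False)
        chunk = X[idx]
        centers = lloyd(chunk, kmeanspp_init(chunk, k, rng))
        obj = mssc_objective(chunk, centers)  # evaluated on the chunk only
        if obj < best_obj:
            best_centers, best_obj = centers, obj
    return best_centers
```

Calling `chunked_kmeans(X, k=3)` on an array `X` of shape `(n_samples, n_features)` returns an array of `k` centers. Only `chunk_size` points are touched per K-means run, which is the sense in which such a scheme uses little data per step; the full-data objective can be checked afterwards with `mssc_objective(X, centers)`.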
Big data, Clustering, Decomposition, Divide and conquer algorithm, Global optimization, K-means, K-means++, Minimum sum-of-squares, Unsupervised learning
Laboratory for Analysis and Modeling of Information Processes, Institute of Information and Computational Technologies, Pushkin str. 125, Almaty, 050010, Kazakhstan
Department of Mathematics, University of Washington, Padelford Hall C-138, Seattle, 98195-4350, WA, United States
Department of Industrial Management, Higher Colleges of Technology, St # 19, Abu Dhabi, 25026, United Arab Emirates