Parallel SLINK for big data

Goyal, Navneet; Goyal, Poonam

Please use this identifier to cite or link to this item: http://dspace.bits-pilani.ac.in:8080/jspui/handle/123456789/8134

Title:	Parallel SLINK for big data
Authors:	Goyal, Navneet; Goyal, Poonam
Keywords:	Computer Science; SLINK; Big Data
Issue Date:	Jun-2019
Publisher:	Springer
Abstract:	The major strength of hierarchical clustering algorithms is that they allow visual interpretation of clusters through dendrograms. Users can cut the dendrogram at different levels to obtain the desired number of clusters. A major problem with hierarchical algorithms is their quadratic runtime complexity, which limits the amount of data that can be clustered in a reasonable amount of time. Also, because of the agglomerative merging process, each iteration depends on the results of all previous iterations, which makes these algorithms difficult to parallelize. Thus, there is a need for an efficient parallel implementation of the SLINK algorithm that can scale to big data. We present a parallel SLINK algorithm, sGridSLINK, for shared-memory architectures. sGridSLINK produces exactly the same dendrogram as the classical SLINK algorithm. We also present hGridSLINK, a parallel algorithm that fully exploits a multi-core cluster system. To the best of our knowledge, there is no hybrid parallel algorithm for SLINK available in the literature. The proposed algorithms exploit spatial locality of the data to reduce the number of distance calculations. Adaptive gridding is used to counter skewness in the data and to ensure load balancing. Extensive experiments are carried out to establish the efficiency and scalability of the proposed parallel algorithms. sGridSLINK is approximately 840 times faster than the state-of-the-art algorithm using 55 threads on a 48-core machine on a real dataset having 6 million data points. It also achieves a speedup of 47.93 over the best-known sequential SLINK algorithm, GridSLINK, on a real dataset using 48 threads on a 48-core machine. hGridSLINK achieves a maximum speedup of 68.26 on a 32-node cluster (32×4 processing elements) with respect to GridSLINK. The hGridSLINK algorithm is able to cluster 200 million data points in only 1317 s (less than 22 min). No existing parallel SLINK algorithm is capable of such efficient clustering of big data.
URI:	https://link.springer.com/article/10.1007/s41060-019-00188-y
	http://dspace.bits-pilani.ac.in:8080/xmlui/handle/123456789/8134
Appears in Collections:	Department of Computer Science and Information Systems
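
For orientation, the classical sequential SLINK algorithm that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration of Sibson's pointer-representation algorithm, not the sGridSLINK or hGridSLINK algorithms from the paper; the function names `slink` and `cut` and the toy data are assumptions made for this example. The inner loop computes a distance for every pair of points, which is the quadratic cost the abstract refers to, and `cut` shows how a dendrogram is cut at a chosen height to obtain flat clusters.

```python
import math


def slink(points, dist):
    """Classical SLINK: return the pointer representation (pi, lam).

    lam[j] is the height at which point j stops being the highest-indexed
    member of its cluster; pi[j] is the highest-indexed member of the
    cluster it joins at that height.
    """
    n = len(points)
    pi = [0] * n
    lam = [math.inf] * n
    m = [0.0] * n  # scratch distances from earlier points to the new point
    for i in range(n):
        pi[i] = i
        lam[i] = math.inf
        # One distance per earlier point: O(n^2) distance calls overall.
        for j in range(i):
            m[j] = dist(points[j], points[i])
        # Update merge heights and propagate distances up the hierarchy.
        for j in range(i):
            if lam[j] >= m[j]:
                m[pi[j]] = min(m[pi[j]], lam[j])
                lam[j] = m[j]
                pi[j] = i
            else:
                m[pi[j]] = min(m[pi[j]], m[j])
        # Relabel pointers so they target the new highest-indexed member.
        for j in range(i):
            if lam[j] >= lam[pi[j]]:
                pi[j] = i
    return pi, lam


def cut(pi, lam, height):
    """Cut the dendrogram at `height`; return one cluster label per point."""
    parent = list(range(len(pi)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for j in range(len(pi)):
        if lam[j] <= height:
            parent[find(j)] = find(pi[j])
    return [find(j) for j in range(len(pi))]


# Toy one-dimensional data (hypothetical, for illustration only).
points = [0.0, 1.0, 5.0, 6.0, 20.0]
pi, lam = slink(points, lambda a, b: abs(a - b))
labels = cut(pi, lam, height=2.0)
```

Cutting at a larger height merges more points into each cluster, so the number of distinct labels can only decrease as `height` grows.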
