DSpace Repository

Parallel SLINK for big data

dc.contributor.author Goyal, Navneet
dc.contributor.author Goyal, Poonam
dc.date.accessioned 2022-12-26T10:06:49Z
dc.date.available 2022-12-26T10:06:49Z
dc.date.issued 2019-06
dc.identifier.uri https://link.springer.com/article/10.1007/s41060-019-00188-y
dc.identifier.uri http://dspace.bits-pilani.ac.in:8080/xmlui/handle/123456789/8134
dc.description.abstract The major strength of hierarchical clustering algorithms is that they allow visual interpretation of clusters through dendrograms. Users can cut the dendrogram at different levels to get the desired number of clusters. A major problem with hierarchical algorithms is their quadratic runtime complexity, which limits the amount of data that can be clustered in a reasonable amount of time. Also, because of their agglomerative merging process, each iteration depends on the data of all previous iterations, making them difficult to parallelize. Thus, there is a need for an efficient parallel implementation of the SLINK algorithm that can scale to big data. We present a parallel SLINK algorithm, sGridSLINK, for shared memory architectures. sGridSLINK produces exactly the same dendrogram as the classical SLINK algorithm. We also present hGridSLINK, a parallel algorithm that fully exploits a multi-core cluster system. To the best of our knowledge, there is no hybrid parallel algorithm for SLINK available in the literature. The proposed algorithms exploit spatial locality of data to reduce the number of distance calculations. Adaptive gridding is used to counter skewness in data and to ensure load balancing. Extensive experiments are carried out to establish the efficiency and scalability of the proposed parallel algorithms. sGridSLINK is approximately 840 times faster than the state-of-the-art algorithm using 55 threads on a 48-core machine on a real dataset having 6 million data points. It also achieves a speedup of 47.93 over the best known sequential SLINK, GridSLINK, on a real dataset using 48 threads on a 48-core machine. hGridSLINK achieves a maximum speedup of 68.26 on a 32-node cluster (32×4 processing elements) with respect to GridSLINK. The hGridSLINK algorithm is able to cluster 200 million data points in only 1317 s (less than 22 min). No existing parallel SLINK algorithm is capable of such efficient clustering of big data. en_US
dc.language.iso en en_US
dc.publisher Springer en_US
dc.subject Computer Science en_US
dc.subject SLINK en_US
dc.subject Big Data en_US
dc.title Parallel SLINK for big data en_US
dc.type Article en_US
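
The abstract mentions two ideas that a small example may make concrete: cutting a single-linkage dendrogram at a chosen level, and using a grid to avoid distance calculations between far-apart points. The sketch below is a generic illustration only, not the authors' sGridSLINK or hGridSLINK. It relies on the standard fact that cutting a single-linkage dendrogram at height t yields the connected components of the graph linking points at distance at most t, and it assumes 2-D points and a uniform (non-adaptive) grid. The function name `single_linkage_cut` and its parameters are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def single_linkage_cut(points, threshold):
    """Label 2-D points with the flat clusters obtained by cutting a
    single-linkage dendrogram at height `threshold` (must be > 0).

    Illustrative sketch: uniform grid, sequential, not the paper's method.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Bucket points into square cells whose side equals the threshold.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(floor(x / threshold), floor(y / threshold))].append(idx)

    # Two points within `threshold` of each other must lie in the same
    # cell or in adjacent cells, so only those pairs are compared.
    for (cx, cy), members in cells.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in cells.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j and dist(points[i], points[j]) <= threshold:
                        parent[find(i)] = find(j)

    # Points sharing a root belong to the same flat cluster.
    return [find(i) for i in range(len(points))]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0)]
    print(single_linkage_cut(pts, 0.6))
```

A larger threshold corresponds to a cut higher in the dendrogram and therefore fewer clusters. Note that the grid only saves work when the threshold is small relative to the spread of the data.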


Files in this item

There are no files associated with this item.
