Parallel SLINK for big data

Goyal, Navneet; Goyal, Poonam

DSpace Home
→
BITS Faculty Publications
→
Department of Computer Science and Information Systems
→
View Item

dc.contributor.author	Goyal, Navneet
dc.contributor.author	Goyal, Poonam
dc.date.accessioned	2022-12-26T10:06:49Z
dc.date.available	2022-12-26T10:06:49Z
dc.date.issued	2019-06
dc.identifier.uri	https://link.springer.com/article/10.1007/s41060-019-00188-y
dc.identifier.uri	http://dspace.bits-pilani.ac.in:8080/xmlui/handle/123456789/8134
dc.description.abstract	The major strength of hierarchical clustering algorithms is that it allows visual interpretations of clusters through dendrograms. Users can cut the dendrogram at different levels to get desired number of clusters. A major problem with hierarchical algorithms is their quadratic runtime complexity, which limits the amount of data that can be clustered in reasonable amount of time. Also, due to its agglomerative merging process, each iteration depends on the data of all previous iterations, making it difficult to parallelize. Thus, there is a need for an efficient parallel implementation of SLINK algorithm which can scale to big data. We present a parallel SLINK algorithm, sGridSLINK, for shared memory architectures. sGridSLINK produces exactly the same dendrogram as the classical SLINK algorithm. We also present, hGridSLINK, a parallel algorithm which fully exploits a multi-core cluster system. To the best of our knowledge, there is no hybrid parallel algorithm for SLINK available in the literature. The proposed algorithms exploit spatial locality of data to reduce the number of distance calculations. Adaptive gridding is used to counter skewness in data and to ensure load balancing. Extensive experiments are carried out to establish the efficiency and scalability of proposed parallel algorithms. sGridSLINK is approximately 840 times faster than the state-of-the-art algorithm using 55 threads on a 48-core machine on a real dataset having 6 million data points. It also achieves a speedup of 47.93 over the best known sequential SLINK, GridSLINK, on a real dataset using 48 threads on a 48-core machine. hGridSLINK achieves a maximum speedup of 68.26 on a 32-node cluster (32×4 processing elements) with respect to GridSLINK. The hGridSLINK algorithm is able to cluster 200 million data points in only 1317 s (less than 22 min). No existing parallel SLINK algorithm is capable of such efficient clustering of Big Data.	en_US
dc.language.iso	en	en_US
dc.publisher	Springer	en_US
dc.subject	Computer Science	en_US
dc.subject	SLINK	en_US
dc.subject	Big Data	en_US
dc.title	Parallel SLINK for big data	en_US
dc.type	Article	en_US

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following Collection(s)

Department of Computer Science and Information Systems [1099]

Show simple item record

Search DSpace

Advanced Search

Browse

All of DSpace
This Collection
- By Issue Date
- Authors
- Titles
- Subjects

Parallel SLINK for big data

Files in this item

This item appears in the following Collection(s)

Search DSpace

Browse

All of DSpace

This Collection

My Account