Abstract:
DBSCAN is one of the most popular density-based clustering algorithms, capable of identifying arbitrarily shaped clusters and noise. However, it is computationally expensive for large datasets. In this paper, we present a grid-based DBSCAN algorithm, GridDBSCAN, which is significantly faster than the state-of-the-art sequential DBSCAN. GridDBSCAN achieves its efficiency by using spatial locality information to reduce the number of neighborhood queries, without compromising the quality of the clusters. We also propose scalable parallel implementations of GridDBSCAN that leverage a multicore commodity cluster. The clustering results of GridDBSCAN and its parallel implementations are exactly the same as those of classical DBSCAN. The performance of the proposed algorithms, both sequential and parallel, is benchmarked against state-of-the-art algorithms on various real datasets. Experimental results show considerable performance improvements for GridDBSCAN and its parallel implementations.
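As a rough illustration of how a grid can exploit spatial locality in a neighborhood query, the following minimal sketch bins points into cells of side eps and scans only the adjacent cells when looking for eps-neighbors. It is not the GridDBSCAN algorithm itself: the cell size, the function names build_grid and region_query, and the use of Euclidean distance are assumptions made for illustration only.

```python
import math
from collections import defaultdict
from itertools import product


def build_grid(points, eps):
    """Map each grid cell (tuple of ints) to the indices of the points inside it.

    Assumption for this sketch: cells are axis-aligned with side length eps.
    """
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(math.floor(c / eps) for c in p)].append(i)
    return grid


def region_query(points, grid, eps, i):
    """Return indices of all points within eps of points[i] (including i).

    Two points within eps of each other differ by at most eps in every
    coordinate, so their cell indices differ by at most 1 per dimension.
    Only the 3^d cells around the query point need to be scanned.
    """
    p = points[i]
    cell = tuple(math.floor(c / eps) for c in p)
    neighbors = []
    for offset in product((-1, 0, 1), repeat=len(cell)):
        key = tuple(c + o for c, o in zip(cell, offset))
        for j in grid.get(key, ()):  # .get avoids creating empty cells
            if math.dist(p, points[j]) <= eps:
                neighbors.append(j)
    return neighbors


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (0.9, 0.9)]
    grid = build_grid(pts, eps=1.0)
    print(sorted(region_query(pts, grid, 1.0, 0)))
```

The query returns the same neighbor set as a brute-force scan over all points, which is the property that lets a grid-accelerated variant reproduce the classical DBSCAN result while touching far fewer points per query.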