Rapid Prototyping of Hierarchical Agglomerative Clustering Algorithms for Distributed Systems
Date
2019
Publisher
IEEE
Abstract
Hierarchical Agglomerative Clustering (HAC) algorithms are used in many applications where clusters have a hierarchical relationship between them. Their parallelization is challenging because every agglomeration step depends on all previous agglomerations. Although a few parallel algorithms have been proposed for the SLINK HAC algorithm, only limited work has been done to parallelize other HAC algorithms. In this paper, we present a high-level abstraction, which provides a uniform way to specify any HAC algorithm, and a framework for automatically parallelizing such algorithms for distributed-memory systems. The abstraction is supported by constructs in a high-level, domain-specific language, and a compiler translates algorithms expressed in this language to efficient parallel code targeting distributed systems. Our experiments on multiple HAC algorithms show that the runtime performance achieved is comparable to that of state-of-the-art manual parallel implementations on Spark and MPI while requiring only a fraction of the programming effort. At runtime, master-slave execution is used, and load is balanced among the slaves in an algorithm-agnostic way, in significant contrast to the custom load-balancing techniques seen in the literature on parallel HAC algorithms.
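
As a rough illustration of the sequential dependence mentioned in the abstract, the following is a minimal, naive Python sketch of generic HAC in which the linkage criterion is passed in as a function. It is not the paper's domain-specific language, compiler output, or parallel framework. The names `hac`, `single_linkage`, and `complete_linkage` are hypothetical and chosen only for this example, and Euclidean distance is assumed.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(a, b):
    """Distance between the closest pair of points across two clusters."""
    return min(dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Distance between the farthest pair of points across two clusters."""
    return max(dist(p, q) for p in a for q in b)


def hac(points, linkage):
    """Naive sequential HAC (illustrative only, not the paper's method).

    Returns the merge sequence as (cluster_id_a, cluster_id_b, distance).
    New clusters receive ids len(points), len(points) + 1, ...
    """
    clusters = {i: [p] for i, p in enumerate(points)}
    next_id = len(points)
    merges = []
    while len(clusters) > 1:
        # Each iteration searches the clusters produced by all earlier
        # merges, which is the dependence that makes parallelization hard.
        a, b = min(
            combinations(clusters, 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        d = linkage(clusters[a], clusters[b])
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 0), (5, 3)]
    for name, fn in [("single", single_linkage), ("complete", complete_linkage)]:
        print(name, hac(pts, fn))
```

Swapping the `linkage` argument changes the algorithm without touching the merge loop, which loosely mirrors the idea of specifying different HAC variants in a uniform way. The abstraction actually proposed in the paper may look quite different.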
Keywords
Computer Science, Hierarchical Agglomerative Clustering, High Performance Computing, Big Data, Automatic Parallelization