Abstract:
Ease of programming and optimal parallel performance have historically been on opposite sides of a tradeoff, forcing the user to choose one over the other. With the advent of the Big Data era and the rapid evolution of sequential algorithms, the data analytics community can no longer afford this tradeoff. We observe that several clustering algorithms share common traits; in particular, algorithms belonging to the same class of clustering exhibit significant overlap in their processing steps. Here, we present our observations on domain patterns in representative-based clustering algorithms and show how they manifest as clearly identifiable programming patterns when mapped to a Domain-Specific Language (DSL). We have integrated the signatures of these patterns into the DSL compiler for parallelism identification and automatic parallel code generation. Our experiments on different state-of-the-art parallelization frameworks show that our system achieves near-optimal speedup while requiring a fraction of the programming effort, making it an ideal choice for the data analytics community.