BITS Faculty Publications

Permanent URI for this communityhttp://localhost:4000/handle/123456789/1867

Browse

Search Results

Now showing 1 - 2 of 2
  • Item
    AnyStreamKM: Anytime k-medoids Clustering for Streaming Data
    (IEEE, 2022) Challa, Jagat Sesh; Goyal, Navneet; Goyal, Poonam
    Stream Clustering algorithms have gained a lot of importance in the recent past due to rapid rising utilities of IoT systems and applications. Anytime algorithms and frameworks play a key role in handling streams that have data arriving/generating at variable rates. They are capable of handling both slow and fast stream speeds, at the same time generate the result with highest possible accuracy. In this paper, we present AnyStreamKM, which is a framework for anytime k-medoids clustering of data streams. It uses a proposed hierarchical data indexing structure known as AnyKMTree that stores the incoming data from the stream in the form of hierarchy of micro-clusters. AnyKMTree is an adaptation of R-tree with its splitting strategy inspired from the design principles of k-medoids clustering. AnyKMTree not only supports anytime features but is also capable of filtering out noise and outliers. Our experimental analysis establishes that AnyKMTree produces micro-clusters that are more compact and purer than the state-of-the-art methods. Also, when offline k-medoids clustering such as PAM (Partitioning Around Medoids) is applied on the micro-clusters produced by AnyKMTree, the resultant clustering has been found to be of higher quality than the state-of-the-art methods.
  • Item
    Automatic parallelization of representative-based clustering algorithms for multicore cluster systems
    (Springer, 2020-03) Goyal, Navneet; Goyal, Poonam
    Ease of programming and optimal parallel performance have historically been on the opposite side of a trade-off, forcing the user to choose. With the advent of the Big Data era and the rapid evolution of sequential algorithms, the data analytics community can no longer afford the trade-off. We observed that several clustering algorithms often share common traits—particularly, algorithms belonging to the same class of clustering exhibit significant overlap in processing steps. Here, we present our observation on domain patterns in representative-based clustering algorithms and how they manifest as clearly identifiable programming patterns when mapped to a Domain Specific Language (DSL). We have integrated the signatures of these patterns in the DSL compiler for parallelism identification and automatic parallel code generation. The compiler either generates MPI C++ code for distributed memory parallel processing or MPI–OpenMP C++ code for hybrid memory parallel processing, depending upon the target architecture. Our experiments on different state-of-the-art parallelization frameworks show that our system can achieve near-optimal speedup while requiring a fraction of the programming effort, making it an ideal choice for the data analytics community. Results are presented for both distributed and hybrid memory systems.