BITS Faculty Publications

Permanent URI for this community: http://localhost:4000/handle/123456789/1867

  • Item
    A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms
    (Springer, 2024-06) Challa, Jagat Sesh; Balasubramaniam, Sundar; Goyal, Navneet; Goyal, Poonam
    The advent of Big Data has led to rapid growth in the use of parallel clustering algorithms that run on distributed computing frameworks such as MPI, MapReduce, and Spark. An important step in any parallel clustering algorithm is the distribution of data among the cluster nodes, since this step governs the methodology and performance of the entire algorithm. Researchers typically use either a random strategy or a spatial/geometric strategy, such as kd-tree-based or grid-based partitioning, depending on the requirements of the algorithm. However, these strategies are generic and not tailored to any specific parallel clustering algorithm. In this paper, we give a comprehensive literature survey of MPI-based parallel clustering algorithms, with special reference to the data distribution strategies they employ. We also propose three new data distribution strategies: Parameterized Dimensional Split, for parallel density-based clustering algorithms such as DBSCAN and OPTICS; Cell-Based Dimensional Split, for dGridSLINK, a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution; and Projection-Based Split, a generic distribution strategy. All three preserve spatial locality, achieve disjoint partitioning, and ensure good data load balancing. The experimental analysis shows the benefits of using the proposed strategies for the algorithms they are designed for, and on that basis we give recommendations for their usage.
  • Item
    A High Performance Computing Framework for Data Mining
    (IEEE, 2016) Goyal, Navneet; Goyal, Poonam
    Mining large data sets is no longer the prerogative of computer scientists: specialists in a wide variety of domains now perform analytics as a day-to-day activity. Such analyses are often specific to the domain, and analysts are required to devise new algorithms or techniques. In such scenarios, providing a high-level programming environment that delivers high performance on clusters is a challenge. We propose a framework that supports high-level programming using domain abstractions in data mining while delivering scalable performance on commodity clusters, i.e., clusters of multi-core workstations. The framework includes a domain-specific programming language, DWARF, that enables data mining specialists to rapidly prototype algorithms. DWARF is supported by a compiler that automatically parallelizes code by identifying domain-specific patterns and translating them into parallel code that exploits data parallelism and task parallelism. The compiler generates code for a hybrid virtual machine that supports a distributed-memory model at the top level with a shared-memory model nested within. The generated code can be scheduled on commodity clusters. We compare the proposed framework with other frameworks commonly used for data mining on distributed platforms.
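
The first abstract above names kd-tree based partitioning as one of the generic spatial strategies for distributing data among cluster nodes. As a rough illustration of that general idea only, the sketch below recursively splits a point set at a proportional cut along alternating dimensions. It is not the paper's Parameterized Dimensional Split, Cell-Based Dimensional Split, or Projection-Based Split; the function name `kd_partition` and its parameters are hypothetical, and a real MPI implementation would additionally scatter the resulting groups to ranks.

```python
from typing import List, Sequence

Point = Sequence[float]


def kd_partition(points: List[Point], num_parts: int, depth: int = 0) -> List[List[Point]]:
    """Split `points` into `num_parts` groups of near-equal size.

    Hypothetical helper: a generic kd-tree-style split, assumed here
    purely for illustration. Each level sorts along one dimension
    (cycling through dimensions by depth) and cuts the sorted list in
    proportion to the number of parts assigned to each side.
    """
    if num_parts <= 1:
        return [list(points)]
    if not points:
        # Nothing left to split: return empty groups so the caller
        # still receives exactly `num_parts` partitions.
        return [[] for _ in range(num_parts)]

    dim = depth % len(points[0])              # alternate the split dimension
    ordered = sorted(points, key=lambda p: p[dim])

    left_parts = num_parts // 2
    right_parts = num_parts - left_parts
    # Proportional cut keeps the load balanced even when num_parts
    # is not a power of two. Points that tie on the split coordinate
    # may end up on either side of the cut.
    cut = len(ordered) * left_parts // num_parts

    return (kd_partition(ordered[:cut], left_parts, depth + 1)
            + kd_partition(ordered[cut:], right_parts, depth + 1))


if __name__ == "__main__":
    grid = [(x, y) for x in range(4) for y in range(4)]
    for rank, part in enumerate(kd_partition(grid, 4)):
        print(rank, len(part), part)
```

Each point lands in exactly one group, and neighbouring points tend to stay together, which is the spatial-locality property the abstract attributes to spatial strategies in general. Whether such a generic split suits a given clustering algorithm is the question the surveyed paper examines.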