BITS Faculty Publications
Permanent URI for this community: http://localhost:4000/handle/123456789/1867
4 results
Search Results
Item: Twitter Data Modelling and Provenance Support for Key-Value Pair Databases (Springer, 2021-02) Goyal, Navneet
In Big Data environments, the reliability of data plays an important role in determining the trustworthiness of the outcomes of an analysis. Big data provenance ensures the reliability of data by providing details about the origin and historical paths of data. In recent years, big data applications have increasingly adopted Apache Cassandra due to its high availability and linear scalability. In this paper, we present a data provenance framework for Key-Value Pair Databases using the concept of Zero-Information Loss Database (ZILD). A large volume of real-time social media data is fetched from Twitter through live streaming using the Twitter Streaming APIs, and then modelled in Apache Cassandra based on a Query-Driven approach. This framework provides efficient provenance capturing support for select, aggregate, update, and historical queries. We evaluate the performance of the proposed framework in terms of provenance capturing and querying capabilities using appropriate query sets.

Item: Big Data and Artificial Intelligence (Springer, 2023) Goyal, Navneet
This book constitutes the proceedings of the 11th International Conference on Big Data and Artificial Intelligence, BDA 2023, held in Delhi, India, during December 7–9, 2023. The 17 full papers presented in this volume were carefully reviewed and selected from 67 submissions.
The papers are organized in the following topical sections: Keynote Lectures, Artificial Intelligence in Healthcare, Large Language Models, Data Analytics for Low Resource Domains, Artificial Intelligence for Innovative Applications, and Potpourri.

Item: A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms (Springer, 2024-06) Challa, Jagat Sesh; Balasubramaniam, Sundar; Goyal, Navneet; Goyal, Poonam
The advent of Big Data has led to rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI, MapReduce, and Spark. An important step for any parallel clustering algorithm is the distribution of data amongst the cluster nodes. This step governs the methodology and performance of the entire algorithm. Researchers typically use a random or a spatial/geometric distribution strategy, such as kd-tree-based partitioning or grid-based partitioning, as per the requirements of the algorithm. However, these strategies are generic and are not tailor-made for any specific parallel clustering algorithm. In this paper, we give a comprehensive literature survey of MPI-based parallel clustering algorithms with special reference to the specific data distribution strategies they employ. We also propose three new data distribution strategies: Parameterized Dimensional Split, for parallel density-based clustering algorithms like DBSCAN and OPTICS; Cell-Based Dimensional Split, for dGridSLINK, which is a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution; and Projection-Based Split, which is a generic distribution strategy. All of these preserve spatial locality, achieve disjoint partitioning, and ensure good data load balancing.
The experimental analysis shows the benefits of using the proposed data distribution strategies for the algorithms they are designed for, based on which we give appropriate recommendations for their usage.

Item: A High Performance Computing Framework for Data Mining (IEEE, 2016) Goyal, Navneet; Goyal, Poonam
Mining large data sets is no longer the prerogative of computer scientists; specialists in a wide variety of domains are performing analytics as a day-to-day activity. Often such analyses are specific to the domain, and analysts are required to devise new algorithms or techniques. For such scenarios, providing a high-level programming environment that delivers high performance on clusters is a challenge. We propose a framework that supports high-level programming using domain abstractions in data mining while delivering scalable performance on commodity clusters, i.e., clusters of multi-core workstations. This framework includes a domain-specific programming language, DWARF, to enable data mining specialists to rapidly prototype algorithms. DWARF is supported by a compiler that automatically parallelizes code by identifying domain-specific patterns and translating them to parallel code that exploits data parallelism and task parallelism. The compiler generates code for a hybrid virtual machine supporting a distributed-memory model at the top level with a shared-memory model nested within. The code generated by the compiler can be scheduled on commodity clusters. We compare the proposed framework with other frameworks commonly used for data mining on distributed platforms.