Department of Computer Science and Information Systems
Permanent URI for this collection: http://localhost:4000/handle/123456789/1928
6 results
Search Results
Item
Grid-R-tree: a data structure for efficient neighborhood and nearest neighbor queries in data mining (Springer, 2020-04)
Goyal, Poonam; Goyal, Navneet; Challa, Jagat Sesh
The use of multi-dimensional indexing structures has gained a lot of attention in data mining. The most commonly used data structures for indexing data are R-tree and its variants, quad-tree, k-d-tree, etc. These data structures support region queries (point, window and neighborhood queries) and nearest neighbor queries. These queries are extensively used in data mining algorithms. Although these data structures facilitate execution of the above queries in logarithmic time, the constraints associated with them become a bottleneck in query execution when used for large and high-dimensional datasets. Moreover, these indexing structures do not cater to the specific data access patterns of data mining algorithms. In this paper, we propose a new data structure, Grid-R-tree, a grid-based R-tree which is specifically designed to address the querying requirements of multiple data mining algorithms. Grid-R-tree is a simple, yet effective adaptation of R-tree using the concept of a grid. We also introduce a new query over Grid-R-tree, called cell-wise epsilon neighborhood query (CellWiseNBH), which captures the locality in the query execution pattern of density-based clustering algorithms, and enables us to redesign them for improving their efficiency. Our theoretical and experimental analysis shows that the proposed data structure outperforms the conventional R-tree in terms of neighborhood and nearest neighbor queries. The experiments were conducted on datasets of size up to 100 million and dimensionality up to 74. The results also suggest that Grid-R-tree improves the efficiency of data mining algorithms such as k-nearest neighbor classifier and DBSCAN clustering (including the redesigned version that uses CellWiseNBH).
Additionally, an adaptive grid optimization has been applied to dense cells whose number of indexed data points exceeds a threshold τ, to keep the load evenly distributed across cells, which resulted in more efficient query performance for datasets that have a skewed distribution of data points.

Item
Scalable Parallel Algorithms for Shared Nearest Neighbor Clustering (IEEE, 2016)
Goyal, Navneet; Goyal, Poonam
Clustering is a popular data mining technique which discovers structure in unlabeled data by grouping objects together on the basis of a similarity criterion. Traditional similarity measures lose their meaning as the number of dimensions increases and, as a consequence, distance or density based clustering algorithms become less meaningful. Shared Nearest Neighbor (SNN) is a solution to clustering high-dimensional data with the ability to find clusters of varying density. SNN assigns objects that share a large number of their nearest neighbors to the same cluster. However, SNN is compute and memory intensive for data of large size and/or dimensionality. Nearest neighbor queries are responsible for a major proportion of computations in SNN, resulting in lower efficiency for higher values of the number of nearest neighbors (k). The main motivation of this work is to improve the efficiency of SNN and to parallelize it so that it can be used for clustering large high-dimensional datasets and for large values of k. Existing SNN algorithms become inefficient in these situations. In this paper, we present a new sequential SNN algorithm, R-SNN, which uses an R-tree for executing neighborhood queries efficiently and exploiting spatial locality to minimize memory usage. R-SNN is benchmarked against the best available implementation of SNN and is found to be up to 77 times faster when tested on various real datasets. R-SNN is parallelized for distributed memory, shared memory, and hybrid systems.
The significant speedup and scalability achieved can be attributed to parallelization, good load balancing strategies, and exploitation of spatial locality. Experimental results demonstrate the same for datasets of varying dimensionality and size. The maximum speedups achieved for the shared, distributed, and hybrid models are 427.19 using 48 threads, 394.24 using 32 processes, and 1380.69 on 32 nodes (with each node spawning 4 threads), respectively.

Item
A High Performance Computing Framework for Data Mining (IEEE, 2016)
Goyal, Navneet; Goyal, Poonam
Mining large data sets is no longer the prerogative of computer scientists - specialists in a wide variety of domains are performing analytics as a day-to-day activity. Often such analyses are specific to the domain and analysts are required to devise new algorithms or techniques. For such scenarios, providing a high-level programming environment that delivers high performance on clusters is a challenge. We propose a framework that supports high-level programming using domain abstractions in data mining while delivering scalable performance on commodity clusters, i.e., clusters of multi-core workstations. This framework includes a domain specific programming language, DWARF, to enable data mining specialists to rapidly prototype algorithms. DWARF is supported by a compiler that automatically parallelizes code by identifying domain specific patterns and translating them to parallel code that exploits data parallelism and task parallelism. The compiler generates code for a hybrid virtual machine supporting a distributed memory model at the top level and a shared memory model nested within. The code generated by the compiler can be scheduled on commodity clusters.
We compare the proposed framework with other frameworks commonly used for data mining on distributed platforms.

Item
DD-Rtree: A dynamic distributed data structure for efficient data distribution among cluster nodes for spatial data mining algorithms (IEEE, 2016)
Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Parallelizing data mining algorithms has become a necessity as we try to mine ever increasing volumes of data. Spatial data mining algorithms like DBSCAN, OPTICS, SLINK, etc. have been parallelized to exploit a cluster infrastructure. The efficiency achieved by existing algorithms can be attributed to spatial locality preservation using spatial indexing structures like k-d-tree, quad-tree, grid files, etc. for distributing data among cluster nodes. However, these indexing structures are static in nature, i.e., they need to scan the entire dataset to determine the partitioning coordinates. This results in high data distribution cost when the data size is large. In this paper, we propose a dynamic distributed data structure, DD-Rtree, which preserves spatial locality while distributing data across compute nodes in a shared nothing environment. Moreover, DD-Rtree is dynamic, i.e., it can be constructed incrementally, making it useful for handling big data. We compare the quality of data distribution achieved by DD-Rtree with that of a recent distributed indexing structure, SD-Rtree. We also compare the efficiency of queries supported by these indexing structures along with the overall efficiency of the DBSCAN algorithm. Our experimental results show that DD-Rtree achieves better data distribution, thereby resulting in improved overall efficiency.

Item
AnySC: Anytime Set-wise Classification of Variable Speed Data Streams (IEEE, 2018-12)
Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Classification of data streams has gained a lot of popularity in recent years owing to its multiple applications.
In certain applications like community detection from text feeds, website fingerprinting attacks, etc., it is more meaningful to associate class labels with groups of objects rather than the individual objects. This kind of classification problem is known as the set-wise classification problem. The few algorithms available in the literature for this problem are budget algorithms, i.e., they are designed to process a fixed maximum stream speed, and are not capable of handling variable and high speed streams. We present ANYSC, which is the first anytime set-wise classification algorithm for data streams. ANYSC handles variable inter-arrival rates of objects in the stream and performs classification of test entities within any available time allowance, using a proposed data structure referred to as CProf-forest. The experimental results show that ANYSC brings in the features of an anytime algorithm and outperforms the existing approaches.

Item
A Comparison of Machine Learning Attributes for Detecting Malicious Websites (IEEE, 2019-01)
Goyal, Navneet
The number of Malicious Websites has increased manifold in the past few years. As of the start of 2018, 1 in every 13 URLs was malicious, amounting to 7.8% of URLs identified as malicious [1]. These figures have increased by 2.8%, showing an increasing trend of attack vectors through Malicious Websites. These statistics clearly highlight the need to detect Malicious Websites on the Internet. Many research works have suggested Machine Learning techniques to detect Malicious Websites. Research has also been done to compare Machine Learning algorithms for their detection. However, the aspect of attribute selection for detecting Malicious Websites using Machine Learning has not been examined in detail. In Machine Learning techniques, attribute selection outweighs the importance of any other aspect in the process. Thus, there is a need to compare and analyze the various attributes that can help find Malicious Websites faster and better.
This paper addresses this research gap, so that fewer, optimal attributes can do a better job.