Department of Computer Science and Information Systems
Permanent URI for this collectionhttp://localhost:4000/handle/123456789/1928
Browse
3 results
Search Results
Item Grid-R-tree: a data structure for efficient neighborhood and nearest neighbor queries in data mining(Springer, 2020-04) Goyal, Poonam; Goyal, Navneet; Challa, Jagat SeshThe use of multi-dimensional indexing structures has gained a lot of attention in data mining. The most commonly used data structures for indexing data are R-tree and its variants, quad-tree, k-d-tree, etc. These data structures support region queries (point, window and neighborhood queries) and nearest neighbor queries. These queries are extensively used in data mining algorithms. Although these data structures facilitate execution of the above queries in logarithmic time, the constraints associated with them become bottleneck in query execution, when used for large and high-dimensional datasets. Moreover, these indexing structures do not cater to specific data access patterns of data mining algorithms. In this paper, we propose a new data structure Grid-R-tree, a grid based R-tree which is specifically designed to address the querying requirements of multiple data mining algorithms. Grid-R-tree is a simple, yet effective adaptation of R-tree using the concept of Grid. We also introduce a new query over Grid-R-tree, called cell-wise epsilon neighborhood query (CellWiseNBH), which captures the locality in query execution pattern of density-based clustering algorithms, and enables us to redesign them for improving their efficiency. Our theoretical and experimental analysis shows that the proposed data structure outperforms the conventional R-tree in terms of neighborhood and nearest neighbor queries. The experiments were conducted on datasets of size up to 100 million and dimensionality up to 74. The results also suggest that Grid-R-tree improves the efficiency of data mining algorithms such as k-nearest neighbor classifier and DBSCAN clustering (including the redesigned version that uses CellWiseNBH). Additionally, an adaptive grid optimization has been applied on dense cells that have number of indexed data points greater than a threshold τ to keep equal load distribution in the cells, which resulted in more efficient query performance for datasets that have skewed distribution of data points.Item DD-Rtree: A dynamic distributed data structure for efficient data distribution among cluster nodes for spatial data mining algorithms(IEEE, 2016) Goyal, Navneet; Goyal, Poonam; Challa, Jagat SeshParallelizing data mining algorithms has become a necessity as we try to mine ever increasing volumes of data. Spatial data mining algorithms like Dbscan, Optics, Slink, etc. have been parallelized to exploit a cluster infrastructure. The efficiency achieved by existing algorithms can be attributed to spatial locality preservation using spatial indexing structures like k-d-tree, quad-tree, grid files, etc. for distributing data among cluster nodes. However, these indexing structures are static in nature, i.e., they need to scan the entire dataset to determine the partitioning coordinates. This results in high data distribution cost when the data size is large. In this paper, we propose a dynamic distributed data structure, DD-Rtree, which preserves spatial locality while distributing data across compute nodes in a shared nothing environment. Moreover, DD-Rtree is dynamic, i.e., it can be constructed incrementally making it useful for handling big data. We compare the quality of data distribution achieved by DD-Rtree with one of the recent distributed indexing structure, SD-Rtree. We also compare the efficiency of queries supported by these indexing structures along with the overall efficiency of DBSCAN algorithm. Our experimental results show that DD-Rtree achieves better data distribution and thereby resulting in improved overall efficiency.Item AnySC: Anytime Set-wise Classification of Variable Speed Data Streams(IEEE, 2018-12) Goyal, Navneet; Goyal, Poonam; Challa, Jagat SeshClassification of data streams has gained a lot of popularity in recent years owing to its multiple applications. In certain applications like community detection from text feeds, website fingerprinting attack, etc., it is more meaningful to associate class labels with groups of objects rather than the individual objects. This kind of classification problem is known as the set-wise classification problem. The few algorithms available in literature for this problem are budget algorithms, i.e. they are designed to process fixed maximum stream speed, and are not capable of handling variable and high speed streams. We present ANYSC which is the first anytime set-wise classification algorithm for data streams. ANYSC handles variable inter-arrival rate of objects in the stream and performs classification of test entities within any available time allowance, using a proposed data structure referred to as CProf-forest. The experimental results show that ANYSC brings in the features of an anytime algorithm and outperforms the existing approaches.