Department of Computer Science and Information Systems

  • Item
    An Adaptive Hierarchical Method for Anytime Set-wise Clustering of Variable and High-Speed Data Streams
    (IEEE, 2023) Challa, Jagat Sesh; Goyal, Poonam; Goyal, Navneet
    Set-wise Clustering is a clustering technique for data streams that groups sets of objects based on distribution patterns, applicable in contexts like retail chain clustering, text-based community clustering, restaurant categorization, etc. The existing set-wise clustering method cannot handle variable and high-speed streams with reasonable accuracy. This paper presents an Anytime Set-wise Clustering method for data streams known as ANYSETCLUS. The method handles the variable inter-arrival rates of stream objects using a proposed indexing structure called AnySetClusTree, which stores a hierarchy of micro-clusters of multi-set entities at varying granularity. ANYSETCLUS is highly adaptive as it supports incremental model updates, segregates outliers, enables outlier-to-concept transition, and captures concept drift. The method also enables anytime offline clustering wherein it can generate multiple clusterings of varying granularity and purity depending upon the available time allowance for final clustering. The experimental results affirm the superior efficacy of the proposed method in handling variable and high-speed streams compared to the state-of-the-art method. The experimental results also showcase its effectiveness in achieving significantly higher micro-cluster purity for low and high-speed streams. This contrasts with the state-of-the-art method, which is unable to generate valid clustering results for high-speed streams. The experiments further validate the proposed method’s capability for anytime offline clustering.
  • Item
    Fusion of multivariate time series meteorological and static soil data for multistage crop yield prediction using multi-head self attention network
    (Elsevier, 2023-09) Goyal, Poonam; Goyal, Navneet
    Yield prediction is helpful for timely harvest management, crop planning, and food security. It depends on many factors such as location, climate, soil characteristics, and genotype. The data used in yield prediction is typically a mix of highly dynamic time series (meteorological) data and static (soil) data. We effectively integrate the two data categories to train a deep-learning model. We introduce a novel attribute selection algorithm to select the most discriminating soil features and modify it for depth-level selection, which suggests the most appropriate depth of soil factors for a given crop. We also introduce a novel approach to modeling the problem in which spatiality is handled by clustering locations based on their meteorological and soil characteristics, which allows our model to learn spatial patterns. The variation in sowing and harvesting time across locations is handled by using padded crop cycle data. We have also taken several other design decisions and validated them on existing models. We experimented with NC94 data of the US for three major crops (soybean, wheat, and corn) and predicted yield at the county level. We have also modified our model to perform in-season and multi-time horizon prediction. The results show that our proposed YieldPredictNet outperforms competing techniques.
  • Item
    Parallel Framework for Efficient k-means Clustering
    (ACM Digital Library, 2015-10) Goyal, Navneet; Goyal, Poonam
    Handling and processing larger volumes of data requires efficient data mining algorithms. k-means is a very popular clustering algorithm for data mining, but its performance suffers because of the initial seeding problem. The computation time of the k-means algorithm is directly proportional to the number of data points, the number of dimensions, and the number of iterations; therefore, it is very expensive to process a large number of data points sequentially. We propose an efficient parallel framework which includes dimensionality-reduction as well as data-size reduction techniques to improve k-means processing time and mitigate the initial seeding problem. Our proposed parallel framework leverages the multi-node and multi-core architectures of a typical commodity cluster. We have validated our proposed approaches with real and synthetic datasets in a parallel environment. The experimental results clearly show significant improvements in k-means performance.
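The cost claim in this abstract can be illustrated with a minimal sketch of plain, sequential Lloyd's k-means. This is not the authors' parallel framework: the function names `sq_dist` and `kmeans` are hypothetical, and seeding with the first k points is a deliberately naive assumption chosen to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance: one pass over the d dimensions."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain sequential Lloyd's k-means over a list of equal-length tuples."""
    # Naive seeding (an assumption of this sketch): the first k points.
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: n points x k centroids x d dimensions of work.
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each non-empty cluster's centroid becomes its mean.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = kmeans(points, k=2)
```

Each iteration touches every point, every centroid, and every dimension, which is why the running time grows with the number of points, dimensions, and iterations. A poor choice of initial centroids can also lead to more iterations or a worse final clustering, which is the seeding problem the abstract refers to.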
  • Item
    Multilevel Event Detection, Storyline Generation, and Summarization for Tweet Streams
    (IEEE, 2020) Goyal, Navneet; Goyal, Poonam
    Users acting as real-time sensors post information about current events on various social media sites like Twitter, Facebook, Instagram, and so on. This generates a huge amount of data that requires significant effort to process and filter in order to detect events/topics. The task becomes more challenging when data are generated as a tweet stream because of its speed and the presence of noise, slang, phrases, abbreviations, and so on. In recent years, many approaches have been proposed for detecting either small- or large-scale events individually. There is a lack of a complete solution that provides analysis from different perspectives. We propose a novel approach, Mythos, that detects events and subevents within an event, and generates an abstract summary and a storyline to provide different perspectives on an event. There are three modules in Mythos. An online incremental clustering algorithm identifies small-scale events in the form of small clusters, the event hierarchy generator generates bigger events in the form of hierarchies, and the summarization module produces summaries of events/subevents. The summarization module uses a long short-term memory (LSTM)-based learning model to generate summaries at different levels, from the most abstracted to the most detailed. The summaries at different levels are used to generate a storyline for the event. Our experimental analysis on a variety of Twitter data sets from different domains compares Mythos against known existing approaches for event detection and summarization. It outperforms baseline approaches for both. The generated summaries are evaluated against summaries provided by external reference sources like The Guardian and Wikipedia.
  • Item
    Topical document clustering: two-stage post processing technique
    (Inderscience, 2018) Goyal, Poonam; Goyal, Navneet
    Clustering documents is an essential step in improving the efficiency and effectiveness of information retrieval systems. We propose a two-phase split-merge (SM) algorithm, which can be applied to topical clusters obtained from existing query-context-aware document clustering algorithms, to produce soft topical document clusters. SM is a post-processing technique which combines the advantages of document-pivot and feature-pivot topical document clustering approaches. The split phase splits the topical clusters by relating them to the topics obtained by disambiguating web search results, and converts them into homogeneous soft clusters. In the merge phase, similar clusters are merged by a feature-pivot approach. SM is tested on the outcome of two hierarchical query-context-aware document clustering algorithms on different datasets, including the TREC session-track 2011 dataset. The obtained topical clusters are also updated by an incremental approach as the data stream progresses. The proposed algorithm improves the quality of clustering appreciably in all the experiments conducted.
  • Item
    Phase-Wise Clustering of Time Series Gene Expression Data
    (IEEE, 2011) Goyal, Navneet; Goyal, Poonam
    Extensive studies have shown that analyzing microarray time series data is important in bioinformatics research and biomedical applications. An observation in the analysis of gene expression data is that many genes have similar expression patterns and therefore appear to be co-regulated. Previously, time series gene expression data was analyzed mainly by checking the global similarities between gene expression profiles, and local similarities were overlooked. Local similarities can provide useful insight into gene behavior. In this paper, we propose a clustering algorithm for analyzing time series gene expression data to identify gene clusters based on phase-wise local similarities in the cell cycle. Our approach exploits the fact that genes which are involved in one phase of a cell cycle would have a characteristic profile for time points belonging to that phase and may not be involved in other phases. Moreover, a gene that is clustered with a set of genes in one phase might be involved with a different set of genes in other phases. In the proposed approach, we first cluster the genes at every time point of a phase and then group genes with similar expression profiles, i.e., we group together those genes which remain in the same cluster at every time point within a phase. The functions of genes were obtained from Gene Ontology. In this paper, the results are presented for different phases of a cell cycle. Candidate genes are identified for these phases and their groups are analyzed. We found that the group of candidate genes had few genes which are known to be involved. Furthermore, some genes are found to be involved in more than one phase with different sets of genes. The results show that local similarities can provide useful insight into gene behavior. Results are compared with an existing algorithm, STEM.
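The grouping rule stated in this abstract (genes are grouped when they stay in the same cluster at every time point within a phase) can be sketched as follows. This is only an illustration under stated assumptions: the per-time-point cluster labels are taken as given, and the function name `phase_groups`, the gene names, and the label values are hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def phase_groups(labels_by_time):
    """Group genes that share a cluster label at every time point of a phase.

    labels_by_time: one dict per time point, mapping gene name -> cluster label.
    """
    groups = defaultdict(list)
    for gene in labels_by_time[0]:
        # A gene's signature is its sequence of cluster labels across the phase.
        signature = tuple(labels[gene] for labels in labels_by_time)
        groups[signature].append(gene)
    return sorted(sorted(members) for members in groups.values())


# Two time points in one phase; g1 and g2 are co-clustered at both, g3 is not.
labels_by_time = [
    {"g1": 0, "g2": 0, "g3": 1},
    {"g1": 2, "g2": 2, "g3": 2},
]
groups = phase_groups(labels_by_time)
```

Here `g3` shares a cluster with the other genes at the second time point only, so it ends up in its own group. Running the same function on the labels of a different phase can place a gene with a different set of partners, which is the behavior the abstract describes.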
  • Item
    A Fast, Scalable SLINK Algorithm for Commodity Cluster Computing Exploiting Spatial Locality
    (IEEE, 2016) Goyal, Navneet; Goyal, Poonam
    The single linkage (SLINK) hierarchical clustering algorithm is often preferred over traditional partitioning-based clustering as it does not require the number of clusters as input. However, due to its high time complexity and inherent data dependencies, it does not scale well for large datasets. To the best of our knowledge, all existing parallel SLINK algorithms are based on the traditional SLINK algorithm and thus require a large number of computing resources. In this paper, we present a novel optimization of the SLINK algorithm, GridSLINK, which is an order of magnitude faster than the existing state-of-the-art implementation. The optimization in GridSLINK comes from a reduction in the number of distance calculations required by SLINK. This reduction is achieved by exploiting the spatial locality of data points and using an adaptive gridding technique. GridSLINK is parallelized for distributed memory systems. Scalable performance is achieved for an increasing number of compute nodes. The proposed parallel algorithm, dGridSLINK, is benchmarked against the best existing parallel algorithm in the literature and found to outperform the latter for all the real datasets considered. dGridSLINK can cluster millions of data points in a few seconds or minutes using a small number of processing elements, without compromising the quality of clustering.
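To show where the distance calculations mentioned in this abstract come from, here is a minimal sketch of naive single-linkage clustering cut at a distance threshold. It is neither SLINK nor GridSLINK: it computes all pairwise distances and merges with a union-find structure, and the names `single_linkage` and `threshold` are hypothetical.

```python
from itertools import combinations


def single_linkage(points, threshold):
    """Naive single-linkage clusters: merge any two clusters that contain a
    pair of points no farther apart than `threshold`."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # All n*(n-1)/2 pairwise distances: the cost a grid-based method tries to cut.
    pairs = sorted(
        (sq_dist(points[a], points[b]), a, b)
        for a, b in combinations(range(n), 2)
    )
    for d, a, b in pairs:
        if d > threshold ** 2:
            break  # remaining pairs are even farther apart
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
clusters = single_linkage(points, threshold=1.5)
```

Points 0 and 2 are farther apart than the threshold but still end up together through point 1, which is the chaining behavior characteristic of single linkage. Most of the sorted pairs are never used for a merge, so a method that avoids computing distances between far-apart points can save a large share of the work.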
  • Item
    Scalable Parallel Algorithms for Shared Nearest Neighbor Clustering
    (IEEE, 2016) Goyal, Navneet; Goyal, Poonam
    Clustering is a popular data mining technique which discovers structure in unlabeled data by grouping objects together on the basis of a similarity criterion. Traditional similarity measures lose their meaning as the number of dimensions increases and, as a consequence, distance- or density-based clustering algorithms become less meaningful. Shared Nearest Neighbor (SNN) is a solution for clustering high-dimensional data with the ability to find clusters of varying density. SNN assigns objects that share a large number of their nearest neighbors to the same cluster. However, SNN is compute- and memory-intensive for data of large size and/or dimensionality. Nearest neighbor queries are responsible for a major proportion of the computations in SNN, resulting in lower efficiency for higher values of the number of nearest neighbors (k). The main motivation of this work is to improve the efficiency of SNN and to parallelize it so that it can be used for clustering large high-dimensional datasets and for large values of k. Existing SNN algorithms become inefficient in these situations. In this paper, we present a new sequential SNN algorithm, R-SNN, which uses an R-tree to execute neighborhood queries efficiently and exploits spatial locality to minimize memory usage. R-SNN is benchmarked against the best available implementation of SNN and is found to be up to 77 times faster when tested on various real datasets. R-SNN is parallelized for distributed memory, shared memory, and hybrid systems. The significant speedup and scalability achieved can be attributed to parallelization and good load balancing strategies, and also to the exploitation of spatial locality. Experimental results demonstrate the same for datasets of varying dimensionality and size. The maximum speedups achieved for the shared, distributed, and hybrid models are 427.19 using 48 threads, 394.24 using 32 processes, and 1380.69 on 32 nodes (with each node spawning 4 threads), respectively.
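The shared-nearest-neighbor idea in this abstract can be sketched with a brute-force similarity computation. This is an illustration only, not R-SNN: it uses no R-tree, the function names `knn` and `snn_similarity` are hypothetical, and it assumes one common convention in which two points have non-zero similarity only if each appears in the other's k-nearest-neighbor list.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(points, k):
    """Brute-force k nearest neighbors: one set of neighbor indices per point."""
    neighbors = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        ordered = sorted(others, key=lambda j: sq_dist(p, points[j]))
        neighbors.append(set(ordered[:k]))
    return neighbors


def snn_similarity(neighbors, i, j):
    """Number of shared neighbors, counted only for mutual neighbors
    (an assumed convention for this sketch)."""
    if j not in neighbors[i] or i not in neighbors[j]:
        return 0
    return len(neighbors[i] & neighbors[j])


# Two well-separated groups on a line.
points = [(0,), (1,), (2,), (10,), (11,), (12,)]
neighbors = knn(points, k=2)
same_group = snn_similarity(neighbors, 0, 1)
different_group = snn_similarity(neighbors, 0, 3)
```

The brute-force `knn` step compares every point with every other point, and its output grows with k, which matches the abstract's remark that nearest neighbor queries dominate the cost. A spatial index such as an R-tree is one way to answer those queries without scanning all points.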
  • Item
    A Domain Specific Language for Clustering
    (Springer, 2016-11) Goyal, Navneet; Goyal, Poonam
    Clustering large volumes of data is a complex problem which requires the use of sophisticated algorithms as well as High Performance Computing hardware such as a cluster of computers. It is highly desirable that data mining experts have a solution which, on the one hand, provides a simple interface for expressing their algorithms in terms of domain-specific idioms and, on the other hand, automatically generates parallel code that can run on a cluster of multicore nodes. The proposed Domain Specific Language (DSL), along with its parallelizing compiler, attempts to provide such a solution. In this paper, we give the design of the DSL, called DWARF. Various language constructs are described along with the rationale behind their inclusion in the language. The abstraction provided by DWARF is compared qualitatively with MapReduce, Spark, and MPI-based implementations to establish the usefulness of the proposed clustering DSL.
  • Item
    Pattern-Based Automatic Parallelization of Representative-Based Clustering Algorithms
    (IEEE, 2018) Goyal, Poonam; Goyal, Navneet
    Ease of programming and optimal parallel performance have historically been on opposite sides of a tradeoff, forcing the user to choose. With the advent of the Big Data era and the rapid evolution of sequential algorithms, the data analytics community can no longer afford this tradeoff. We observed that several clustering algorithms often share common traits: in particular, algorithms belonging to the same class of clustering exhibit significant overlap in processing steps. Here, we present our observations on domain patterns in representative-based clustering algorithms and how they manifest as clearly identifiable programming patterns when mapped to a Domain Specific Language (DSL). We have integrated the signatures of these patterns in the DSL compiler for parallelism identification and automatic parallel code generation. Our experiments on different state-of-the-art parallelization frameworks show that our system is able to achieve near-optimal speedup while requiring a fraction of the programming effort, making it an ideal choice for the data analytics community.