Browsing by Author "Goyal, Poonam"

An Adaptive Hierarchical Method for Anytime Set-wise Clustering of Variable and High-Speed Data Streams
(IEEE, 2023) Challa, Jagat Sesh; Goyal, Poonam; Goyal, Navneet
Set-wise Clustering is a clustering technique for data streams that groups sets of objects based on distribution patterns, applicable in contexts like retail chain clustering, text-based community clustering, restaurant categorization, etc. The existing set-wise clustering method cannot handle variable and high-speed streams with reasonable accuracy. This paper presents an Anytime Set-wise Clustering method for data streams known as ANYSETCLUS. The method handles the variable inter-arrival rates of stream objects using a proposed indexing structure called AnySetClusTree, which stores a hierarchy of micro-clusters of multi-set entities at varying granularity. ANYSETCLUS is highly adaptive as it supports incremental model updates, segregates outliers, enables outlier-to-concept transition, and captures concept drift. The method also enables anytime offline clustering wherein it can generate multiple clusterings of varying granularity and purity depending upon the available time allowance for final clustering. The experimental results affirm the superior efficacy of the proposed method in handling variable and high-speed streams compared to the state-of-the-art method. The experimental results also showcase its effectiveness in achieving significantly higher micro-cluster purity for low and high-speed streams. This contrasts with the state-of-the-art method, which is unable to generate valid clustering results for high-speed streams. The experiments further validate the proposed method’s capability for anytime offline clustering.
AdQuestA: knowledge-guided visual question answer framework for advertisements
(IEEE, 2025) Goyal, Poonam
In the rapidly evolving landscape of digital marketing, effective customer engagement through advertisements is crucial for brands. Thus, computational understanding of ads is pivotal for recommendation, authoring, and customer behaviour simulation. Despite advancements in knowledge-guided visual-question-answering (VQA) models, existing frameworks often lack domain-specific responses and suffer from a dearth of benchmark datasets for advertisements. To address this gap, we introduce ADVQA, the first dataset for ad-related VQA sourced from Facebook and X (Twitter), which facilitates further research in ad comprehension. It comprises open-ended questions and detailed context obtained automatically from web articles. Moreover, we present AdQuestA, a novel multimodal framework for knowledge-guided open-ended question-answering tailored to advertisements. AdQuestA leverages Retrieval Augmented Generation (RAG) to obtain question-aware ad context as explicit knowledge and image-grounded implicit knowledge, effectively exploiting inherent relationships for reasoning. Extensive experiments corroborate its efficacy, yielding state-of-the-art performance on the AD-VQA dataset, even surpassing 10X larger models such as GPT-4 on this task. Our framework not only enhances understanding of ad content but also advances the broader landscape of knowledge-guided VQA models.
AnyFI: An Anytime Frequent Itemset Mining Algorithm for Data Streams
(IEEE, 2017) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Mining frequent itemsets from transactional data streams has been widely studied in the literature. The existing algorithms mine frequent itemsets within the stream's constrained environment of limited time and memory. However, none of them can handle varying inter-arrival rates of streams. Moreover, these algorithms cannot give mining results instantaneously, even with compromised accuracy if required, and then improve the accuracy as the time allowance increases. These two properties characterize an anytime algorithm. In this paper, we propose AnyFI, which is the first anytime frequent itemset mining algorithm for data streams. We also propose a novel data structure, BFI-forest, which is capable of handling transactions with varying inter-arrival rates. AnyFI maintains itemsets in BFI-forest in such a way that it can give a mining result almost immediately when the time allowance to mine is very small and can refine the results for better accuracy as the time allowance increases. Our experimental results show that AnyFI can handle high stream speeds up to 60,000 transactions per second (tps) with recall close to 100%.
AnySC: Anytime Set-wise Classification of Variable Speed Data Streams
(IEEE, 2018-12) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Classification of data streams has gained a lot of popularity in recent years owing to its multiple applications. In certain applications like community detection from text feeds, website fingerprinting attack, etc., it is more meaningful to associate class labels with groups of objects rather than the individual objects. This kind of classification problem is known as the set-wise classification problem. The few algorithms available in literature for this problem are budget algorithms, i.e. they are designed to process fixed maximum stream speed, and are not capable of handling variable and high speed streams. We present ANYSC which is the first anytime set-wise classification algorithm for data streams. ANYSC handles variable inter-arrival rate of objects in the stream and performs classification of test entities within any available time allowance, using a proposed data structure referred to as CProf-forest. The experimental results show that ANYSC brings in the features of an anytime algorithm and outperforms the existing approaches.
AnyStreamKM: Anytime k-medoids Clustering for Streaming Data
(IEEE, 2022) Challa, Jagat Sesh; Goyal, Navneet; Goyal, Poonam
Stream clustering algorithms have gained considerable importance in the recent past due to the rapidly rising use of IoT systems and applications. Anytime algorithms and frameworks play a key role in handling streams whose data arrives or is generated at variable rates. They can handle both slow and fast stream speeds while generating results with the highest possible accuracy. In this paper, we present AnyStreamKM, a framework for anytime k-medoids clustering of data streams. It uses a proposed hierarchical data indexing structure known as AnyKMTree that stores the incoming data from the stream in the form of a hierarchy of micro-clusters. AnyKMTree is an adaptation of the R-tree whose splitting strategy is inspired by the design principles of k-medoids clustering. AnyKMTree not only supports anytime features but is also capable of filtering out noise and outliers. Our experimental analysis establishes that AnyKMTree produces micro-clusters that are more compact and purer than those of the state-of-the-art methods. Also, when an offline k-medoids clustering algorithm such as PAM (Partitioning Around Medoids) is applied to the micro-clusters produced by AnyKMTree, the resulting clustering is of higher quality than that of the state-of-the-art methods.
Anytime clustering of data streams while handling noise and concept drift
(Taylor & Francis, 2021-03) Goyal, Poonam; Goyal, Navneet; Challa, Jagat Sesh
Clustering of data streams has become very popular in recent times, owing to the rapid rise of real-time streaming utilities that produce large amounts of data at varying inter-arrival rates. We propose AnyClus, a framework for anytime clustering of data streams. AnyClus uses a proposed variant of R-tree, AnyRTree, to capture the incoming stream objects arriving at a variable rate and to index them as a hierarchy of micro-clusters. The leaf-level micro-clusters produced are aggregated and stored in a logarithmic tilted-time window framework (TTWF). Our extensive experimental analysis shows (i) the capability of AnyClus in handling variable stream speeds (up to 250k objects/second); (ii) its ability to produce micro-clusters of high purity (≈1) and compactness; and (iii) the effectiveness of AnyRTree in handling noise, capturing concept drift, and preserving spatial locality in the indexing of micro-clusters, when compared to the existing methods. We also propose a parallel framework, Any-MP-Clus, for anytime clustering of multiport data streams over commodity clusters. Any-MP-Clus uses AnyRTree at each computing node of the cluster (for each stream-port) and maintains the aggregated micro-clusters in TTWF. The experimental results on billion-scale datasets show that Any-MP-Clus is scalable and efficient and produces clusterings of higher quality.
Anytime Frequent Itemset Mining of Transactional Data Streams
(Elsevier, 2020-09) Goyal, Poonam; Goyal, Navneet; Challa, Jagat Sesh
Mining frequent itemsets from transactional data streams has become essential, with many applications such as stock market analysis, retail chain analysis, web log analysis, etc. Various algorithms have been proposed to efficiently mine single-port and multi-port transactional streams within the constraints of limited time and memory. However, all of them are budget algorithms, i.e., they are not capable of handling varying inter-arrival rates of transactions and high-speed streams. They are constrained by a maximum limit on the inter-arrival rate of transactions, beyond which they fail to process the stream. Also, these algorithms are not capable of giving immediate mining results, even with compromised accuracy if required. These two properties characterize an anytime algorithm. We propose AnyFI, which is the first anytime frequent itemset mining algorithm for data streams. AnyFI uses a novel data structure, BFI-forest, which is capable of handling transactions arriving at a variable rate. It maintains itemsets in BFI-forest in such a way that it can give a mining result almost immediately when the time allowance to mine is very small and can refine its accuracy as the time allowance increases. We also propose MPAnyFI, which extends AnyFI into a parallel framework for anytime frequent itemset mining of multi-port data streams over commodity clusters. It uses AnyFI at each computing node of the cluster. Our extensive experimental analysis shows that AnyFI can handle high stream speeds close to 60,000 trans/sec with recall close to 100%. The results also show the efficiency of MPAnyFI.
An approach for search result topic identification and labeling
(ACM Digital Library, 2015-03) Goyal, Poonam
Organizing search results is one of the challenging tasks for search engines due to the varied and dynamic intentions behind queries. As a consequence, search engines are not able to understand the exact user context, and thus retrieve large volumes of results, most of which are irrelevant to the user. Search Result Clustering (SRC) is a technique which groups the search results and presents the various intentions of the query to users. In this work, we propose an approach that first identifies the associated topics and represents them in the form of concepts, then forms groups of documents by assigning each document to the appropriate topic, and finally provides suitable labels for these topics. Experimental results show that the proposed method produces encouraging results compared to the most popular non-commercial methods, Lingo and STC, on standard datasets such as ODP and Ambient.
Automatic parallelization of representative-based clustering algorithms for multicore cluster systems
(Springer, 2020-03) Goyal, Navneet; Goyal, Poonam
Ease of programming and optimal parallel performance have historically been on the opposite side of a trade-off, forcing the user to choose. With the advent of the Big Data era and the rapid evolution of sequential algorithms, the data analytics community can no longer afford the trade-off. We observed that several clustering algorithms often share common traits—particularly, algorithms belonging to the same class of clustering exhibit significant overlap in processing steps. Here, we present our observation on domain patterns in representative-based clustering algorithms and how they manifest as clearly identifiable programming patterns when mapped to a Domain Specific Language (DSL). We have integrated the signatures of these patterns in the DSL compiler for parallelism identification and automatic parallel code generation. The compiler either generates MPI C++ code for distributed memory parallel processing or MPI–OpenMP C++ code for hybrid memory parallel processing, depending upon the target architecture. Our experiments on different state-of-the-art parallelization frameworks show that our system can achieve near-optimal speedup while requiring a fraction of the programming effort, making it an ideal choice for the data analytics community. Results are presented for both distributed and hybrid memory systems.
bitsa_nlp@LT-EDI-ACL2022: Leveraging Pretrained Language Models for Detecting Homophobia and Transphobia in Social Media Comments
(2022) Goyal, Poonam
Online social networks are ubiquitous and user-friendly. Nevertheless, it is vital to detect and moderate offensive content to maintain decency and empathy. However, mining social media texts is a complex task since users don't adhere to any fixed patterns. Comments can be written in any combination of languages and many of them may be low-resource. In this paper, we present our system for the LT-EDI shared task on detecting homophobia and transphobia in social media comments. We experiment with a number of monolingual and multilingual transformer based models such as mBERT along with a data augmentation technique for tackling class imbalance. Such pretrained large models have recently shown tremendous success on a variety of benchmark tasks in natural language processing. We observe their performance on a carefully annotated, real life dataset of YouTube comments in English as well as Tamil. Our submission achieved ranks 9, 6 and 3 with a macro-averaged F1-score of 0.42, 0.64 and 0.58 in the English, Tamil and Tamil-English subtasks respectively. The code for the system has been open sourced.
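For readers unfamiliar with the metric reported in this abstract, the following is a minimal sketch of how a macro-averaged F1-score is commonly computed: the per-class F1 values are averaged with equal weight, regardless of class frequency. The function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the submission's code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # class p was predicted wrongly
            fn[t] += 1          # class t was missed
    scores = []
    for c in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels, purely illustrative
    print(macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0]))
```

Because every class contributes equally, a rare class that is poorly detected lowers the macro average as much as a frequent one, which is why the metric is often preferred under class imbalance.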
A blockchain and deep neural networks-based secure framework for enhanced crop protection
(Elsevier, 2021-08) Goyal, Navneet; Goyal, Poonam; Chamola, Vinay
The problem faced by one farmer can also be the problem of some other farmer in other regions. Providing information to farmers and connecting them has always been a challenge. Crowdsourcing and community building are considered as useful solutions to these challenges. However, privacy concerns and inactivity of users can make these models inefficient. To tackle these challenges, we present a cost-efficient and blockchain-based secure framework for building a community of farmers and crowdsourcing the data generated by them to help the farmers’ community. Apart from ensuring privacy and security of data, a revenue model is also incorporated to provide incentives to farmers. These incentives would act as a motivating factor for the farmers to willingly participate in the process. Through integration of a deep neural network-based model to our proposed framework, prediction of any abnormalities present within the crops and their predicted possible solutions would be much more coherent. The simulation results demonstrate that the prediction of plant pathology model is highly accurate.
Concept based query recommendation
(ACM Digital Library, 2011) Goyal, Poonam
For a search engine, the challenge of finding relevant information on the web is becoming more and more difficult with the rapid increase and change in web content. This difficulty increases further because queries submitted by users are general, imprecise, short and ambiguous. The relevance between a user's information need and the documents returned by a search engine largely depends on the query given by the user. In this paper, we propose a method that provides users with query recommendations, which are the concepts related to their information needs. In this work, we extract concepts from web snippets and propose two weight functions to measure the relevance between a query and concepts. Related concepts with different meanings are selected and recommended as query suggestions. To evaluate our method, we use a Google middleware for the extraction of concepts. We estimate the relevance between the query and concepts using the proposed weight functions and compare it with the support of the concepts as well as with the TFIDF approach, using the standard information-retrieval metrics of precision and Mean Average Precision (MAP). We show that our approach leads to gains in average precision over the other existing approach for different types of queries.
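The evaluation metric named in this abstract can be illustrated with a short sketch. It follows one standard definition of Mean Average Precision, in which the average precision of a ranked list is the mean of the precision values at each relevant hit, divided by the number of relevant items. The function names and toy rankings are illustrative assumptions, not the paper's evaluation code.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank   # precision at this cut-off
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Toy rankings, purely illustrative
    runs = [
        (["a", "b", "c", "d"], {"a", "c"}),
        (["x", "y"], {"y"}),
    ]
    print(mean_average_precision(runs))
```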
A concurrent k-NN search algorithm for R-tree
(ACM Digital Library, 2015-10) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
k-nearest neighbor (k-NN) search is one of the most commonly used queries in database systems. It has applications in various domains like data mining, decision support systems, information retrieval, multimedia and spatial databases, etc. When k-NN search is performed over large data sets, spatial data indexing structures such as R-trees are commonly used to improve query efficiency. The best-first k-NN (BF-kNN) algorithm is the fastest known k-NN algorithm over R-trees. We present CBF-kNN, a concurrent BF-kNN for R-trees, which is the first concurrent version of k-NN for R-trees that we know of. CBF-kNN uses mound, one of the most efficient known concurrent priority queues. CBF-kNN overcomes the concurrency limitations of priority queues by using a tree-parallel mode of execution. CBF-kNN has an estimated speedup of O(p/k) for p threads. Experimental results on various real datasets show that the speedup in practice is close to this estimate.
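As background for this abstract, the sketch below shows the general idea of sequential best-first k-NN search over a tree of bounding boxes: a single priority queue holds both nodes (keyed by the minimum distance from the query to their bounding box) and points (keyed by their actual distance), so points are popped in increasing distance order. It is not CBF-kNN; there is no concurrency and no mound. The `Node` class is a hypothetical stand-in for an R-tree node and is built by hand rather than by R-tree insertion.

```python
import heapq
import itertools
import math


class Node:
    """Hypothetical R-tree-like node: leaves hold points, inner nodes hold children."""

    def __init__(self, points=None, children=None):
        self.points = points or []
        self.children = children or []
        # Bounding box corners come from the points or from the child boxes
        pts = self.points or [c for ch in self.children for c in (ch.lo, ch.hi)]
        dims = range(len(pts[0]))
        self.lo = tuple(min(p[d] for p in pts) for d in dims)
        self.hi = tuple(max(p[d] for p in pts) for d in dims)


def mindist(q, lo, hi):
    """Smallest possible Euclidean distance from q to the box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def bf_knn(root, q, k):
    """Return up to k (distance, point) pairs, nearest first."""
    counter = itertools.count()  # tie-breaker so the heap never compares Node objects
    heap = [(mindist(q, root.lo, root.hi), next(counter), root)]
    result = []
    while heap and len(result) < k:
        dist, _, item = heapq.heappop(heap)
        if isinstance(item, Node):
            # Expand the node: enqueue its points and child boxes
            for p in item.points:
                heapq.heappush(heap, (math.dist(q, p), next(counter), p))
            for child in item.children:
                heapq.heappush(heap, (mindist(q, child.lo, child.hi), next(counter), child))
        else:
            # A point at the head of the queue is closer than anything unexplored
            result.append((dist, item))
    return result


if __name__ == "__main__":
    leaves = [Node(points=[(0, 0), (1, 1)]),
              Node(points=[(5, 5), (6, 5)]),
              Node(points=[(9, 9), (10, 8)])]
    root = Node(children=leaves)
    print(bf_knn(root, (5.2, 5.0), 3))
```

The shared priority queue is the point of contention that a concurrent variant has to address, which is the limitation the abstract refers to.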
CranGAN: Adversarial Point Cloud Reconstruction for patient-specific Cranial Implant Design
(IEEE, 2022) Goyal, Poonam
Automatizing cranial implant design has become an increasingly important avenue in biomedical research. Benefits in terms of financial resources, time and patient safety necessitate the formulation of an efficient and accurate procedure for the same. This paper attempts to provide a new research direction to this problem, through an adversarial deep learning solution. Specifically, in this work, we present CranGAN - a 3D Conditional Generative Adversarial Network designed to reconstruct a 3D representation of a complete skull given its defective counterpart. A novel solution of employing point cloud representations instead of conventional 3D meshes and voxel grids is proposed. We provide both qualitative and quantitative analysis of our experiments with three separate GAN objectives, and compare the utility of two 3D reconstruction loss functions viz. Hausdorff Distance and Chamfer Distance. We hope that our work inspires further research in this direction. Clinical relevance— This paper establishes a new research direction to assist in automated implant design for cranioplasty.
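The two reconstruction losses compared in this abstract can be illustrated on small point clouds. The sketch below uses one common convention for each: Chamfer Distance as the sum, over both directions, of the mean squared nearest-neighbour distance, and Hausdorff Distance as the largest nearest-neighbour distance in either direction. The exact variants used in the paper are not stated in the abstract, so these definitions and the function names are assumptions; the brute-force nearest-neighbour search is for clarity only.

```python
import math


def _nearest(p, cloud):
    """Distance from point p to its nearest neighbour in cloud (brute force)."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance, both directions."""
    ab = sum(_nearest(p, b) ** 2 for p in a) / len(a)
    ba = sum(_nearest(q, a) ** 2 for q in b) / len(b)
    return ab + ba


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: worst-case nearest-neighbour distance."""
    ab = max(_nearest(p, b) for p in a)
    ba = max(_nearest(q, a) for q in b)
    return max(ab, ba)


if __name__ == "__main__":
    # Toy 3D point clouds, purely illustrative
    a = [(0, 0, 0), (1, 0, 0)]
    b = [(0, 0, 0), (3, 0, 0)]
    print(chamfer_distance(a, b), hausdorff_distance(a, b))
```

Chamfer Distance averages over all points and is therefore less sensitive to a single outlier, whereas Hausdorff Distance is determined entirely by the worst-matched point.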
DD-Rtree: A dynamic distributed data structure for efficient data distribution among cluster nodes for spatial data mining algorithms
(IEEE, 2016) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Parallelizing data mining algorithms has become a necessity as we try to mine ever-increasing volumes of data. Spatial data mining algorithms like DBSCAN, OPTICS, SLINK, etc. have been parallelized to exploit a cluster infrastructure. The efficiency achieved by existing algorithms can be attributed to spatial locality preservation using spatial indexing structures like k-d-tree, quad-tree, grid files, etc. for distributing data among cluster nodes. However, these indexing structures are static in nature, i.e., they need to scan the entire dataset to determine the partitioning coordinates. This results in high data distribution cost when the data size is large. In this paper, we propose a dynamic distributed data structure, DD-Rtree, which preserves spatial locality while distributing data across compute nodes in a shared-nothing environment. Moreover, DD-Rtree is dynamic, i.e., it can be constructed incrementally, making it useful for handling big data. We compare the quality of data distribution achieved by DD-Rtree with that of one of the recent distributed indexing structures, SD-Rtree. We also compare the efficiency of queries supported by these indexing structures along with the overall efficiency of the DBSCAN algorithm. Our experimental results show that DD-Rtree achieves better data distribution, thereby resulting in improved overall efficiency.
Designing self-adaptive websites using online hotlink assignment algorithm
(ACM Digital Library, 2009-12) Goyal, Navneet; Goyal, Poonam
An online hotlink assignment algorithm is proposed for designing adaptive websites. The objective is to reach desired pages on a website in the minimum number of clicks, thereby reducing the load on the web server. As a consequence, the traffic on the internet is also reduced. The hotlinks are assigned based on the frequency of access of pages. We model a website as a single-source directed graph. The optimal hotlink assignment problem is NP-hard for general graphs. The website graph is reduced to a Breadth First Search (BFS) tree which maintains the semantic relationships between web pages. The proposed online algorithm can place at most k hotlinks per page with a maximum of l hotlinks on the entire website, where k ≪ l. The input stream is simulated using the Zipf distribution. The results presented in the paper compare the performance of the online algorithm with the optimal offline algorithm.
A Domain Specific Language for Clustering
(Springer, 2016-11) Goyal, Navneet; Goyal, Poonam
Clustering of large volumes of data is a complex problem which requires the use of sophisticated algorithms as well as High Performance Computing hardware like a cluster of computers. It is highly desirable that data mining experts have a solution which on one hand provides a simple interface for expressing their algorithms in terms of domain-specific idioms and on the other hand automatically generates parallel code that can run on a cluster of multicore nodes. The proposed Domain Specific Language (DSL), along with its parallelizing compiler, attempts to provide such a solution. In this paper, we give the design of the DSL, called DWARF. Various language constructs are described along with the rationale behind their inclusion in the language. The abstraction provided by DWARF is qualitatively compared with that of MapReduce, Spark, and other MPI-based implementations to establish the usefulness of the proposed clustering DSL.
An Efficient Density Based Incremental Clustering Algorithm in Data Warehousing Environment
(IPCSIT, 2009) Goyal, Navneet; Goyal, Poonam
Data warehouses are a good source of data for downstream data mining applications. New data arrives in data warehouses during the periodic refresh cycles. Appending new data to existing data requires that all patterns discovered earlier using various data mining algorithms be updated with each refresh. In this paper, we present an incremental density-based clustering algorithm. Incremental DBSCAN is an existing incremental algorithm in which data can be added to or deleted from existing clusters, one point at a time. Our algorithm is capable of adding points in bulk to an existing set of clusters. In this new algorithm, the data points to be added are first clustered using the DBSCAN algorithm, and then these new clusters are merged with the existing clusters to come up with the modified set of clusters. That is, we add the clusters incrementally rather than adding points incrementally. It is found that the proposed incremental clustering algorithm produces the same clusters as obtained by Incremental DBSCAN. We have used R*-trees as the data structure to hold the multidimensional data that we need to cluster. One of the major advantages of the proposed approach is that it allows us to see the clustering patterns of the new data along with the existing clustering patterns. Moreover, we can see the merged clusters as well. The proposed algorithm is capable of considerable savings, in terms of region queries performed, as compared to Incremental DBSCAN. Results are presented to support the claim.
An efficient method for batch updates in OPTICS cluster ordering
(Inder Science, 2018) Goyal, Poonam; Goyal, Navneet
DBSCAN is one of the popular density-based clustering algorithms, but requires re-clustering the entire data when the input parameters are changed. OPTICS overcomes this limitation. In this paper, we propose a batch-wise incremental OPTICS algorithm which performs efficient insertion and deletion of a batch of points in a hierarchical cluster ordering, which is the output of OPTICS. Only a couple of algorithms are available in the literature on incremental versions of OPTICS. This can be attributed to the sequential access patterns of OPTICS. The existing incremental algorithms address the problem of incrementally updating the hierarchical cluster ordering for point-wise insertion/deletion, but these algorithms are only good for infrequent updates. The proposed incremental OPTICS algorithm performs batch-wise insertions/deletions and is suitable for frequent updates. It produces exactly the same hierarchical cluster ordering as that of classical OPTICS. Real datasets have been used for experimental evaluation of the proposed algorithm and results show remarkable performance improvement over the classical and other existing incremental OPTICS algorithms.
Efficient Representation Learning of Satellite Image Time Series and Their Fusion for Spatiotemporal Applications
(Association for the Advancement of Artificial Intelligence, 2024) Goyal, Poonam; Goyal, Navneet
Satellite data bolstered by their increasing accessibility is leading to many endeavors of automated monitoring of the earth's surface for various applications. Such applications demand high spatial resolution images at a temporal resolution of a few days which entails the challenge of processing a huge volume of image time series data. To overcome this computing bottleneck, we present PatchNet, a bespoke adaptation of beam search and attention mechanism. PatchNet is an automated patch selection neural network that requires only a partial spatial traversal of an image time series and yet achieves impressive results. Satellite systems face a trade-off between spatial and temporal resolutions due to budget/technical constraints e.g., Landsat-8/9 or Sentinel-2 have high spatial resolution whereas, MODIS has high temporal resolution. To deal with the limitation of coarse temporal resolution, we propose FuSITSNet, a twofold feature-based generic fusion model with multimodal learning in a contrastive setting. It produces a learned representation after fusion of two satellite image time series leveraging finer spatial resolution of Landsat and finer temporal resolution of MODIS. The patch alignment module of FuSITSNet aligns the PatchNet processed patches of Landsat-8 with the corresponding MODIS regions to incorporate its finer resolution temporal features. The untraversed patches are handled by the cross-modality attention which highlights additional hot spot features from the two modalities. We conduct extensive experiments on more than 2000 counties of US for crop yield, snow cover, and solar energy prediction and show that even one-fourth spatial processing of image time series produces state-of-the-art results. FuSITSNet outperforms the predictions of single modality and data obtained using existing generative fusion models and allows for monitoring of dynamic phenomena using freely accessible images, thereby unlocking new opportunities.