Browsing by Author "Goyal, Navneet"

An Adaptive Hierarchical Method for Anytime Set-wise Clustering of Variable and High-Speed Data Streams
(IEEE, 2023) Challa, Jagat Sesh; Goyal, Poonam; Goyal, Navneet
Set-wise Clustering is a clustering technique for data streams that groups sets of objects based on distribution patterns, applicable in contexts like retail chain clustering, text-based community clustering, restaurant categorization, etc. The existing set-wise clustering method cannot handle variable and high-speed streams with reasonable accuracy. This paper presents an Anytime Set-wise Clustering method for data streams known as ANYSETCLUS. The method handles the variable inter-arrival rates of stream objects using a proposed indexing structure called AnySetClusTree, which stores a hierarchy of micro-clusters of multi-set entities at varying granularity. ANYSETCLUS is highly adaptive as it supports incremental model updates, segregates outliers, enables outlier-to-concept transition, and captures concept drift. The method also enables anytime offline clustering wherein it can generate multiple clusterings of varying granularity and purity depending upon the available time allowance for final clustering. The experimental results affirm the superior efficacy of the proposed method in handling variable and high-speed streams compared to the state-of-the-art method. The experimental results also showcase its effectiveness in achieving significantly higher micro-cluster purity for low and high-speed streams. This contrasts with the state-of-the-art method, which is unable to generate valid clustering results for high-speed streams. The experiments further validate the proposed method’s capability for anytime offline clustering.
Android Web Security Solution using Cross-device Federated Learning
(IEEE, 2022) Goyal, Navneet
Over the last decade or so, Machine Learning has changed the global technology landscape with applications in almost all disciplines and verticals. Mobile and Web Security is an important research area in which researchers have been trying to apply Machine Learning, but data privacy concerns and the high cost of communicating data to a central Machine Learning server have limited its use. Federated Learning is emerging as a promising solution that addresses privacy concerns and drastically reduces communication costs. In Federated Learning, data from individual devices is not communicated to a central server, and model learning happens in a distributed manner. In this paper, we propose a Federated Learning solution for the security of Android-based devices. Mobile and Web Security solutions have evolved from signature-based detection to Machine Learning models trained over large centralized malware repositories. We have used Federated Learning to learn security patterns from users' browsing data, which resides on individual devices and never leaves them. Federated Learning preserves users' privacy as it shares with the central server only the model that it learns from users' browsing data, and not the data itself. This way, each mobile platform trains its own web security model from its data and shares it with the centralized server. The centralized server aggregates the trained models received from numerous mobile devices into a global model, which in turn is sent to the mobile devices for inference. Mobile security solutions based on this concept create a sustained, self-evolving security ecosystem in which millions of mobile platforms share their learned models to form a robust distributed security paradigm. The results obtained using Federated Learning are found to be comparable with the results of centralized Machine Learning.
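The server-side aggregation step described in this abstract can be illustrated with a minimal sketch. The abstract does not state which aggregation rule the authors use; the code below assumes a FedAvg-style average weighted by each device's number of local samples, and the function name `federated_average` and the toy weight vectors are hypothetical.

```python
from typing import List


def federated_average(client_weights: List[List[float]],
                      client_sizes: List[int]) -> List[float]:
    """Combine per-device model weights into one global weight vector.

    Assumption (not from the abstract): each device's contribution is
    weighted by the number of local samples it trained on.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            global_weights[i] += w * size / total
    return global_weights


# Two hypothetical devices; only their weights reach the server, never raw data.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [1, 3]
global_model = federated_average(clients, sizes)
print(global_model)
```

In this sketch the second device trained on three times as much data, so the global model sits closer to its weights.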
AnyFI: An Anytime Frequent Itemset Mining Algorithm for Data Streams
(IEEE, 2017) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Mining frequent itemsets from transactional data streams has been extensively studied in the literature. The existing algorithms mine frequent itemsets within the stream's constrained environment of limited time and memory. However, none of them is capable of handling varying inter-arrival rates of streams. Moreover, these algorithms cannot give mining results instantaneously, even with compromised accuracy if required, and then improve the accuracy as the time allowance increases. These two properties characterize an anytime algorithm. In this paper, we propose AnyFI, which is the first anytime frequent itemset mining algorithm for data streams. We also propose a novel data structure, BFI-forest, which is capable of handling transactions with varying inter-arrival rates. AnyFI maintains itemsets in BFI-forest in such a way that it can give a mining result almost immediately when the time allowance to mine is very small and can refine the results for better accuracy as the time allowance increases. Our experimental results show that AnyFI can handle high stream speeds of up to 60,000 transactions per second (tps) with recall close to 100%.
AnySC: Anytime Set-wise Classification of Variable Speed Data Streams
(IEEE, 2018-12) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Classification of data streams has gained a lot of popularity in recent years owing to its multiple applications. In certain applications, such as community detection from text feeds and website fingerprinting attacks, it is more meaningful to associate class labels with groups of objects rather than with individual objects. This kind of classification problem is known as the set-wise classification problem. The few algorithms available in the literature for this problem are budget algorithms, i.e., they are designed to process a fixed maximum stream speed and are not capable of handling variable and high-speed streams. We present ANYSC, which is the first anytime set-wise classification algorithm for data streams. ANYSC handles the variable inter-arrival rate of objects in the stream and performs classification of test entities within any available time allowance, using a proposed data structure referred to as CProf-forest. The experimental results show that ANYSC brings in the features of an anytime algorithm and outperforms the existing approaches.
AnyStreamKM: Anytime k-medoids Clustering for Streaming Data
(IEEE, 2022) Challa, Jagat Sesh; Goyal, Navneet; Goyal, Poonam
Stream Clustering algorithms have gained a lot of importance in the recent past due to the rapidly rising utility of IoT systems and applications. Anytime algorithms and frameworks play a key role in handling streams whose data arrives or is generated at variable rates. They are capable of handling both slow and fast stream speeds while generating the result with the highest possible accuracy. In this paper, we present AnyStreamKM, which is a framework for anytime k-medoids clustering of data streams. It uses a proposed hierarchical data indexing structure known as AnyKMTree that stores the incoming data from the stream in the form of a hierarchy of micro-clusters. AnyKMTree is an adaptation of the R-tree with a splitting strategy inspired by the design principles of k-medoids clustering. AnyKMTree not only supports anytime features but is also capable of filtering out noise and outliers. Our experimental analysis establishes that AnyKMTree produces micro-clusters that are more compact and purer than those of the state-of-the-art methods. Also, when offline k-medoids clustering such as PAM (Partitioning Around Medoids) is applied to the micro-clusters produced by AnyKMTree, the resultant clustering has been found to be of higher quality than that of the state-of-the-art methods.
Anytime clustering of data streams while handling noise and concept drift
(Taylor & Francis, 2021-03) Goyal, Poonam; Goyal, Navneet; Challa, Jagat Sesh
Clustering of data streams has become very popular in recent times, owing to the rapid rise of real-time streaming utilities that produce large amounts of data at varying inter-arrival rates. We propose AnyClus, a framework for anytime clustering of data streams. AnyClus uses a proposed variant of R-tree, AnyRTree, to capture the incoming stream objects arriving at a variable rate and to index them as micro-clusters in a hierarchical fashion. The leaf-level micro-clusters produced are aggregated and stored in a logarithmic tilted-time window framework (TTWF). Our extensive experimental analysis shows (i) the capability of AnyClus in handling variable stream speeds (up to 250k objects/second); (ii) its ability to produce micro-clusters of high purity (≈1) and compactness; (iii) the effectiveness of AnyRTree in handling noise, capturing concept drift, and preserving spatial locality in the indexing of micro-clusters, when compared to the existing methods. We also propose a parallel framework, Any-MP-Clus, for anytime clustering of multi-port data streams over commodity clusters. Any-MP-Clus uses AnyRTree at each computing node of the cluster (for each stream-port) and maintains the aggregated micro-clusters in TTWF. The experimental results on billion-scale datasets show that Any-MP-Clus is scalable and efficient, and produces clustering of higher quality.
Anytime Frequent Itemset Mining of Transactional Data Streams
(Elsevier, 2020-09) Goyal, Poonam; Goyal, Navneet; Challa, Jagat Sesh
Mining frequent itemsets from transactional data streams has become essential in today's world, with many applications such as stock market analysis, retail chain analysis, web log analysis, etc. Various algorithms have been proposed to efficiently mine single-port and multi-port transactional streams within the constraints of limited time and memory. However, all of them are budget algorithms, i.e., they are not capable of handling varying inter-arrival rates of transactions and high-speed streams. They are constrained by a maximum limit on the inter-arrival rate of transactions, beyond which they fail to process. Also, these algorithms are not capable of giving immediate mining results, even with compromised accuracy if required. The above two properties characterize an anytime algorithm. We propose AnyFI, which is the first anytime frequent itemset mining algorithm for data streams. AnyFI uses a novel data structure, BFI-forest, which is capable of handling transactions arriving at a variable rate. It maintains itemsets in BFI-forest in such a way that it can give a mining result almost immediately when the time allowance to mine is very small and can refine its accuracy as the time allowance increases. We also propose MPAnyFI, which extends AnyFI into a parallel framework for anytime frequent itemset mining of multi-port data streams over commodity clusters. It uses AnyFI at each computing node of the cluster. Our extensive experimental analysis shows that AnyFI can handle high stream speeds close to 60,000 trans/sec with recall close to 100%. The results also show the efficiency of MPAnyFI.
Automatic parallelization of representative-based clustering algorithms for multicore cluster systems
(Springer, 2020-03) Goyal, Navneet; Goyal, Poonam
Ease of programming and optimal parallel performance have historically been on the opposite side of a trade-off, forcing the user to choose. With the advent of the Big Data era and the rapid evolution of sequential algorithms, the data analytics community can no longer afford the trade-off. We observed that several clustering algorithms often share common traits—particularly, algorithms belonging to the same class of clustering exhibit significant overlap in processing steps. Here, we present our observation on domain patterns in representative-based clustering algorithms and how they manifest as clearly identifiable programming patterns when mapped to a Domain Specific Language (DSL). We have integrated the signatures of these patterns in the DSL compiler for parallelism identification and automatic parallel code generation. The compiler either generates MPI C++ code for distributed memory parallel processing or MPI–OpenMP C++ code for hybrid memory parallel processing, depending upon the target architecture. Our experiments on different state-of-the-art parallelization frameworks show that our system can achieve near-optimal speedup while requiring a fraction of the programming effort, making it an ideal choice for the data analytics community. Results are presented for both distributed and hybrid memory systems.
Big Data and Artificial Intelligence
(Springer, 2023) Goyal, Navneet
This book constitutes the proceedings of the 11th International Conference on Big Data and Artificial Intelligence, BDA 2023, held in Delhi, India, during December 7–9, 2023. The 17 full papers presented in this volume were carefully reviewed and selected from 67 submissions. The papers are organized in the following topical sections: Keynote Lectures, Artificial Intelligence in Healthcare, Large Language Models, Data Analytics for Low Resource Domains, Artificial Intelligence for Innovative Applications, and Potpourri.
Big social data provenance framework for Zero-Information Loss Key-Value Pair (KVP) Database
(Springer, 2021-11) Goyal, Navneet
Social media has been playing a vital role in information sharing at massive scale due to its easy access, low cost, and faster dissemination of information. Its ability to disseminate information across a wide audience has raised the critical challenge of determining the social data provenance of digital content. Social Data Provenance describes the origin, derivation process, and transformations of social content throughout its lifecycle. In this paper, we present a Big Social Data Provenance (BSDP) Framework for key-value pair (KVP) databases using the novel concept of a Zero-Information Loss Database (ZILD). In our proposed framework, a huge volume of social data is first fetched from social media (Twitter’s network) through live streaming and simultaneously modelled in a KVP database using a query-driven approach. The proposed framework is capable of capturing, storing, and querying provenance information for different query sets, including select, aggregate, standing/historical, and data update (i.e., insert, delete, update) queries on Big Social Data. We evaluate the performance of the proposed framework in terms of provenance capturing overhead for different query sets, including select, aggregate, and data update queries, and average execution time for various provenance queries.
A blockchain and deep neural networks-based secure framework for enhanced crop protection
(Elsevier, 2021-08) Goyal, Navneet; Goyal, Poonam; Chamola, Vinay
The problem faced by one farmer can also be the problem of some other farmer in another region. Providing information to farmers and connecting them has always been a challenge. Crowdsourcing and community building are considered useful solutions to these challenges. However, privacy concerns and inactivity of users can make these models inefficient. To tackle these challenges, we present a cost-efficient, blockchain-based secure framework for building a community of farmers and crowdsourcing the data generated by them to help the farmers’ community. Apart from ensuring privacy and security of data, a revenue model is also incorporated to provide incentives to farmers. These incentives would act as a motivating factor for the farmers to willingly participate in the process. Through integration of a deep neural network-based model into our proposed framework, prediction of any abnormalities present within the crops and of their possible solutions would be much more coherent. The simulation results demonstrate that the prediction of the plant pathology model is highly accurate.
A Comparison of Machine Learning Attributes for Detecting Malicious Websites
(IEEE, 2019-01) Goyal, Navneet
The number of Malicious Websites has increased manifold in the past few years. As of the start of 2018, 1 in every 13 URLs was malicious, amounting to 7.8% of URLs identified as malicious [1]. These figures have increased by 2.8%, thereby showing an increasing trend of attack vectors through Malicious Websites. These statistics clearly highlight the need to detect Malicious Websites on the Internet. Many research works have suggested Machine Learning techniques to detect Malicious Websites. Research has also been done to compare Machine Learning algorithms for their detection. However, the aspect of attribute selection for detecting Malicious Websites using Machine Learning has not been explored in detail. In Machine Learning techniques, attribute selection outweighs the importance of any other aspect in the process. Thus, there is a need to compare and analyze the various attributes that can help find Malicious Websites faster and better. This paper addresses this research gap, so that fewer, optimal attributes can do a better job
A concurrent k-NN search algorithm for R-tree
(ACM Digital Library, 2015-10) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
k-nearest neighbor (k-NN) search is one of the most commonly used queries in database systems. It has applications in various domains such as data mining, decision support systems, information retrieval, multimedia and spatial databases, etc. When k-NN search is performed over large data sets, spatial data indexing structures such as R-trees are commonly used to improve query efficiency. The best-first k-NN (BF-kNN) algorithm is the fastest known k-NN algorithm over R-trees. We present CBF-kNN, a concurrent BF-kNN for R-trees, which is the first concurrent version of k-NN we know of for R-trees. CBF-kNN uses one of the most efficient concurrent priority queues, known as mound. CBF-kNN overcomes the concurrency limitations of priority queues by using a tree-parallel mode of execution. CBF-kNN has an estimated speedup of O(p/k) for p threads. Experimental results on various real datasets show that the speedup in practice is close to this estimate.
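For readers unfamiliar with the sequential baseline, the best-first idea can be sketched as follows. This is an illustrative, single-threaded sketch only: it is not CBF-kNN, it uses Python's `heapq` rather than a mound, and the dictionary-based tree (`leaf`, `node`) is a hypothetical stand-in for a real R-tree. Entries are visited in order of their minimum possible distance to the query point, so the first k points popped are the k nearest.

```python
import heapq
from itertools import count


def min_dist(rect, q):
    """Squared minimum distance from point q to rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - q[0], 0, q[0] - xmax)
    dy = max(ymin - q[1], 0, q[1] - ymax)
    return dx * dx + dy * dy


def leaf(x, y):
    """A data point, stored as a degenerate rectangle."""
    return {"rect": (x, y, x, y), "point": (x, y)}


def node(children):
    """An internal node whose rectangle bounds all of its children."""
    rects = [c["rect"] for c in children]
    rect = (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))
    return {"rect": rect, "children": children}


def best_first_knn(root, q, k):
    """Return the k points nearest to q, nearest first."""
    tie = count()  # tie-breaker so the heap never compares dictionaries
    heap = [(min_dist(root["rect"], q), next(tie), root)]
    result = []
    while heap and len(result) < k:
        _, _, entry = heapq.heappop(heap)
        if "point" in entry:
            result.append(entry["point"])
        else:
            for child in entry["children"]:
                heapq.heappush(heap, (min_dist(child["rect"], q), next(tie), child))
    return result


tree = node([node([leaf(0, 0), leaf(1, 1)]), node([leaf(5, 5), leaf(6, 7)])])
nearest = best_first_knn(tree, (4, 4), 2)
print(nearest)
```

The single shared priority queue in this sketch is the contention point that a concurrent version has to work around.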
DD-Rtree: A dynamic distributed data structure for efficient data distribution among cluster nodes for spatial data mining algorithms
(IEEE, 2016) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Parallelizing data mining algorithms has become a necessity as we try to mine ever-increasing volumes of data. Spatial data mining algorithms like DBSCAN, OPTICS, SLINK, etc. have been parallelized to exploit a cluster infrastructure. The efficiency achieved by existing algorithms can be attributed to spatial locality preservation using spatial indexing structures like k-d-tree, quad-tree, grid files, etc. for distributing data among cluster nodes. However, these indexing structures are static in nature, i.e., they need to scan the entire dataset to determine the partitioning coordinates. This results in a high data distribution cost when the data size is large. In this paper, we propose a dynamic distributed data structure, DD-Rtree, which preserves spatial locality while distributing data across compute nodes in a shared-nothing environment. Moreover, DD-Rtree is dynamic, i.e., it can be constructed incrementally, making it useful for handling big data. We compare the quality of data distribution achieved by DD-Rtree with that of a recent distributed indexing structure, SD-Rtree. We also compare the efficiency of queries supported by these indexing structures, along with the overall efficiency of the DBSCAN algorithm. Our experimental results show that DD-Rtree achieves better data distribution, thereby improving overall efficiency.
Designing self-adaptive websites using online hotlink assignment algorithm
(ACM Digital Library, 2009-12) Goyal, Navneet; Goyal, Poonam
An online hotlink assignment algorithm is proposed for designing adaptive websites. The objective is to reach desired pages on a website in the minimum number of clicks, thereby reducing the load on the web server. As a consequence, the traffic on the internet is also reduced. The hotlinks are assigned based on the frequency of access of pages. We model a website as a single-source directed graph. The optimal hotlink assignment problem is NP-hard for general graphs. The website graph is reduced to a Breadth First Search (BFS) tree, which maintains the semantic relationships between web pages. The proposed online algorithm can place at most k hotlinks per page, with a maximum of l hotlinks on the entire website, where k ≪ l. The input stream is simulated using the Zipf distribution. The results presented in the paper compare the performance of the online algorithm with the optimal offline algorithm.
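The quantity being minimized can be illustrated with a small sketch. It is not the paper's online algorithm: it only computes BFS click depths from the home page, assigns Zipf-like access probabilities by popularity rank, and shows how a single hotlink changes the expected number of clicks. The site graph, the page names, and the choice of hotlink are all hypothetical.

```python
from collections import deque


def bfs_depths(graph, source):
    """Number of clicks needed to reach each page from the source page."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt not in depth:
                depth[nxt] = depth[page] + 1
                queue.append(nxt)
    return depth


def zipf_probabilities(pages_by_rank):
    """Access probability proportional to 1/rank (Zipf with exponent 1)."""
    weights = [1.0 / rank for rank in range(1, len(pages_by_rank) + 1)]
    total = sum(weights)
    return {p: w / total for p, w in zip(pages_by_rank, weights)}


def expected_clicks(depth, prob):
    """Average clicks per request under the given access probabilities."""
    return sum(prob[p] * depth[p] for p in prob)


# Hypothetical site: the most popular page "d" is buried three clicks deep.
site = {"home": ["a", "b"], "a": ["c"], "c": ["d"]}
prob = zipf_probabilities(["d", "c", "a", "b"])

before = expected_clicks(bfs_depths(site, "home"), prob)

# Add one hotlink from the home page straight to the most popular page.
site_with_hotlink = dict(site, home=["a", "b", "d"])
after = expected_clicks(bfs_depths(site_with_hotlink, "home"), prob)

print(round(before, 2), round(after, 2))
```

Under these assumptions the expected number of clicks drops once the popular page is linked from the home page, which is the effect a hotlink assignment aims for.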
Detection of Malicious Webpages Using Deep Learning
(IEEE, 2021) Goyal, Navneet
Malicious Webpages have been a serious threat on the Internet for the past few years. As per the latest Google Transparency reports, they continue to rank among the top online threats. Various techniques have been used to date to identify malicious sites, including Static Heuristics, Honey Clients, Machine Learning, etc. Recently, with the rapid rise of Deep Learning, interest has arisen in exploring Deep Learning techniques for detecting Malicious Webpages. In this paper, Deep Learning has been utilized for such classification. The model proposed in this research uses a Deep Neural Network (DNN) with two hidden layers to distinguish between Malicious and Benign Webpages. This DNN model gave a high accuracy of 99.81% with very low False Positives (FP) and False Negatives (FN), and with near real-time response on test samples. The model outperformed earlier machine learning solutions in accuracy, precision, recall, and time performance metrics.
Digital image analysis of gas bypassing and mixing in gas-fluidized bed: effect of particle shape
(Wiley, 2024-10) Mohanta, Hare Krishna; Goyal, Navneet; Sande, Priya Christina; Sharma, Arvind Kumar
The study investigates the effect of particle shape on gas bypassing and mixing of gas-fluidized Geldart A particles. A shallow fluidized bed (FB), configured at bench scale, was used with digital image analysis (DIA) for the investigation. The extent of scatter of tracer particles throughout the bed was assessed from DIA images of defluidized powder. A novel method employing Jupyter notebook software was used to directly determine the Mixing Index from digital images. Remarkably, platelet-shaped China clay powder displayed the best mixing characteristics (Mixing Index: 0.79) with no significant bypassing. Angular-shaped Quartz displayed moderate mixing (Mixing Index: 0.67), but high bypassing (Bypassing Index: 0.75). Contrary to conventional assumptions, spherical-shaped diatomite exhibited poor mixing (Mixing Index: 0.61) with the highest bypassing (Bypassing Index: 0.82). Platelet particles performed well even with fines removal. Most likely, particle shape significantly influenced the number of available particle contact points, tracer migration, and tracer-on-particle binding.
A Domain Specific Language for Clustering
(Springer, 2016-11) Goyal, Navneet; Goyal, Poonam
Clustering of large volumes of data is a complex problem which requires the use of sophisticated algorithms as well as High Performance Computing hardware like a cluster of computers. It is highly desirable that data mining experts have a solution which on one hand provides a simple interface for expressing their algorithms in terms of domain specific idioms and on the other hand automatically generates parallel code that can run on a cluster of multicore nodes. The proposed Domain Specific Language (DSL), along with its parallelizing compiler, attempts to provide such a solution. In this paper, we give the design of the DSL, called DWARF. Various language constructs are described along with the rationale behind their inclusion in the language. The abstraction provided by DWARF is qualitatively compared with MapReduce, Spark, and other MPI-based implementations to establish the usefulness of the proposed clustering DSL.
Effect Of Transverse Shear And Rotatory Inertia On The Forced Motion Of A Plate-Strip Of Linearly Varying Thickness
(Elsevier, 1994-07) Goyal, Navneet
Shear theory and the eigenfunction method are used to analyze the forced motion of a plate-strip of linearly varying thickness. A plate clamped at both edges and a cantilever plate subjected to uniformly distributed and concentrated impulsive loads are analyzed as example problems. Numerical results computed for transverse deflection and bending moment of the plate are compared with those of classical theory.
An Efficient Density Based Incremental Clustering Algorithm in Data Warehousing Environment
(IPCSIT, 2009) Goyal, Navneet; Goyal, Poonam
Data Warehouses are a good source of data for downstream data mining applications. New data arrives in data warehouses during the periodic refresh cycles. Appending new data to existing data requires that all patterns discovered earlier using various data mining algorithms be updated with each refresh. In this paper, we present an incremental density-based clustering algorithm. Incremental DBSCAN is an existing incremental algorithm in which data can be added to or deleted from existing clusters, one point at a time. Our algorithm is capable of adding points in bulk to an existing set of clusters. In this new algorithm, the data points to be added are first clustered using the DBSCAN algorithm, and then these new clusters are merged with the existing clusters to produce the modified set of clusters. That is, we add clusters incrementally rather than adding points incrementally. It is found that the proposed incremental clustering algorithm produces the same clusters as obtained by Incremental DBSCAN. We have used R*-trees as the data structure to hold the multidimensional data that we need to cluster. One of the major advantages of the proposed approach is that it allows us to see the clustering patterns of the new data along with the existing clustering patterns. Moreover, we can see the merged clusters as well. The proposed algorithm is capable of considerable savings, in terms of region queries performed, as compared to Incremental DBSCAN. Results are presented to support the claim.
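The cluster-level merge step can be illustrated with a deliberately simplified sketch. It assumes both the existing clusters and the new batch have already been clustered (for example, by DBSCAN), and it merges two clusters whenever any pair of their points lies within `eps` of each other. That merge rule, the brute-force distance check (the paper uses R*-trees and region queries), and the function names are assumptions for illustration; the paper's actual merge condition is not given in the abstract.

```python
import math


def clusters_touch(a, b, eps):
    """True if some point of cluster a is within eps of some point of cluster b."""
    return any(math.dist(p, q) <= eps for p in a for q in b)


def merge_new_clusters(existing, new, eps):
    """Merge each new cluster with every existing cluster it touches.

    Assumes the existing clusters are mutually separated by more than eps,
    so a single pass per new cluster is enough.
    """
    merged = [list(c) for c in existing]
    for cluster in new:
        cluster = list(cluster)
        kept = []
        for old in merged:
            if clusters_touch(old, cluster, eps):
                cluster = old + cluster  # absorb the existing cluster
            else:
                kept.append(old)
        kept.append(cluster)
        merged = kept
    return merged


existing = [[(0, 0), (0, 1)], [(10, 10)]]
new_batch = [[(0, 2)], [(20, 20)]]
result = merge_new_clusters(existing, new_batch, eps=1.0)
print(result)
```

Here the first new cluster joins an existing one, while the second stays separate, so both the new-data patterns and the merged clusters remain visible.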