Browsing by Author "Challa, Jagat Sesh"

An Adaptive Hierarchical Method for Anytime Set-wise Clustering of Variable and High-Speed Data Streams
(IEEE, 2023) Challa, Jagat Sesh; Goyal, Poonam; Goyal, Navneet
Set-wise Clustering is a clustering technique for data streams that groups sets of objects based on distribution patterns, applicable in contexts like retail chain clustering, text-based community clustering, restaurant categorization, etc. The existing set-wise clustering method cannot handle variable and high-speed streams with reasonable accuracy. This paper presents an Anytime Set-wise Clustering method for data streams known as ANYSETCLUS. The method handles the variable inter-arrival rates of stream objects using a proposed indexing structure called AnySetClusTree, which stores a hierarchy of micro-clusters of multi-set entities at varying granularity. ANYSETCLUS is highly adaptive as it supports incremental model updates, segregates outliers, enables outlier-to-concept transition, and captures concept drift. The method also enables anytime offline clustering wherein it can generate multiple clusterings of varying granularity and purity depending upon the available time allowance for final clustering. The experimental results affirm the superior efficacy of the proposed method in handling variable and high-speed streams compared to the state-of-the-art method. The experimental results also showcase its effectiveness in achieving significantly higher micro-cluster purity for low and high-speed streams. This contrasts with the state-of-the-art method, which is unable to generate valid clustering results for high-speed streams. The experiments further validate the proposed method’s capability for anytime offline clustering.
AnyFI: An Anytime Frequent Itemset Mining Algorithm for Data Streams
(IEEE, 2017) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Mining frequent itemsets from transactional data streams has been studied extensively in the literature. The existing algorithms mine frequent itemsets within the stream's constrained environment of limited time and memory. However, none of them can handle varying inter-arrival rates of streams. Moreover, these algorithms cannot give mining results instantaneously, even with compromised accuracy if required, nor can they improve the accuracy as the time allowance increases. These two properties characterize an anytime algorithm. In this paper, we propose AnyFI, which is the first anytime frequent itemset mining algorithm for data streams. We also propose a novel data structure, BFI-forest, which is capable of handling transactions with varying inter-arrival rates. AnyFI maintains itemsets in BFI-forest in such a way that it can give a mining result almost immediately when the time allowance to mine is very short and can refine the results for better accuracy as the time allowance increases. Our experimental results show that AnyFI can handle high stream speeds of up to 60,000 transactions per second (tps) with recall close to 100%.
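The anytime property described in this abstract (an immediate, approximate answer that improves as more time is allowed) can be illustrated with a toy sketch. This is not AnyFI and it does not use BFI-forest; it counts single items rather than itemsets, and the function name `anytime_frequent_items` and the sample stream are assumptions made purely for illustration.

```python
from collections import Counter

def anytime_frequent_items(transactions, min_support):
    """Yield a refined estimate of the frequent items after each transaction.

    The caller may stop consuming the generator at any point and keep the
    latest estimate, which is what makes this toy loop "anytime".
    """
    counts = Counter()
    for n, transaction in enumerate(transactions, start=1):
        counts.update(set(transaction))  # count each item once per transaction
        yield {item for item, c in counts.items() if c / n >= min_support}

# Hypothetical sample stream; each tuple is one transaction.
stream = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b"), ("a",), ("b", "c")]
estimates = list(anytime_frequent_items(stream, min_support=0.6))
early = estimates[1]    # interrupted after two transactions
final = estimates[-1]   # allowed to read the whole stream
```

With this sample, the early estimate contains only item "a", while the final estimate also includes "b", showing how accuracy can improve with a larger time allowance.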
AnySC: Anytime Set-wise Classification of Variable Speed Data Streams
(IEEE, 2018-12) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Classification of data streams has gained a lot of popularity in recent years owing to its multiple applications. In certain applications, such as community detection from text feeds and website fingerprinting attacks, it is more meaningful to associate class labels with groups of objects rather than with individual objects. This kind of classification problem is known as the set-wise classification problem. The few algorithms available in the literature for this problem are budget algorithms, i.e., they are designed to process a fixed maximum stream speed and are not capable of handling variable and high-speed streams. We present ANYSC, which is the first anytime set-wise classification algorithm for data streams. ANYSC handles the variable inter-arrival rate of objects in the stream and performs classification of test entities within any available time allowance, using a proposed data structure referred to as CProf-forest. The experimental results show that ANYSC brings in the features of an anytime algorithm and outperforms the existing approaches.
AnyStreamKM: Anytime k-medoids Clustering for Streaming Data
(IEEE, 2022) Challa, Jagat Sesh; Goyal, Navneet; Goyal, Poonam
Stream clustering algorithms have gained a lot of importance in the recent past due to the rapidly rising use of IoT systems and applications. Anytime algorithms and frameworks play a key role in handling streams in which data arrives or is generated at variable rates. They are capable of handling both slow and fast stream speeds while generating the result with the highest possible accuracy. In this paper, we present AnyStreamKM, which is a framework for anytime k-medoids clustering of data streams. It uses a proposed hierarchical data indexing structure known as AnyKMTree that stores the incoming data from the stream in the form of a hierarchy of micro-clusters. AnyKMTree is an adaptation of the R-tree with its splitting strategy inspired by the design principles of k-medoids clustering. AnyKMTree not only supports anytime features but is also capable of filtering out noise and outliers. Our experimental analysis establishes that AnyKMTree produces micro-clusters that are more compact and purer than those of the state-of-the-art methods. Also, when offline k-medoids clustering such as PAM (Partitioning Around Medoids) is applied to the micro-clusters produced by AnyKMTree, the resultant clustering has been found to be of higher quality than that of the state-of-the-art methods.
Anytime clustering of data streams while handling noise and concept drift
(Taylor & Francis, 2021-03) Goyal, Poonam; Goyal, Navneet; Challa, Jagat Sesh
Clustering of data streams has become very popular in recent times, owing to the rapid rise of real-time streaming utilities that produce large amounts of data at varying inter-arrival rates. We propose AnyClus, a framework for anytime clustering of data streams. AnyClus uses a proposed variant of the R-tree, AnyRTree, to capture the incoming stream objects arriving at a variable rate and to index them as micro-clusters in a hierarchical fashion. The leaf-level micro-clusters produced are aggregated and stored in a logarithmic tilted-time window framework (TTWF). Our extensive experimental analysis shows (i) the capability of AnyClus in handling variable stream speeds (up to 250k objects/second); (ii) its ability to produce micro-clusters of high purity (≈1) and compactness; (iii) the effectiveness of AnyRTree in handling noise, capturing concept drift, and preserving spatial locality in the indexing of micro-clusters, when compared to the existing methods. We also propose a parallel framework, Any-MP-Clus, for anytime clustering of multi-port data streams over commodity clusters. Any-MP-Clus uses AnyRTree at each computing node of the cluster (for each stream-port) and maintains the aggregated micro-clusters in TTWF. The experimental results on billion-scale datasets show that Any-MP-Clus is scalable and efficient, and produces clustering of higher quality.
Anytime Frequent Itemset Mining of Transactional Data Streams
(Elsevier, 2020-09) Goyal, Poonam; Goyal, Navneet; Challa, Jagat Sesh
Mining frequent itemsets from transactional data streams has become essential in today's world, with many applications such as stock market analysis, retail chain analysis, and web log analysis. Various algorithms have been proposed to efficiently mine single-port and multi-port transactional streams within the constraints of limited time and memory. However, all of them are budget algorithms, i.e., they are not capable of handling varying inter-arrival rates of transactions and high-speed streams. They are constrained by a maximum limit on the arrival rate of transactions, beyond which they fail to process the stream. Also, these algorithms are not capable of giving immediate mining results, even with compromised accuracy if required. The above two properties characterize an anytime algorithm. We propose AnyFI, which is the first anytime frequent itemset mining algorithm for data streams. AnyFI uses a novel data structure, BFI-forest, which is capable of handling transactions arriving at a variable rate. It maintains itemsets in BFI-forest in such a way that it can give a mining result almost immediately when the time allowance to mine is very short and can refine its accuracy as the time allowance increases. We also propose MPAnyFI, which extends AnyFI into a parallel framework for anytime frequent itemset mining of multi-port data streams over commodity clusters. It uses AnyFI at each computing node of the cluster. Our extensive experimental analysis shows that AnyFI can handle high stream speeds close to 60,000 trans/sec with recall close to 100%. The results also show the efficiency of MPAnyFI.
A concurrent k-NN search algorithm for R-tree
(ACM Digital Library, 2015-10) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
k-nearest neighbor (k-NN) search is one of the most commonly used queries in database systems. It has applications in various domains such as data mining, decision support systems, information retrieval, and multimedia and spatial databases. When k-NN search is performed over large data sets, spatial data indexing structures such as R-trees are commonly used to improve query efficiency. The best-first k-NN (BF-kNN) algorithm is the fastest known k-NN algorithm over R-trees. We present CBF-kNN, a concurrent BF-kNN for R-trees, which is the first concurrent version of k-NN we know of for R-trees. CBF-kNN uses one of the most efficient concurrent priority queues, known as mound. CBF-kNN overcomes the concurrency limitations of priority queues by using a tree-parallel mode of execution. CBF-kNN has an estimated speedup of O(p/k) for p threads. Experimental results on various real datasets show that the speedup in practice is close to this estimate.
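The sequential best-first strategy that this abstract builds on can be sketched as follows. This is a minimal, single-threaded illustration only: it is not CBF-kNN, it uses Python's `heapq` rather than a mound, and a hand-built hierarchy of bounding rectangles stands in for a real R-tree. The names `Node`, `min_dist`, and `bf_knn` are hypothetical.

```python
import heapq
import itertools
import math

class Node:
    """A node holding either child nodes (internal) or 2-D points (leaf)."""
    def __init__(self, children=None, points=None):
        self.children = children or []
        self.points = points or []
        corners = self.points or [c for ch in self.children for c in (ch.lo, ch.hi)]
        self.lo = tuple(min(p[i] for p in corners) for i in range(2))
        self.hi = tuple(max(p[i] for p in corners) for i in range(2))

def min_dist(q, node):
    """Smallest possible distance from q to anything inside node's rectangle."""
    return math.sqrt(sum(max(node.lo[i] - q[i], 0, q[i] - node.hi[i]) ** 2
                         for i in range(2)))

def bf_knn(root, q, k):
    """Return the k points nearest to q, in order of increasing distance."""
    tie = itertools.count()  # tie-breaker so the heap never compares Node objects
    heap = [(min_dist(q, root), next(tie), root)]
    result = []
    while heap and len(result) < k:
        _, _, item = heapq.heappop(heap)
        if isinstance(item, Node):
            for child in item.children:
                heapq.heappush(heap, (min_dist(q, child), next(tie), child))
            for p in item.points:
                heapq.heappush(heap, (math.dist(q, p), next(tie), p))
        else:
            result.append(item)  # nothing left in the queue can be closer
    return result

left = Node(points=[(0, 0), (1, 1)])
right = Node(points=[(5, 5), (6, 5)])
root = Node(children=[left, right])
nearest = bf_knn(root, (4, 4), 2)
```

In this example the `left` leaf is never expanded, because both answers are found before its rectangle reaches the front of the queue. The single shared priority queue is also the point of contention that a concurrent variant has to address.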
DD-Rtree: A dynamic distributed data structure for efficient data distribution among cluster nodes for spatial data mining algorithms
(IEEE, 2016) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
Parallelizing data mining algorithms has become a necessity as we try to mine ever-increasing volumes of data. Spatial data mining algorithms such as DBSCAN, OPTICS, and SLINK have been parallelized to exploit a cluster infrastructure. The efficiency achieved by existing algorithms can be attributed to spatial locality preservation using spatial indexing structures such as k-d-tree, quad-tree, and grid files for distributing data among cluster nodes. However, these indexing structures are static in nature, i.e., they need to scan the entire dataset to determine the partitioning coordinates. This results in a high data distribution cost when the data size is large. In this paper, we propose a dynamic distributed data structure, DD-Rtree, which preserves spatial locality while distributing data across compute nodes in a shared-nothing environment. Moreover, DD-Rtree is dynamic, i.e., it can be constructed incrementally, making it useful for handling big data. We compare the quality of data distribution achieved by DD-Rtree with that of a recent distributed indexing structure, SD-Rtree. We also compare the efficiency of queries supported by these indexing structures, along with the overall efficiency of the DBSCAN algorithm. Our experimental results show that DD-Rtree achieves better data distribution, thereby improving overall efficiency.
Design and development of data indexing techniques for mining large and streaming data
(BITS Pilani, Pilani Campus, 2019-11-30) Challa, Jagat Sesh
Development of Simulation Program to Optimise Process Parameters of Steam Power Cycles
(IJTEE, 2014) Challa, Jagat Sesh; Srinivasan, P
Conventional coal-based thermal power plants have an average overall efficiency in the range of 35-38%. Any increase in the efficiency of these power plants is subject to constraints posed by the maximum and minimum temperatures, which are restricted by the creep property of materials and the ambient temperature, respectively. Hence, an increase in efficiency beyond certain limits is not possible without optimising the process parameters associated with reheat and regenerative cycles. In this work, an attempt is made to optimise reheat and regenerative cycle process parameters, such as reheat pressure, tapping pressure of bled steam, and mass fraction of bled steam, in order to achieve maximum cycle efficiency. The optimisation of the process parameters was achieved by developing a simulation program using Microsoft Visual Studio. This program takes into account the isentropic efficiencies of turbines and pumps and the pressure drop in the boiler, and it can be used to simulate the optimum operating conditions of thermal power plants based on multi-stage reheat and regenerative cycles. A comparison between the efficiencies of eight kinds of steam power cycles, at optimised conditions, has been made for different boiler pressures and steam temperatures at the turbine inlet. This comparison can aid power plant designers in choosing appropriate steam power cycles for a given set of operating conditions. It is observed that the results obtained from the program, such as the optimum reheat pressures for two-stage reheat cycles and the optimum bled steam tapping pressures for two-stage regenerative cycles, are in good agreement with the published literature.
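As a rough illustration of how isentropic efficiencies enter a cycle-efficiency calculation, the following sketch evaluates a simple Rankine cycle from given enthalpies. It is not the authors' simulation program: it omits reheat, regeneration, and boiler pressure drop, and the function name `rankine_efficiency` and all numeric state values are assumptions chosen only for the example.

```python
def rankine_efficiency(h1, h2s, h3, w_pump_s, eta_turbine=1.0, eta_pump=1.0):
    """Thermal efficiency of a simple Rankine cycle (all energies in kJ/kg).

    h1       : enthalpy at turbine inlet
    h2s      : enthalpy at turbine exit for isentropic expansion
    h3       : enthalpy at pump inlet (condenser exit)
    w_pump_s : isentropic pump work
    """
    w_turbine = eta_turbine * (h1 - h2s)  # actual turbine work
    w_pump = w_pump_s / eta_pump          # actual pump work
    h4 = h3 + w_pump                      # enthalpy at boiler inlet
    q_in = h1 - h4                        # heat added in the boiler
    return (w_turbine - w_pump) / q_in

# Illustrative state values, assumed for this example only.
ideal = rankine_efficiency(3400.0, 2100.0, 190.0, 10.0)
actual = rankine_efficiency(3400.0, 2100.0, 190.0, 10.0,
                            eta_turbine=0.85, eta_pump=0.80)
```

With these assumed values, accounting for turbine and pump losses lowers the computed efficiency from about 40% to about 34%. An optimiser would wrap a function of this kind and search over parameters such as reheat pressure.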
Grid-R-tree: a data structure for efficient neighborhood and nearest neighbor queries in data mining
(Springer, 2020-04) Goyal, Poonam; Goyal, Navneet; Challa, Jagat Sesh
The use of multi-dimensional indexing structures has gained a lot of attention in data mining. The most commonly used data structures for indexing data are the R-tree and its variants, quad-tree, k-d-tree, etc. These data structures support region queries (point, window, and neighborhood queries) and nearest neighbor queries. These queries are extensively used in data mining algorithms. Although these data structures facilitate execution of the above queries in logarithmic time, the constraints associated with them become a bottleneck in query execution when they are used for large and high-dimensional datasets. Moreover, these indexing structures do not cater to the specific data access patterns of data mining algorithms. In this paper, we propose a new data structure, Grid-R-tree, a grid-based R-tree that is specifically designed to address the querying requirements of multiple data mining algorithms. Grid-R-tree is a simple yet effective adaptation of the R-tree using the concept of a grid. We also introduce a new query over Grid-R-tree, called the cell-wise epsilon neighborhood query (CellWiseNBH), which captures the locality in the query execution pattern of density-based clustering algorithms and enables us to redesign them to improve their efficiency. Our theoretical and experimental analysis shows that the proposed data structure outperforms the conventional R-tree in terms of neighborhood and nearest neighbor queries. The experiments were conducted on datasets of size up to 100 million and dimensionality up to 74. The results also suggest that Grid-R-tree improves the efficiency of data mining algorithms such as the k-nearest neighbor classifier and DBSCAN clustering (including the redesigned version that uses CellWiseNBH). Additionally, an adaptive grid optimization has been applied to dense cells in which the number of indexed data points exceeds a threshold τ, in order to keep the load evenly distributed across cells; this resulted in more efficient query performance for datasets with a skewed distribution of data points.
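The general idea of using grid cells to localise an epsilon-neighborhood query can be sketched with a plain uniform grid. This is neither Grid-R-tree nor the CellWiseNBH query; it is a standard hash-grid technique, and the names `build_grid` and `eps_neighborhood` and the sample points are hypothetical.

```python
from collections import defaultdict
from itertools import product
import math

def build_grid(points, eps):
    """Hash each point into a cell whose side length equals eps."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(c / eps) for c in p)].append(p)
    return grid

def eps_neighborhood(grid, q, eps):
    """Return all indexed points within distance eps of q.

    Because cells have side eps, only the cell containing q and its
    adjacent cells can hold qualifying points.
    """
    cell = tuple(math.floor(c / eps) for c in q)
    found = []
    for offset in product((-1, 0, 1), repeat=len(q)):
        neighbor = tuple(c + o for c, o in zip(cell, offset))
        for p in grid.get(neighbor, []):
            if math.dist(p, q) <= eps:
                found.append(p)
    return found

# Hypothetical sample data.
points = [(0.1, 0.1), (0.9, 0.2), (2.5, 2.5), (1.1, 0.1)]
grid = build_grid(points, eps=1.0)
neighbors = sorted(eps_neighborhood(grid, (1.0, 0.0), eps=1.0))
```

The number of adjacent cells grows as 3^d with dimensionality d, which is one reason a plain grid of this kind does not scale to high-dimensional data on its own.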
HADCLEAN: A hybrid approach to data cleaning in data warehouses
(IEEE, 2012) Challa, Jagat Sesh; Sharma, Yashvardhan
Data cleaning is an essential step in populating and maintaining data warehouses. Owing to likely differences in conventions between the external sources and the target data warehouse, as well as a variety of errors, data from external sources may not conform to the standards and requirements of the data warehouse. Therefore, data has to be transformed and cleaned before it is loaded into the warehouse so that downstream data analysis is reliable and accurate. This is usually accomplished through an Extract-Transform-Load (ETL) process. Typical data cleaning tasks include record matching, de-duplication, and column segmentation, which often go beyond traditional relational operators. This has led to the development of a broad range of methods intended to enhance the accuracy, and thereby the usability, of existing data. Data cleansing is the first and most critical step in a Business Intelligence (BI) or Data Warehousing (DW) project, yet easily the most underestimated. T. Redman [1] suggests that the cost associated with poor-quality data is about 8-12% of the revenue of a typical organization. Thus, performing data cleaning is critical when building any enterprise data warehouse.
The Impact of Large Language Models on K-12 Education in Rural India: A Thematic Analysis of Student Volunteer's Perspectives
(2025-05) Kumar, Dhruv; Challa, Jagat Sesh; Ramachandran, Veena
AI-driven education, particularly Large Language Models (LLMs), has the potential to address learning disparities in rural K-12 schools. However, research on AI adoption in rural India remains limited, with existing studies focusing primarily on urban settings. This study examines the perceptions of volunteer teachers on AI integration in rural education, identifying key challenges and opportunities. Through semi-structured interviews with 23 volunteer educators in Rajasthan and Delhi, we conducted a thematic analysis to explore infrastructure constraints, teacher preparedness, and digital literacy gaps. Findings indicate that while LLMs could enhance personalized learning and reduce teacher workload, barriers such as poor connectivity, lack of AI training, and parental skepticism hinder adoption. Despite concerns over over-reliance and ethical risks, volunteers emphasize that AI should be seen as a complementary tool rather than a replacement for traditional teaching. Given the potential benefits, LLM-based tutors merit further exploration in rural classrooms, with structured implementation and localized adaptations to ensure accessibility and equity.
InFER++: real-world Indian facial expression dataset
(IEEE, 2024-08) Challa, Jagat Sesh; Narang, Pratik
Detecting facial expressions is a challenging task in the field of computer vision. Several datasets and algorithms have been proposed over the past two decades; however, their overall performance degrades when they are deployed in real-world, in-the-wild scenarios. This is because the training data does not completely represent socio-cultural and ethnic diversity; the majority of the datasets consist of American and Caucasian populations. In a diverse and heterogeneous population such as that of the Indian subcontinent, the need for a sufficiently large dataset representing all the ethnic groups is even more critical. To address this, we present InFER++, an India-specific, multi-ethnic, real-world, in-the-wild facial expression dataset consisting of seven basic expressions. To the best of our knowledge, this is the largest India-specific facial expression dataset. Our cross-dataset analysis of RAF-DB vs InFER++ shows that models trained on RAF-DB do not generalize to ethnic datasets like InFER++. This is because facial expressions change with respect to ethnic and socio-cultural factors. We also present LiteXpressionNet, a lightweight deep facial expression network that outperforms many existing lightweight models with considerably fewer FLOPs and parameters. The proposed model is inspired by the MobileViTv2 architecture and utilizes GhostNetv2 blocks to increase parametrization while reducing latency and FLOP requirements. The model is trained with a novel objective function that combines early learning regularization and symmetric cross-entropy loss to mitigate the human uncertainties and annotation bias present in most real-world facial expression datasets.
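The symmetric cross-entropy term mentioned in this abstract is commonly formulated as a weighted sum of the usual cross-entropy and a reverse cross-entropy in which log 0 is clamped to a negative constant. The sketch below follows that common formulation for a single sample. The weights `alpha` and `beta`, the clamp value `log_zero`, and the function name are assumptions; the abstract does not state the values used or how the loss is combined with early learning regularization, which is omitted here.

```python
import math

def symmetric_cross_entropy(pred, target, alpha=1.0, beta=1.0, log_zero=-4.0):
    """alpha * CE(target, pred) + beta * reverse CE(pred, target).

    pred   : predicted class probabilities
    target : label distribution (typically one-hot)
    """
    eps = 1e-7  # keeps log() finite when a predicted probability is 0
    ce = -sum(t * math.log(max(p, eps)) for p, t in zip(pred, target))
    # In the reverse term, log(0) on the label side is clamped to log_zero.
    rce = -sum(p * (math.log(t) if t > 0 else log_zero)
               for p, t in zip(pred, target))
    return alpha * ce + beta * rce

loss = symmetric_cross_entropy([0.1, 0.8, 0.1], [0.0, 1.0, 0.0])
perfect = symmetric_cross_entropy([0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
```

The reverse term is bounded for one-hot labels, which is the property usually cited as making this loss more tolerant of noisy annotations than cross-entropy alone.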
Integrated Software Quality Evaluation: A Fuzzy Multi-Criteria Approach
(Korea Science, 2011) Singh, Ajit Pratap; Challa, Jagat Sesh
Software measurement is a key factor in managing, controlling, and improving software development processes. Software quality is one of the most important factors for assessing the global competitive position of any software company. Thus, quantifying quality parameters and integrating them into quality models is essential. Software quality criteria are not easily measured and quantified. Many attempts have been made to exactly quantify software quality parameters using various models such as the ISO/IEC 9126 Quality Model, Boehm's Model, and McCall's Model. In this paper, an attempt has been made to provide a tool for precisely quantifying software quality factors with the help of the quality factors stated in the ISO/IEC 9126 model. Due to the unpredictable nature of software quality attributes, a fuzzy multi-criteria approach has been used to evaluate the quality of the software.
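One common way to combine fuzzy ratings across several criteria is to represent each rating as a triangular fuzzy number, take a weighted average, and defuzzify by the centroid. The sketch below shows that generic scheme; it is not necessarily the procedure used in the paper. The function name, the crisp weights, and the ratings are assumptions, and only the three criterion names are taken from ISO/IEC 9126.

```python
def weighted_fuzzy_score(ratings, weights):
    """Aggregate triangular fuzzy ratings (low, mode, high) with crisp weights.

    Returns the aggregated triangular number and its centroid, which for a
    triangular fuzzy number is (low + mode + high) / 3.
    """
    total = sum(weights)
    aggregated = tuple(
        sum(w * r[i] for r, w in zip(ratings, weights)) / total
        for i in range(3)
    )
    return aggregated, sum(aggregated) / 3

# Hypothetical ratings on a 0-10 scale for three ISO/IEC 9126 characteristics.
ratings = [
    (6.0, 8.0, 10.0),  # functionality
    (4.0, 6.0, 8.0),   # reliability
    (2.0, 4.0, 6.0),   # usability
]
weights = [0.5, 0.3, 0.2]  # assumed relative importance
fuzzy_quality, crisp_quality = weighted_fuzzy_score(ratings, weights)
```

For these assumed inputs the aggregated rating is approximately (4.6, 6.6, 8.6), with a crisp score of about 6.6.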
Optimizing liquid neural networks: a comparative study of LTCs and CFCs
(IEEE, 2024) Challa, Jagat Sesh
Liquid Time Constant Networks (LTCs) and Closed Form Continuous Networks (CFCs) are recent time-continuous RNN models known for superior expressivity and efficiency in time-series prediction and autonomous navigation. This paper provides an accessible overview of these models and investigates their performance on tasks like Atari 'Breakout' behavior cloning, steering angle prediction, and Global Horizontal Irradiance (GHI) forecasting. We optimize LTC and CFC cells within network structures, comparing them with LSTM. Detailed experiments highlight the impact of various hyperparameters, underscoring the effectiveness of LTCs and CFCs in dynamic prediction tasks.
Parallelizing OPTICS for Commodity Clusters
(ACM Digital Library, 2015-01) Goyal, Navneet; Goyal, Poonam; Challa, Jagat Sesh
In this paper, we propose DOPTICS, a parallelized version of the popular density-based cluster-ordering algorithm OPTICS. Parallelizing OPTICS is challenging because of its strongly sequential data access behavior. To achieve high parallelism, a data-parallel approach that exploits the underlying indexing structure is proposed. We implement the proposed algorithm for processor nodes in a commodity cluster as well as across cores in a processor. Moreover, the clusters obtained by our algorithm are exactly the same as those of classical OPTICS, unlike those of the only existing implementation of parallel OPTICS. We demonstrate the performance of the proposed algorithm on a commodity cluster, which is typically a combination of distributed and shared memory systems. Experimental results on several large real and synthetic data sets with varying dimensions are presented to show the speedup and scalability achieved. The speedup obtained is remarkable and is found to scale well with an increasing number of processing elements. The performance improvements of the proposed DOPTICS algorithm are due to algorithmic optimizations and the parallelization strategy.
Quantification of Software Quality Parameters Using Fuzzy Multi Criteria Approach
(IEEE, 2011) Challa, Jagat Sesh
Software quality is the measure of the appropriateness of the design of the software and how well the software adheres to that design. There are some metrics and measurements to determine software quality. Software quality measurement is possible only by quantifying the characteristics affecting software quality. For measuring quality, parameters or quality factors that vary over a domain of discourse are considered. The quality factors stated in the ISO/IEC 9126 model are used in this paper. Due to the unpredictable nature of these factors or attributes, a fuzzy approach has been used to estimate software quality.
A review of the applications of machine learning for prediction and analysis of mechanical properties and microstructures in additive manufacturing
(ACM Digital Library, 2024-12) Challa, Jagat Sesh; Singh, Amit Rajnarayan
This article provides an insightful review of the recent applications of machine learning (ML) techniques in additive manufacturing (AM) for the prediction and amelioration of mechanical properties, as well as the analysis and prediction of microstructures. AM is the modern digital manufacturing technique adopted in various industrial sectors because of its salient features, such as the fabrication of geometrically complex and customized parts, the fabrication of parts with unique properties and microstructures, and the fabrication of hard-to-manufacture materials. The functioning of the AM processes is complicated. Several factors such as process parameters, defects, cooling rates, thermal histories, and machine stability have a prominent impact on AM products’ properties and microstructure. It is difficult to establish the relationship between these AM factors and the AM end product properties and microstructure. Several studies have utilized different ML techniques to optimize AM processes and predict mechanical properties and microstructure. This article discusses the applications of various ML techniques in AM to predict mechanical properties and optimization of AM processes for the amelioration of mechanical properties of end parts. Also, ML applications for segmentation, prediction, and analysis of AM-fabricated material’s microstructures and acceleration of microstructure prediction procedures are discussed in this article.
The role of generative AI tools in shaping mechanical engineering education from an undergraduate perspective
(Springer Nature, 2025-03) Challa, Jagat Sesh; Kumar, Dhruv
This study evaluates the effectiveness of three leading generative AI tools (ChatGPT, Gemini, and Copilot) in undergraduate mechanical engineering education using a mixed-methods approach. The performance of these tools was assessed on 800 questions spanning seven core subjects, covering multiple-choice, numerical, and theory-based formats. While all three AI tools demonstrated strong performance in theory-based questions, they struggled with numerical problem-solving, particularly in areas requiring deep conceptual understanding and complex calculations. Among them, Copilot achieved the highest accuracy (60.38%), followed by Gemini (57.13%) and ChatGPT (46.63%). To complement these findings, a survey of 172 students and interviews with 20 participants provided insights into user experiences, challenges, and perceptions of AI in academic settings. Thematic analysis revealed concerns regarding AI’s reliability in numerical tasks and its potential impact on students’ problem-solving abilities. Based on these results, this study offers strategic recommendations for integrating AI into mechanical engineering curricula, ensuring its responsible use to enhance learning without fostering dependency. Additionally, we propose instructional strategies to help educators adapt assessment methods in the era of AI-assisted learning. These findings contribute to the broader discussion on AI’s role in engineering education and its implications for future learning methodologies.