BITS Faculty Publications

Permanent URI for this community: http://localhost:4000/handle/123456789/1867

Classification and study of music genres with multimodal Spectro-Lyrical Embeddings for Music (SLEM)
(Springer, 2024-04) Narang, Pratik
The essence of music is inherently multi-modal, with audio and lyrics going hand in hand. However, very little research has been done to study the intricacies of the multi-modal nature of music and its relation to genres. Our work uses this multi-modality to present spectro-lyrical embeddings for music representation (SLEM), leveraging the power of open-sourced, lightweight, and state-of-the-art deep learning vision and language models to encode songs. This work summarises extensive experimentation with over 20 deep learning-based music embeddings of a self-curated and hand-labeled multi-lingual dataset of 226 recent songs spread over 5 genres. Our aim is to study the effects of varying the weight of lyrics and spectrograms in the embeddings on multi-class genre classification. The purpose of this study is to prove that a simple linear combination of both modalities is better than either modality alone. Our methods achieve an accuracy ranging from 81.08% to 98.60% for different genres, using the K-nearest neighbors algorithm on the multimodal embeddings. We successfully study the intricacies of genres in this representational space, including their misclassification, visual clustering with EM-GMM, and the domain-specific meaning of the multi-modal weight for each genre with respect to ‘instrumentalness’ and ‘energy’ metadata. SLEM presents one of the first works on an end-to-end method that uses spectro-lyrical embeddings without hand-engineered features.
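The weighted combination of modalities followed by K-nearest neighbors classification can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fuse` and `knn_predict`, the convention that `w` is the lyric weight, the assumption that both embeddings share one dimension, and the Euclidean distance are all illustrative choices not stated in the abstract.

```python
import math
from collections import Counter


def fuse(spec, lyric, w):
    """Linearly combine two equal-length embeddings.

    w is the (hypothetical) lyric weight in [0, 1]; the spectrogram
    embedding receives weight 1 - w.
    """
    if len(spec) != len(lyric):
        raise ValueError("embeddings must share a dimension")
    return [w * l + (1.0 - w) * s for s, l in zip(spec, lyric)]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors nearest to query.

    train is a list of (vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data: fused 2-D embeddings with made-up genre labels.
train = [
    ([0.0, 0.0], "rock"),
    ([0.0, 1.0], "rock"),
    ([5.0, 5.0], "pop"),
    ([5.0, 6.0], "pop"),
    ([6.0, 5.0], "pop"),
]
query = fuse([0.4, 0.2], [0.0, 0.0], w=0.5)
print(knn_predict(train, query, k=3))
```

Sweeping `w` from 0 to 1 and recording the classification accuracy at each value is one way to examine how the balance between lyrics and spectrograms affects each genre.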
InFER++: real-world indian facial expression dataset
(IEEE, 2024-08) Challa, Jagat Sesh; Narang, Pratik
Detecting facial expressions is a challenging task in the field of computer vision. Several datasets and algorithms have been proposed over the past two decades; however, their performance degrades when they are deployed in real-world, in-the-wild scenarios. This is because the training data does not completely represent socio-cultural and ethnic diversity; the majority of the datasets consist of American and Caucasian populations. In a diverse and heterogeneous population such as that of the Indian subcontinent, the need for a sufficiently large dataset representing all the ethnic groups is even more critical. To address this, we present InFER++, an India-specific, multi-ethnic, real-world, in-the-wild facial expression dataset consisting of seven basic expressions. To the best of our knowledge, this is the largest India-specific facial expression dataset. Our cross-dataset analysis of RAF-DB vs InFER++ shows that models trained on RAF-DB do not generalize to ethnic datasets like InFER++. This is because facial expressions change with respect to ethnic and socio-cultural factors. We also present LiteXpressionNet, a lightweight deep facial expression network that outperforms many existing lightweight models with considerably fewer FLOPs and parameters. The proposed model is inspired by the MobileViTv2 architecture and utilizes GhostNetv2 blocks to increase parametrization while reducing latency and FLOP requirements. The model is trained with a novel objective function that combines early learning regularization and symmetric cross-entropy loss to mitigate human uncertainties and annotation bias in most real-world facial expression datasets.
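For readers unfamiliar with the symmetric cross-entropy term mentioned in the abstract, the following is a minimal single-sample sketch of its commonly used form: standard cross-entropy plus a reverse cross-entropy in which log(0) is clipped to a constant. The weights `alpha` and `beta`, the clipping constant `log_zero`, and the function name are assumptions for illustration. The abstract does not give the authors' exact formulation, and the early learning regularization term is not shown.

```python
import math


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Symmetric cross-entropy for one sample (illustrative sketch).

    probs    : predicted class probabilities (should sum to 1)
    label    : index of the annotated class
    alpha    : weight of the standard cross-entropy term (assumed value)
    beta     : weight of the reverse cross-entropy term (assumed value)
    log_zero : constant standing in for log(0) on the one-hot target
    """
    eps = 1e-12  # guards against log(0) on the prediction side
    # Standard cross-entropy: -log p(label).
    ce = -math.log(max(probs[label], eps))
    # Reverse cross-entropy: -sum_k p_k * log(q_k) with one-hot q.
    # log(q_label) = log(1) = 0, so only the other classes contribute.
    rce = -sum(p * log_zero for k, p in enumerate(probs) if k != label)
    return alpha * ce + beta * rce


print(symmetric_cross_entropy([0.7, 0.2, 0.1], label=0))
```

The reverse term grows with the probability mass placed on classes other than the annotated one but stays bounded, which is why this family of losses is often used when labels may be noisy.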
LDFaceNet: latent diffusion-based network for high-fidelity deepfake generation
(Springer, 2024-12) Narang, Pratik
Over the past decade, there has been tremendous progress in the domain of synthetic media generation. This is mainly due to powerful methods based on generative adversarial networks (GANs). Very recently, diffusion probabilistic models, which are inspired by non-equilibrium thermodynamics, have taken the spotlight. In the realm of image generation, diffusion models (DMs) have exhibited remarkable proficiency in producing both realistic and heterogeneous imagery through their stochastic sampling procedure. This paper proposes a novel facial swapping module, termed LDFaceNet (Latent Diffusion based Face Swapping Network), which is based on a guided latent diffusion model that utilizes facial segmentation and facial recognition modules for a conditioned denoising process. The model employs a unique loss function to offer directional guidance to the diffusion process. Notably, LDFaceNet can incorporate supplementary facial guidance for desired outcomes without any retraining. To the best of our knowledge, this represents the first application of the latent diffusion model in the face-swapping task without prior training. The results of this study demonstrate that the proposed method can generate extremely realistic and coherent images by leveraging the potential of the diffusion model for facial swapping, thereby yielding superior visual outcomes and greater diversity.
Balancing the scales: enhancing fairness in facial emotion recognition with latent alignment
(Springer, 2024-12) Narang, Pratik
Automatically recognizing emotional intent using facial expressions has been a thoroughly investigated topic in the realm of computer vision. Facial Expression Recognition (FER), being a supervised learning task, relies heavily on substantially large data exemplifying various socio-cultural demographic attributes. Over the past decade, several real-world, in-the-wild FER datasets have been proposed, collected through crowd-sourcing or web-scraping. However, most of these practically used datasets employ a manual annotation methodology for labelling emotional intent, which inherently propagates individual demographic biases. Moreover, these datasets also lack an equitable representation of various socio-cultural demographic groups, thereby inducing a class imbalance. Bias analysis and its mitigation have been investigated across multiple domains and problem settings; however, in the FER domain, this is a relatively less explored area. This work leverages representation learning based on latent spaces to mitigate bias in facial expression recognition systems, thereby enhancing a deep learning model’s fairness and overall accuracy.
AffectSRNet: facial emotion-aware super-resolution network
(2025-02) Narang, Pratik
Facial expression recognition (FER) systems in low-resolution settings face significant challenges in accurately identifying expressions due to the loss of fine-grained facial details. This limitation is especially problematic for applications like surveillance and mobile communications, where low image resolution is common and can compromise recognition accuracy. Traditional single-image face super-resolution (FSR) techniques, however, often fail to preserve the emotional intent of expressions, introducing distortions that obscure the original affective content. Given the inherently ill-posed nature of single-image super-resolution, a targeted approach is required to balance image quality enhancement with emotion retention. In this paper, we propose AffectSRNet, a novel emotion-aware super-resolution framework that reconstructs high-quality facial images from low-resolution inputs while maintaining the intensity and fidelity of facial expressions. Our method effectively bridges the gap between image resolution and expression accuracy by employing an expression-preserving loss function, specifically tailored for FER applications. Additionally, we introduce a new metric to assess emotion preservation in super-resolved images, providing a more nuanced evaluation of FER system performance in low-resolution scenarios. Experimental results on standard datasets, including CelebA, FFHQ, and Helen, demonstrate that AffectSRNet outperforms existing FSR approaches in both visual quality and emotion fidelity, highlighting its potential for integration into practical FER applications. This work not only improves image clarity but also ensures that emotion-driven applications retain their core functionality in suboptimal resolution environments, paving the way for broader adoption in FER systems.
AGD-Net: Attention-Guided Dense Inception U-Net for Single-Image Dehazing
(Springer, 2023-12) Chamola, Vinay; Narang, Pratik
Image hazing poses a significant challenge in various computer vision applications, degrading the visual quality and reducing the perceptual clarity of captured scenes. The proposed AGD-Net utilizes a U-Net style architecture with an Attention-Guided Dense Inception encoder-decoder framework. Unlike existing methods that heavily rely on synthetic datasets which are based on CARLA simulation, our model is trained and evaluated exclusively on realistic data, enabling its effectiveness and reliability in practical scenarios. The key innovation of AGD-Net lies in its attention-guided mechanism, which empowers the network to focus on crucial information within hazy images and effectively suppress artifacts during the dehazing process. The dense inception modules further advance the representation capabilities of the model, facilitating the extraction of intricate features from the input images. To assess the performance of AGD-Net, a detailed experimental analysis is conducted on four benchmark haze datasets. The results show that AGD-Net significantly outperforms the state-of-the-art methods in terms of PSNR and SSIM. Moreover, a visual comparison of the dehazing results further validates the superior performance gains achieved by AGD-Net over other methods. By leveraging realistic data exclusively, AGD-Net overcomes the limitations associated with synthetic datasets which are based on CARLA simulation, ensuring its adaptability and effectiveness in real-world circumstances. The proposed AGD-Net offers a robust and reliable solution for single-image dehazing, presenting a significant advancement over existing methods.
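PSNR, one of the two metrics cited above, has a standard definition, PSNR = 10 log10(MAX^2 / MSE), which can be sketched as follows. The function name, the flat-list image representation, and the default peak value of 255 for 8-bit images are illustrative assumptions, not details from the paper.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    reference : ground-truth (haze-free) pixel values, flattened
    restored  : dehazed pixel values, flattened
    max_value : largest possible pixel value (255 assumed for 8-bit images)
    """
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


print(psnr([120, 130, 140, 150], [118, 131, 143, 149]))
```

A higher PSNR means the restored image is closer to the reference in a pixel-wise sense; SSIM complements it by comparing local structure rather than raw pixel error.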
Integrating deep learning for visual question answering in Agricultural Disease Diagnostics: Case Study of Wheat Rust
(Springer Nature, 2024) Chamola, Vinay; Narang, Pratik; Rallapall, Srinivas
This paper presents a novel approach to agricultural disease diagnostics through the integration of Deep Learning (DL) techniques with Visual Question Answering (VQA) systems, specifically targeting the detection of wheat rust. Wheat rust is a pervasive and destructive disease that significantly impacts wheat production worldwide. Traditional diagnostic methods often require expert knowledge and time-consuming processes, making rapid and accurate detection challenging. We curated a new dataset, WheatRustDL2024 (7998 images of healthy and infected leaves), specifically designed for VQA in the context of wheat rust detection, and used it to obtain the initial weights on the federated learning server. This dataset comprises high-resolution images of wheat plants, annotated with detailed questions and answers pertaining to the presence, type, and severity of rust infections. Our dataset also contains images collected from various sources and covers a wide range of conditions (different lighting, obstructions in the image, etc.) in which a wheat image may be taken, thereby enabling a more generalized, widely applicable model. The trained model was federated using Flower. Following extensive analysis, the chosen central model was ResNet. Our fine-tuned ResNet achieved an accuracy of 97.69% on the existing data. We also implemented BLIP (Bootstrapping Language-Image Pre-training) methods that enable the model to understand complex visual and textual inputs, thereby improving the accuracy and relevance of the generated answers. The dual attention mechanism, combined with BLIP techniques, allows the model to simultaneously focus on relevant image regions and pertinent parts of the questions. We also created a custom dataset (WheatRustVQA) from our augmented data, containing 1800 augmented images and their associated question-answer pairs. The model produces answers with an average BLEU score of 0.6235 on our testing partition of the dataset.
This federated model is lightweight and can be seamlessly integrated into mobile phones, drones, etc. without any special hardware requirement. Our results indicate that integrating deep learning with VQA for agricultural disease diagnostics not only accelerates the detection process but also reduces dependency on human experts, making it a valuable tool for farmers and agricultural professionals. This approach holds promise for broader applications in plant pathology and precision agriculture and can consequently help address food security issues.
VADGAN: An Unsupervised GAN Framework for Enhanced Anomaly Detection in Connected and Autonomous Vehicles
(IEEE, 2024-09) Narang, Pratik; Alladi, Tejasvi
The utilization of Connected and Autonomous Vehicles (CAVs) is on the rise, driven by their ability to provide vehicular services such as enhancing vehicle safety, aiding in intelligent decision-making, and ensuring continuous operation. CAVs achieve their objectives by employing wireless Vehicle-to-Everything (V2X) communication within Intelligent Transportation Systems (ITS) to establish connections with vehicles within the same network and roadside units. However, it has been observed that certain vehicles violate network constraints by transmitting erroneous messages, resulting in abnormal behaviour. Consequently, there is a growing need for a system that can verify the accuracy of information broadcast by each vehicle regarding its vehicle coordinates (along with relevant data depending on the application) at designated frequencies and under authorized pseudo-identities. Addressing the limitations faced by prior generative AI model applications, such as Variational Autoencoders (VAEs), this paper presents an unsupervised anomaly detection framework using Generative Adversarial Networks (GANs) optimized for CAVs. Our framework, tested across LSTM, RNN, and GRU architectures, shows superior performance with LSTM. It focuses on vehicle dynamics (position, speed, acceleration, and heading) to effectively identify 11 specific attack types, marking a significant advancement in anomaly detection for CAVs.
Attention-enabled Deep Neural Network for Enhancing UAV-Captured Pavement Imagery in Poor Visibility
(IEEE, 2023) Singh, Ajit Pratap; Srinivas, Rallapalli; Narang, Pratik
Integrating Unmanned Aerial Vehicle (UAV) technology with Artificial Intelligence (AI) and Computer Vision has revolutionized asset management, particularly pavement health monitoring. However, current AI-based methods often struggle in low-visibility scenarios, limiting their effectiveness. To address this, we present a novel end-to-end deep learning pipeline that detects image degradation using an efficient Attention mechanism and performs subsequent enhancement. This algorithm can be seamlessly integrated into drones or used for post-processing of pavement imagery. Its efficiency allows for scalability, making it a valuable tool for downstream road health monitoring tasks, such as cost estimation for road repairs. Our approach achieves a mean accuracy of 93.34% with a mean inference time of 0.154 s, demonstrating its efficacy.
A Universal Metric for Robust Evaluation of Synthetic Tabular Data
(IEEE, 2024-01) Lahoti, Mukund; Narang, Pratik
Synthetic tabular data generation becomes crucial when real data are limited, expensive to collect, or simply cannot be used due to privacy concerns. However, producing good quality synthetic data is challenging. Several approaches based on probabilistic methods, statistical methods, generative adversarial networks, and variational autoencoders have been presented for synthetic tabular data generation. Once generated, evaluating the quality of the synthetic data is quite challenging. Some traditional metrics have been used in the literature, but there is a lack of a common, robust, and single metric. This makes it difficult to properly compare the effectiveness of different synthetic tabular data generation methods. In this article, we propose a new universal metric, TabSynDex, for the robust evaluation of synthetic data. The proposed metric assesses the similarity of synthetic data with real data through different component scores, which evaluate the characteristics that are desirable for “high-quality” synthetic data. Being a single score metric and having an implicit bound, TabSynDex can also be used to observe and evaluate the training of neural network-based approaches. This would help in obtaining insights that were not possible earlier. We present several baseline models for comparative analysis of the proposed evaluation metric with existing generative models. We also give a comparative analysis between TabSynDex and existing synthetic tabular data evaluation metrics. This shows the effectiveness and universality of our metric over the existing metrics.