Department of Computer Science and Information Systems
Permanent URI for this collection: http://localhost:4000/handle/123456789/1928
41 results
Search Results

Classification and study of music genres with multimodal Spectro-Lyrical Embeddings for Music (SLEM)
(Springer, 2024-04) Narang, Pratik
The essence of music is inherently multi-modal, with audio and lyrics going hand in hand. However, very little research has studied the intricacies of the multi-modal nature of music and its relation to genres. Our work uses this multi-modality to present spectro-lyrical embeddings for music representation (SLEM), leveraging the power of open-sourced, lightweight, and state-of-the-art deep learning vision and language models to encode songs. This work summarises extensive experimentation with over 20 deep learning-based music embeddings of a self-curated and hand-labeled multi-lingual dataset of 226 recent songs spread over 5 genres. Our aim is to study the effects of varying the weight of lyrics and spectrograms in the embeddings on the multi-class genre classification. The purpose of this study is to prove that a simple linear combination of both modalities is better than either modality alone. Our methods achieve an accuracy ranging from 81.08% to 98.60% for different genres, by using the K-nearest neighbors algorithm on the multimodal embeddings. We successfully study the intricacies of genres in this representational space, including their misclassification, visual clustering with EM-GMM, and the domain-specific meaning of the multi-modal weight for each genre with respect to ‘instrumentalness’ and ‘energy’ metadata. SLEM presents one of the first works on an end-to-end method that uses spectro-lyrical embeddings without hand-engineered features.

InFER++: real-world Indian facial expression dataset
(IEEE, 2024-08) Challa, Jagat Sesh; Narang, Pratik
Detecting facial expressions is a challenging task in the field of computer vision. Several datasets and algorithms have been proposed over the past two decades; however, deploying them in real-world, in-the-wild scenarios hampers the overall performance.
This is because the training data does not completely represent socio-cultural and ethnic diversity; the majority of the datasets consist of American and Caucasian populations. On the contrary, in a diverse and heterogeneous population distribution like the Indian subcontinent, the need for a sufficiently large dataset representing all the ethnic groups is even more critical. To address this, we present InFER++, an India-specific, multi-ethnic, real-world, in-the-wild facial expression dataset consisting of seven basic expressions. To the best of our knowledge, this is the largest India-specific facial expression dataset. Our cross-dataset analysis of RAF-DB vs InFER++ shows that models trained on RAF-DB were not generalizable to ethnic datasets like InFER++. This is because facial expressions change with respect to ethnic and socio-cultural factors. We also present LiteXpressionNet, a lightweight deep facial expression network that outperforms many existing lightweight models with considerably fewer FLOPs and parameters. The proposed model is inspired by the MobileViTv2 architecture, which utilizes GhostNetv2 blocks to increase parametrization while reducing latency and FLOP requirements. The model is trained with a novel objective function that combines early learning regularization and symmetric cross-entropy loss to mitigate human uncertainties and annotation bias in most real-world facial expression datasets.

LDFaceNet: latent diffusion-based network for high-fidelity deepfake generation
(Springer, 2024-12) Narang, Pratik
Over the past decade, there has been tremendous progress in the domain of synthetic media generation. This is mainly due to the powerful methods based on generative adversarial networks (GANs). Very recently, diffusion probabilistic models, which are inspired by non-equilibrium thermodynamics, have taken the spotlight.
In the realm of image generation, diffusion models (DMs) have exhibited remarkable proficiency in producing both realistic and heterogeneous imagery through their stochastic sampling procedure. This paper proposes a novel facial swapping module, termed LDFaceNet (Latent Diffusion based Face Swapping Network), which is based on a guided latent diffusion model that utilizes facial segmentation and facial recognition modules for a conditioned denoising process. The model employs a unique loss function to offer directional guidance to the diffusion process. Notably, LDFaceNet can incorporate supplementary facial guidance for desired outcomes without any retraining. To the best of our knowledge, this represents the first application of the latent diffusion model in the face-swapping task without prior training. The results of this study demonstrate that the proposed method can generate extremely realistic and coherent images by leveraging the potential of the diffusion model for facial swapping, thereby yielding superior visual outcomes and greater diversity.

Balancing the scales: enhancing fairness in facial emotion recognition with latent alignment
(Springer, 2024-12) Narang, Pratik
Automatically recognizing emotional intent using facial expression has been a thoroughly investigated topic in the realm of computer vision. Facial Expression Recognition (FER), being a supervised learning task, relies heavily on substantially large data exemplifying various socio-cultural demographic attributes. Over the past decade, several real-world in-the-wild FER datasets have been proposed that were collected through crowd-sourcing or web-scraping. However, most of these practically used datasets employ a manual annotation methodology for labelling emotional intent, which inherently propagates individual demographic biases. Moreover, these datasets also lack an equitable representation of various socio-cultural demographic groups, thereby inducing a class imbalance.
Bias analysis and its mitigation have been investigated across multiple domains and problem settings; however, in the FER domain, this is a relatively less explored area. This work leverages representation learning based on latent spaces to mitigate bias in facial expression recognition systems, thereby enhancing a deep learning model’s fairness and overall accuracy.

AffectSRNet: facial emotion-aware super-resolution network
(2025-02) Narang, Pratik
Facial expression recognition (FER) systems in low-resolution settings face significant challenges in accurately identifying expressions due to the loss of fine-grained facial details. This limitation is especially problematic for applications like surveillance and mobile communications, where low image resolution is common and can compromise recognition accuracy. Traditional single-image face super-resolution (FSR) techniques, however, often fail to preserve the emotional intent of expressions, introducing distortions that obscure the original affective content. Given the inherently ill-posed nature of single-image super-resolution, a targeted approach is required to balance image quality enhancement with emotion retention. In this paper, we propose AffectSRNet, a novel emotion-aware super-resolution framework that reconstructs high-quality facial images from low-resolution inputs while maintaining the intensity and fidelity of facial expressions. Our method effectively bridges the gap between image resolution and expression accuracy by employing an expression-preserving loss function, specifically tailored for FER applications. Additionally, we introduce a new metric to assess emotion preservation in super-resolved images, providing a more nuanced evaluation of FER system performance in low-resolution scenarios.
Experimental results on standard datasets, including CelebA, FFHQ, and Helen, demonstrate that AffectSRNet outperforms existing FSR approaches in both visual quality and emotion fidelity, highlighting its potential for integration into practical FER applications. This work not only improves image clarity but also ensures that emotion-driven applications retain their core functionality in suboptimal resolution environments, paving the way for broader adoption in FER systems.

VADGAN: An Unsupervised GAN Framework for Enhanced Anomaly Detection in Connected and Autonomous Vehicles
(IEEE, 2024-09) Narang, Pratik; Alladi, Tejasvi
The utilization of Connected and Autonomous Vehicles (CAVs) is on the rise, driven by their ability to provide vehicular services such as enhancing vehicle safety, aiding in intelligent decision-making, and ensuring continuous operation. CAVs achieve their objectives by employing wireless Vehicle-to-Everything (V2X) communication within Intelligent Transportation Systems (ITS) to establish connections with vehicles within the same network and roadside units. However, it has been observed that certain vehicles violate network constraints by transmitting erroneous messages, resulting in abnormal behaviour. Consequently, there is a growing need for a system that can verify the accuracy of information broadcast by each vehicle regarding its vehicle coordinates (along with relevant data depending on the application) at designated frequencies and under authorized pseudo-identities. Addressing the limitations faced by prior generative AI model applications, such as Variational Autoencoders (VAEs), this paper presents an unsupervised anomaly detection framework using Generative Adversarial Networks (GANs) optimized for CAVs.
Our framework, tested across LSTM, RNN, and GRU architectures, shows superior performance with LSTM, focusing on vehicle dynamics (position, speed, acceleration, and heading) to effectively identify 11 specific attack types, marking a significant advancement in anomaly detection for CAVs.

EraisNET: An Optical Flow based 3D ConvNET for Erasing Obstructions
(IEEE, 2022) Narang, Pratik; Rajput, Amitesh Singh
Images captured from behind a fence, window, or during rain generally face occlusions. Though prior works have addressed the problems of de-raining, reflection removal, and occlusion removal individually, a common approach that removes all such obstructions has received little attention in the literature. In this paper, we address the image occlusion problem by proposing a deep learning-based approach that uses motion differences between two images and extracts important moving features from videos to separate the background and the obstruction. To accomplish this task, a novel 3D-convolution architecture is introduced, which is trained with synthetically blended videos. We have used learned layer-based CNN methods combined with dense optical flow and generative networks for better output images. Moreover, a dataset for obstruction removal with sequences for reflection and fencing removal is proposed. The proposed approach is tested on a wide variety of images and is found to be a good candidate against state-of-the-art schemes.

Unwanted Traffic Identification in Large-Scale University Networks: A Case Study
(Springer, 2016) Narang, Pratik
To mitigate the malicious impact of P2P traffic on University networks, in this article the authors have proposed the design of payload-oblivious privacy-preserving P2P traffic detectors.
The proposed detectors do not rely on payload signatures, and hence are resilient to P2P client and protocol changes, a phenomenon which is now becoming increasingly frequent with newer, more popular P2P clients/protocols. The article also discusses newer designs to accurately distinguish P2P botnets from benign P2P applications. The datasets gathered from the testbed and other sources range from Gigabytes to Terabytes, containing both unstructured and structured data assimilated through the running of various applications within the University network. The approaches proposed in this article describe novel ways to handle large amounts of data that are collected at unprecedented scale in the authors’ University network.

Feature Selection for Detection of Peer-to-Peer Botnet Traffic
(ACM Digital Library, 2013) Narang, Pratik
The use of anomaly-based classification of intrusions has increased significantly for Intrusion Detection Systems. A large number of training data samples and a good ‘feature set’ are two primary requirements to build effective classification models with machine learning algorithms. Since the amount of data available for malicious traffic will often be small compared to the available traces of benign traffic, extraction of ‘good’ features which enable detection of malicious traffic is a challenging area of work. This research work presents preliminary results of a performance comparison of three different feature selection algorithms (correlation-based feature selection, consistency-based subset evaluation, and principal component analysis) on three different machine learning techniques, namely decision trees, Naïve Bayes classifier, and Bayesian network classifier. These algorithms are evaluated for the detection of Peer-to-Peer (P2P) based botnet traffic.

PeerShark: Detecting Peer-to-Peer Botnets by Tracking Conversations
(IEEE, 2014) Narang, Pratik
The decentralized nature of Peer-to-Peer (P2P) botnets makes them difficult to detect.
Their distributed nature also exhibits resilience against take-down attempts. Moreover, smarter bots are stealthy in their communication patterns, and elude the standard discovery techniques which look for anomalous network or communication behavior. In this paper, we propose PeerShark, a novel methodology to detect P2P botnet traffic and differentiate it from benign P2P traffic in a network. Instead of the traditional 5-tuple 'flow-based' detection approach, we use a 2-tuple 'conversation-based' approach which is port-oblivious, protocol-oblivious and does not require Deep Packet Inspection. PeerShark could also classify different P2P applications with an accuracy of more than 95%.
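
The difference between 5-tuple flows and 2-tuple conversations described in the PeerShark abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Packet` record, the `aggregate` helper, the 60-second inactivity `timeout`, and the summary fields (packet count, byte volume, duration) are all assumptions made for illustration. The only idea taken from the abstract is that a conversation is keyed on the pair of IP addresses alone, so ports and protocol do not split the traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet header record (no payload is needed)."""
    ts: float       # capture timestamp in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str
    size: int       # packet size in bytes


def flow_key(p):
    """Traditional 5-tuple key: direction, ports, and protocol all matter."""
    return (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.proto)


def conversation_key(p):
    """2-tuple key: the unordered IP pair, ignoring ports and protocol."""
    return tuple(sorted((p.src_ip, p.dst_ip)))


def aggregate(packets, key_fn, timeout=60.0):
    """Group packets by key_fn; a silence longer than `timeout` seconds
    (an assumed value) closes the current group and starts a new one."""
    latest = {}   # key -> index of the most recent group for that key
    groups = []
    for p in sorted(packets, key=lambda q: q.ts):
        k = key_fn(p)
        idx = latest.get(k)
        if idx is None or p.ts - groups[idx]["last"] > timeout:
            groups.append({"key": k, "first": p.ts, "last": p.ts,
                           "packets": 0, "bytes": 0})
            idx = len(groups) - 1
            latest[k] = idx
        g = groups[idx]
        g["last"] = p.ts
        g["packets"] += 1
        g["bytes"] += p.size
    for g in groups:
        g["duration"] = g["last"] - g["first"]
    return groups
```

Calling `aggregate(packets, flow_key)` and `aggregate(packets, conversation_key)` on the same capture shows the effect: traffic between two hosts that hops across ports or switches between TCP and UDP produces many separate flows but a single conversation, whose summary statistics could then be passed to a classifier.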