Department of Computer Science and Information Systems

Permanent URI for this collection: http://localhost:4000/handle/123456789/1928

Transformers for vision: a survey on innovative methods for computer vision
(IEEE, 2025-05) Kumar, Dhruv; Chalapathi, G.S.S.
Transformers have emerged as a groundbreaking architecture in the field of computer vision, offering a compelling alternative to traditional convolutional neural networks (CNNs) by enabling the modeling of long-range dependencies and global context through self-attention mechanisms. Originally developed for natural language processing, transformers have now been successfully adapted for a wide range of vision tasks, leading to significant improvements in performance and generalization. This survey provides a comprehensive overview of the fundamental principles of transformer architectures, highlighting the core mechanisms, such as self-attention, multi-head attention, and positional encoding, that distinguish them from CNNs. We delve into the theoretical adaptations required to apply transformers to visual data, including image tokenization and the integration of positional embeddings. A detailed analysis of key transformer-based vision architectures such as ViT, DeiT, Swin Transformer, PVT, Twins, and CrossViT is presented, alongside their practical applications in image classification, object detection, video understanding, medical imaging, and cross-modal tasks. The paper further compares the performance of vision transformers with CNNs, examining their respective strengths, limitations, and the emergence of hybrid models. Finally, we discuss current challenges in deploying ViTs, such as computational cost, data efficiency, and interpretability, and explore recent advancements and future research directions, including efficient architectures, self-supervised learning, and multimodal integration.
Comparative study on deep neural network models for crop classification using time series PolSAR and optical data
(ISPRS, 2018-11) Phartiyal, Gopal Singh
Crop classification is an important task in many crop monitoring applications. Satellite remote sensing provides easy, reliable, and fast approaches to the crop classification task. In this study, the performance of various deep neural network (DNN) models for crop classification is compared using polarimetric synthetic aperture radar (PolSAR) and optical satellite data. Sentinel 1 dual pol SAR data is used as the PolSAR data, and Sentinel 2 multispectral data is used as the optical data. Five land cover classes, including two crop classes of the season, are considered. Time series data covering one crop cycle is used. Training and testing samples are measured and collected directly from the ground over the study region. Various convolutional neural network (CNN) and long short-term memory (LSTM) models are implemented, analysed, evaluated, and compared. Models are evaluated on the basis of classification accuracy and generalization performance.
A mixed spectral and spatial convolutional neural network for land cover classification using SAR and optical data
(EGU 2018, 2018) Phartiyal, Gopal Singh
Today, both SAR and optical data are available with good spatial and temporal resolutions. The two data modalities complement each other in many applications. There are numerous approaches to process the two data modalities, separately or combined. Domain or modality specific approaches such as polarimetric decomposition techniques or reflectance based techniques cannot work with the two datasets combined together. Data fusion approaches incur information loss during the process and are highly application specific. Machine learning (ML) approaches can operate on the combined dataset but have their own advantages and disadvantages. There is a need to explore new ML based approaches to achieve higher performance. Convolutional neural networks (CNNs) are young, trending, and promising ML tools in remote sensing applications. CNNs have the capability to learn complex features exclusively from data. Data from the two modalities can thus be brought together and processed with increased performance. In this paper an attempt is made to analyze CNN capabilities to perform land cover classification using multi-sensor data. SAR data used in this study is L band fully polarimetric PALSAR 2 data with 6 meter spatial resolution. Three basic polarimetric bands, namely, HH, HV, and VV, and four derived bands (polarization signatures) are used. Six multispectral Landsat 8 bands, pan sharpened and resampled at 6 meter spatial resolution, are used as optical data. All 13 features are stacked together and fed as input data to the proposed CNN. The areas selected for study are Haridwar and Roorkee regions of northern India. This study introduces a CNN where convolution is performed both spatially and spectrally. We show how this is an advantage over performing only spatial convolution. Five land cover classes namely, urban, bare soil, water, dense vegetation, and agriculture are considered. 
The CNN is trained on more than 1200 ground truth class data points measured directly on the terrain. The classification results show good generalization. Comparison with other classifiers such as SVMs shows that the proposed approach provides better classification results in terms of generalization, although the cross-validation accuracy is of the same order. The generalization of the classified image is evaluated using ground truth knowledge on selected subset areas in the study area.
Permuted spectral and permuted spectral-spatial CNN models for PolSAR-multispectral data based land cover classification
(Taylor & Francis, 2020-12) Phartiyal, Gopal Singh
It is a challenge to develop methods for remote sensing applications that can process the polarimetric synthetic aperture radar (PolSAR) and multispectral (MS) data modalities together without losing information from either. This paper presents a study that introduces novel deep learning-based remote sensing data processing frameworks, which use convolutional neural networks (CNNs) in both the spatial and spectral domains to perform land cover (LC) classification with PolSAR-MS data. Since remotely sensed earth observation data usually have a larger spectral depth than normal camera image data, exploiting the spectral information in remote sensing (RS) data is also crucial. In fact, convolutions in the sub-spectral space are intuitive and offer an alternative to the process of feature selection. Recently, researchers have had success in exploiting the spectral information of RS data with CNNs, especially for hyperspectral data. In this paper, exploitation of the spectral information in the PolSAR-MS data via a permuted localized spectral convolution along with a localized spatial convolution is proposed. Further, the study establishes the significance of performing permuted localized spectral convolutions over non-localized or localized spectral convolutions. Two models are proposed, namely a permuted local spectral convolutional network (Perm-LS-CNN) and a permuted local spectral-spatial convolutional network (Perm-LSS-CNN). These models are trained on ground truth class data points measured directly on the terrain. The generalization performance is evaluated using ground truth knowledge on selected well-known regions in the study areas. Comparison with other popular machine learning classifiers shows that the Perm-LSS-CNN model provides better classification results in terms of both accuracy and generalization.
An attention-based deep network for plant disease classification
(2024) Bera, Asish
Plant disease classification using machine learning in a real agricultural field environment is a difficult task. Often, an automated plant disease diagnosis method might fail to capture and interpret discriminatory information due to small variations among leaf sub-categories. Yet, modern Convolutional Neural Networks (CNNs) have achieved decent success in discriminating various plant diseases using leaf images. A few existing methods have applied additional pre-processing modules or sub-networks to tackle this challenge. Sometimes, the feature maps ignore partial information for holistic description by part-mining. A deep CNN that emphasizes integration of partial descriptiveness of leaf regions is proposed in this work. An attention mechanism is integrated with the high-level feature map of a base CNN to enhance feature representation. The proposed method focuses on important diseased areas in leaves, and employs an attention weighting scheme for utilizing useful neighborhood information. The proposed Attention-based network for Plant Disease Classification (APDC) method has achieved state-of-the-art performance on four public plant datasets containing visual/thermal images. The best top-1 accuracies attained by the proposed APDC are: PlantPathology 97.74%, PaddyCrop 99.62%, PaddyDoctor 99.65%, and PlantVillage 99.97%. These results justify the suitability of the proposed method.
POA-Net: dance poses and activity classification using convolutional neural networks
(IEEE, 2024) Bera, Asish
Dance poses represent complex human body-part movements, and express emotions and gestures. Dance pose classification is a challenging problem in computer vision. Convolutional Neural Networks (CNNs) have delivered significant performance improvements in recognizing dance poses from images and videos. Most of the dance datasets in existing works are video-based and are not publicly available. This work contributes an image dataset, called Dance-8, representing 8 new dance styles that blend Indian and international dance themes. These 8 unique dance styles are combined with the public Dance-12 dataset to improve posture diversity and dataset size. This extended dataset is called Dance-20. A custom CNN is developed for dance POses and Activity classification, named POA-Net. All three dance datasets have been evaluated using standard base CNNs and POA-Net. POA-Net has attained an accuracy of 73.27% on Dance-8, 82.10% on Dance-12, and 73.10% on Dance-20. These performances are better than those of standard backbones, such as VGG16 and Inception-V3. The best accuracies of 81.57%, 85.08%, and 76.73% have been achieved by MobileNet-v2 on the Dance-8, Dance-12, and Dance-20 datasets, respectively. Moreover, POA-Net has achieved the state-of-the-art accuracy of 99.74% on DIAT, which is a radar-based human action image dataset.
Fine-Grained Sports, Yoga, and Dance Postures Recognition: A Benchmark Analysis
(IEEE, 2023-07) Bera, Asish
Human body-pose estimation is a complex problem in computer vision. Recent research interest has widened to sports, yoga, and dance (SYD) postures for maintaining health conditions. Classifying SYD poses is regarded as a fine-grained image classification (FGIC) task due to the complex movement of body parts. Deep convolutional neural networks (CNNs) have attained significantly improved performance in solving various human body-pose estimation problems. Though decent progress has been achieved in yoga posture recognition using deep-learning techniques, fine-grained sports and dance recognition still needs ample research attention. However, no public benchmark image dataset with sufficient interclass and intraclass variations is available yet to address sports and dance posture classification. To address this limitation, we propose two image datasets, one for 102 sport categories and another for 12 dance styles. Two public datasets, Yoga-82, which contains 82 classes, and Yoga-107, which represents 107 classes, are collected for yoga postures. The proposed deep model, SYD-Net, which integrates a patch-based attention (PbA) mechanism on top of standard backbone CNNs, is evaluated on these four SYD datasets. The PbA module leverages the self-attention mechanism, which learns contextual information from a set of uniform and multiscale patches and emphasizes discriminative features to understand the semantic correlation among patches. Moreover, random erasing data augmentation is applied to improve performance. The proposed SYD-Net has achieved state-of-the-art accuracy on Yoga-82 using five base CNNs. SYD-Net’s accuracy on the other datasets is remarkable, implying its efficiency. Our Sports-102 and Dance-12 datasets are publicly available at https://sites.google.com/view/syd-net/home
