DSpace Repository

Transformers for vision: a survey on innovative methods for computer vision

dc.contributor.author Kumar, Dhruv
dc.contributor.author Chalapathi, G.S.S.
dc.date.accessioned 2025-08-14T10:34:46Z
dc.date.available 2025-08-14T10:34:46Z
dc.date.issued 2025-05
dc.identifier.uri https://ieeexplore.ieee.org/abstract/document/11007557/authors#authors
dc.identifier.uri http://dspace.bits-pilani.ac.in:8080/jspui/handle/123456789/19201
dc.description.abstract Transformers have emerged as a groundbreaking architecture in the field of computer vision, offering a compelling alternative to traditional convolutional neural networks (CNNs) by enabling the modeling of long-range dependencies and global context through self-attention mechanisms. Originally developed for natural language processing, transformers have now been successfully adapted for a wide range of vision tasks, leading to significant improvements in performance and generalization. This survey provides a comprehensive overview of the fundamental principles of transformer architectures, highlighting the core mechanisms such as self-attention, multi-head attention, and positional encoding that distinguish them from CNNs. We delve into the theoretical adaptations required to apply transformers to visual data, including image tokenization and the integration of positional embeddings. A detailed analysis of key transformer-based vision architectures such as ViT, DeiT, Swin Transformer, PVT, Twins, and CrossViT is presented, alongside their practical applications in image classification, object detection, video understanding, medical imaging, and cross-modal tasks. The paper further compares the performance of vision transformers with CNNs, examining their respective strengths, limitations, and the emergence of hybrid models. Finally, we discuss current challenges in deploying ViTs, such as computational cost, data efficiency, and interpretability, and explore recent advancements and future research directions, including efficient architectures, self-supervised learning, and multimodal integration. en_US
dc.language.iso en en_US
dc.publisher IEEE en_US
dc.subject Computer Science en_US
dc.subject EEE en_US
dc.subject Transformers en_US
dc.subject Computer architecture en_US
dc.subject Computer vision en_US
dc.subject Convolutional neural networks (CNNs) en_US
dc.title Transformers for vision: a survey on innovative methods for computer vision en_US
dc.type Article en_US
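
The abstract above refers to image tokenization, positional embeddings, and scaled dot-product self-attention as the core mechanisms behind vision transformers. The snippet below is a minimal illustrative sketch of those mechanisms in NumPy; it is not taken from the surveyed paper, and all shapes, parameter names, and initializations are assumptions chosen for brevity.

import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches  # (num_patches, patch_dim)

def self_attention(x, d_model=64):
    """Single-head scaled dot-product self-attention over token embeddings x."""
    n, d = x.shape
    Wq, Wk, Wv = (rng.normal(0, d ** -0.5, (d, d_model)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d_model)              # (n, n) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # (n, d_model) outputs

# Toy 224x224 RGB image -> 196 patch tokens, each projected to 64 dimensions,
# with additive positional embeddings so attention can exploit spatial order.
image = rng.random((224, 224, 3))
tokens = patchify(image)                             # (196, 768)
W_embed = rng.normal(0, 0.02, (tokens.shape[1], 64)) # patch projection
pos = rng.normal(0, 0.02, (tokens.shape[0], 64))     # positional embeddings
out = self_attention(tokens @ W_embed + pos)
print(out.shape)                                     # (196, 64)

Every output token here attends to all 196 patch tokens, which is the global-context property the abstract contrasts with the local receptive fields of CNNs.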


Files in this item

There are no files associated with this item.
