Visual Question Answering Analysis: Datasets, Methods, and Image Featurization Techniques

Sharma, Yashvardhan

dc.contributor.author	Sharma, Yashvardhan
dc.date.accessioned	2024-11-12T11:11:57Z
dc.date.available	2024-11-12T11:11:57Z
dc.date.issued	2023
dc.identifier.uri	https://www.scitepress.org/Papers/2023/116559/116559.pdf
dc.identifier.uri	http://dspace.bits-pilani.ac.in:8080/jspui/handle/123456789/16354
dc.description.abstract	Holistic scene understanding is a long-standing objective of Artificial Intelligence (AI). Multimodal tasks that combine capabilities from multiple domains, such as vision and language, within a single intelligent system are therefore an important next step for AI. Visual Question Answering (VQA) systems, which integrate Computer Vision and Natural Language Processing to answer natural language questions about an image, are one such task. Deep Learning techniques need to be explored that can improve such systems beyond the language biases of real-world priors, which presently hinder them from serving as a reliable touchstone for holistic scene understanding. Furthermore, the effectiveness of the Transformer architecture for the image featurization pipeline of VQA systems remains untested. Hence, an exhaustive study of the performance of various model architectures under varied training conditions on VQA datasets such as VizWiz and VQA v2 is needed to further this area of research. This study explores architectures that use image and question co-attention for VQA, along with several CNN architectures, including ResNet, VGG, EfficientNet, and DenseNet. The Vision Transformer architecture is also explored for image featurization, and several loss functions, such as cross-entropy, focal loss, and UniLoss, are employed for training the models. Finally, the trained model is deployed using Flask, with a GUI that lets users input an image and accompanying questions about it and receive an answer in response.	en_US
dc.language.iso	en	en_US
dc.publisher	ICPRAM	en_US
dc.subject	Computer Science	en_US
dc.subject	Computer Vision	en_US
dc.subject	Natural Language Processing (NLP)	en_US
dc.subject	Visual Question Answering (VQA)	en_US
dc.subject	Attention Mechanism	en_US
dc.subject	Convolutional Neural Networks	en_US
dc.title	Visual Question Answering Analysis: Datasets, Methods, and Image Featurization Techniques	en_US
dc.type	Article	en_US
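
The abstract names cross-entropy and focal loss among the training objectives. As a rough illustration of how the two differ, the sketch below computes both for a single prediction in plain Python. It uses the standard focal loss form, cross-entropy scaled by (1 - p_t)^gamma, and is not taken from the paper: the `gamma` and `alpha` defaults, the function names, and the example probabilities are all assumptions, since this record does not give the authors' settings.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct answer class."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Cross-entropy down-weighted by (1 - p_t)^gamma.

    gamma and alpha are illustrative defaults, not values from the paper.
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical softmax outputs over three candidate answers; class 0 is correct.
easy = [0.9, 0.05, 0.05]  # confident, correct prediction
hard = [0.2, 0.5, 0.3]    # correct answer receives low probability

for name, probs in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(probs, 0)
    fl = focal_loss(probs, 0)
    print(f"{name}: cross-entropy={ce:.4f}  focal={fl:.4f}")
```

With `gamma=0` and `alpha=1` the focal term reduces to ordinary cross-entropy. Larger `gamma` values shrink the contribution of well-classified examples more strongly, which is one way to reduce the influence of frequent, easily predicted answers during training.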

This item appears in the following Collection(s)

Department of Computer Science and Information Systems [1099]
