DSpace Repository

A multi-modal attentive framework that can interpret text (MMAT)


dc.contributor.author Sharma, Yashvardhan
dc.date.accessioned 2025-08-26T03:56:49Z
dc.date.available 2025-08-26T03:56:49Z
dc.date.issued 2025-07
dc.identifier.uri https://ieeexplore.ieee.org/abstract/document/11072709
dc.identifier.uri http://dspace.bits-pilani.ac.in:8080/jspui/handle/123456789/19232
dc.description.abstract Deep learning algorithms have demonstrated exceptional performance on various computer vision and natural language processing tasks. However, for machines to learn from information signals, they must understand and reason over the linguistic features present in images. Questions such as “What temperature is my oven set to?” require a model to visually recognize objects in the image and then spatially identify the text associated with them. Existing Visual Question Answering models fail to recognize linguistic features present in images, which is crucial for assisting the visually impaired. This paper addresses the task of a visual question answering system that can reason jointly over text, optical character recognition (OCR), and visual modalities. The proposed model focuses on the most relevant part of the image using an attention mechanism and, after pairwise attention, passes all features to a fusion encoder in which the model is biased toward the OCR-linguistic features. Instead of classification, the proposed model uses a dynamic pointer network for iterative answer prediction, with a focal loss function to overcome the class imbalance problem. The proposed model obtains an accuracy of 46.8% on the TextVQA dataset and an average of 55.21% on the ST-VQA dataset. The results indicate the effectiveness of the proposed approach and suggest a Multi-Modal Attentive Framework that can learn individual text, object, and OCR features and then predict answers based on the text in the image. en_US
dc.language.iso en en_US
dc.publisher IEEE en_US
dc.subject Computer Science en_US
dc.subject Visual question answering system (VQA) en_US
dc.subject Text visual question answering system (Text-VQA) en_US
dc.subject Optical character recognition (OCR) en_US
dc.subject Attention mechanism en_US
dc.subject Natural Language Processing (NLP) en_US
dc.title A multi-modal attentive framework that can interpret text (MMAT) en_US
dc.type Article en_US


Files in this item


There are no files associated with this item.

