A Comparative Analysis of Transformer-Based Models for Document Visual Question Answering

dc.contributor.author: Sharma, Yashvardhan
dc.date.accessioned: 2024-11-13T08:55:55Z
dc.date.available: 2024-11-13T08:55:55Z
dc.date.issued: 2023-06
dc.description.abstract: Visual question answering (VQA) is one of the most exciting problems at the intersection of computer vision and natural language processing. It requires understanding and reasoning about an image to answer a human query. Text Visual Question Answering (Text-VQA) and Document Visual Question Answering (DocVQA) are two subproblems of VQA, which require extracting text from scene images and document images, respectively. Because answering questions about documents requires an understanding of layout and writing patterns, models that perform well on the Text-VQA task perform poorly on the DocVQA task. Since transformer-based models achieve state-of-the-art results across deep learning fields, we train and fine-tune various transformer-based models (such as BERT, ALBERT, RoBERTa, ELECTRA, and DistilBERT) to examine their validation accuracy. This paper provides a detailed analysis of these transformer models and compares their accuracies on the DocVQA task. (en_US)
dc.identifier.uri: https://link.springer.com/chapter/10.1007/978-981-99-0609-3_16
dc.identifier.uri: https://dspace.bits-pilani.ac.in/handle/123456789/16357
dc.language.iso: en (en_US)
dc.publisher: Springer (en_US)
dc.subject: Computer Science (en_US)
dc.subject: Visual Question Answering (VQA) (en_US)
dc.subject: Text Visual Question Answering (Text-VQA) (en_US)
dc.subject: Document Visual Question Answering (DocVQA) (en_US)
dc.title: A Comparative Analysis of Transformer-Based Models for Document Visual Question Answering (en_US)
dc.type: Article (en_US)
