Abstract:
Hate speech detection is an important research area owing to the severe effects of hate speech on society. Hence, automated hate speech detection based on textual data has assumed a pivotal role among research groups. Moreover, the rapid growth of multimodal content on social media, such as hateful memes, creates the need for efficient machine learning models that can handle such content. In this work, we explore different fusion techniques and compare their performance on the multimodal hate speech identification task. In particular, we test new combinations of textual and visual models to improve performance on the MMHS150K dataset. We apply the corresponding preprocessing techniques to the text and images of tweets. We then use a pre-trained BERT model for textual feature extraction, and InceptionV3, Inception-ResNet, and ResNeXt to extract features from the images. We apply early fusion techniques such as concatenation and the product rule, and late fusion techniques, namely distribution summation, performance weighting, logarithmic opinion pooling, and rules learned from training on probabilities, to efficiently fuse the vision and text modalities. We also employ the SMOTE oversampling technique and random undersampling to deal with the class imbalance in the MMHS150K dataset. Our proposed model achieves an accuracy of 67.7%, which is comparable to the state of the art.
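
To make the fusion terminology concrete, the following minimal sketch illustrates early fusion by feature concatenation and late fusion by distribution summation. It is an illustration only: the function names, feature dimensions (768-d text, 2048-d image), and two-class setup are our own assumptions, not the paper's implementation.

```python
import numpy as np

def early_fusion_concat(text_feat, image_feat):
    # Early fusion: join the unimodal feature vectors into a single
    # representation that a downstream classifier would consume.
    return np.concatenate([text_feat, image_feat], axis=-1)

def late_fusion_distribution_summation(text_probs, image_probs):
    # Late fusion (distribution summation): add the per-class probability
    # distributions of the unimodal classifiers and renormalize.
    summed = text_probs + image_probs
    return summed / summed.sum(axis=-1, keepdims=True)

# Toy example with assumed dimensions: 768-d BERT text features,
# 2048-d CNN image features, and two classes (hate / not hate).
text_feat = np.random.rand(768)
image_feat = np.random.rand(2048)
fused = early_fusion_concat(text_feat, image_feat)   # shape (2816,)

text_probs = np.array([0.7, 0.3])
image_probs = np.array([0.4, 0.6])
print(late_fusion_distribution_summation(text_probs, image_probs))  # [0.55 0.45]
```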