Offensive Language Classification of Code-Mixed Tamil with Keras
Date
2021
Publisher
CEUR-WS
Abstract
This paper presents the method adopted for Task 1 of the Dravidian-CodeMix-HASOC (Hate
Speech and Offensive Content Identification in English and Indo-European Languages) Shared Task,
proposed by the Forum for Information Retrieval Evaluation in 2021, on offensive language detection.
A custom convolutional neural network (CNN) architecture was built in Keras for supervised learning
and trained on a dataset of YouTube comments written in code-mixed Tamil, in both Roman and Tamil
scripts. The 5-layer network was built using only Keras and required only simple tokenized data,
padded to an appropriate length. Neither recurrent neural networks nor transfer learning was used,
and the CNN model achieved an F-score of 0.835.
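
The pipeline the abstract describes (tokenize, pad to a fixed length, feed a small Keras CNN) might look roughly like the sketch below. It is illustrative only and not the authors' implementation: the whitespace tokenizer, the vocabulary scheme, the helper names (`tokenize`, `build_vocab`, `encode`, `build_cnn`), and every layer choice and size are assumptions, since the abstract states only that the model is a 5-layer CNN built with Keras.

```python
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved ids: padding and out-of-vocabulary tokens


def tokenize(text):
    # Assumed tokenizer: lowercase, then split on whitespace.
    # Tamil script has no letter case, so lower() only affects Roman text.
    return text.lower().split()


def build_vocab(texts, max_words=20000):
    # Map the most frequent tokens to integer ids, starting at 2.
    counts = Counter(tok for t in texts for tok in tokenize(t))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_words))}


def encode(text, vocab, max_len):
    # Convert a comment to ids, truncate to max_len, then right-pad with PAD.
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def build_cnn(vocab_size, max_len):
    # Hypothetical 5-layer CNN; layer types and sizes are guesses, not the paper's.
    # TensorFlow is imported here so the preprocessing helpers run without it.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(max_len,)),
        layers.Embedding(input_dim=vocab_size, output_dim=64),
        layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # offensive vs. not offensive
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Under these assumptions, each comment would be passed through `encode` and the resulting rows stacked into an integer array for `model.fit`, with `vocab_size` set to `len(vocab) + 2` to account for the two reserved ids.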
Keywords
Computer Science, Offensive language detection, Code-Mixed text, Tamil, HASOC